I am currently working on postgres and below is the question that I have.
We have a customer ID and the date when the person visited a property. Based on this I need to calculate the number of trips. Consecutive dates are considered as one trip. Eg: If a person visits on first date the trip no is first, post that he visits consecutively for three days that will counted as trip two.
Below is the input
ID Date
1 1-Jan
1 2-Jan
1 5-Jan
1 1-Jul
2 1-Jan
2 2-Feb
2 5-Feb
2 6-Feb
2 7-Feb
2 12-Feb
Expected output
ID Date Trip no
1 1-Jan 1
1 2-Jan 1
1 5-Jan 2
1 1-Jul 3
2 1-Jan 1
2 2-Feb 2
2 5-Feb 3
2 6-Feb 3
2 7-Feb 3
2 12-Feb 4
I am able to implement successfully using loop but its running very slow given the volume of the data.
Can you please suggest a workaround where we can not use loop.
Subtract a sequence from the dates -- these will be constant for a particular trip. Then you can use dense_rank() for the numbering:
select t.*,
dense_rank() over (partition by id order by grp) as trip_num
from (select t.*,
(date - row_number() over (partition by id order by date) * interval '1 day'
) as grp
from t
) t;
Related
My table of interview candidates has three columns and looks like this (attempt is what I want to calculate):
candidate_id
interview_stage
stage_reached_at
attempt <- want to calculate
1
1
2019-01-01
1
1
2
2019-01-02
1
1
3
2019-01-03
1
1
1
2019-11-01
2
1
2
2019-11-02
2
1
1
2021-01-01
3
1
2
2021-01-02
3
1
3
2021-01-03
3
1
4
2021-01-04
3
The table represents candidate_id 1 who has had 3 separate interview attempts at a company.
Made it to interview_stage 3 on the 1st attempt
Made it to interview_stage 2 on the 2nd attempt
Made it to interview_stage 4 on the 3d attempt
Question: Can I somehow use the number series if I order by stage_reached_at? As soon as the next step for a particular candidate_id is lower than the row before, I know it's a new process.
I want to be able to group on candidate_id and process_grouping at the end of the day.
Thx in advance.
You can use lag() and then a cumulative sum:
select t.*,
sum(case when prev_interview_stage >= interview_stage then 1 else 0 end) over (partition by candidate_id order by stage_reached_at) as attempt
from (select t.*,
lag(interview_stage) over (partition by candidate_id order by stage_reached_at) as prev_interview_stage
from t
) t;
Note: Your question specifically says "lower". I wonder, though, if you really mean "lower or equal to". If the latter, change the >= to >.
Using example below, Day 1 will have 1,3,3 distinct name(s) for A,B,C respectively.
When calculating distinct name(s) for each house on Day 2, data up to Day 2 is used.
When calculating distinct name(s) for each house on Day 3, data up to Day 3 is used.
Can recursive cte be used?
Data:
Day
House
Name
1
A
Jack
1
B
Pop
1
C
Anna
1
C
Dew
1
C
Franco
2
A
Jon
2
B
May
2
C
Anna
3
A
Jon
3
B
Ken
3
C
Dew
3
C
Dew
Result:
Day
House
Distinct names
1
A
1
1
B
1
1
C
3
2
A
2 (jack and jon)
2
B
2
2
C
3
3
A
2 (jack and jon)
3
B
3
3
C
3
Without knowing the need and size of data it'll be hard to give an ideal/optimal solution. Assuming a small dataset needing a quick and dirty way to calculate, just use sub query like this...
SELECT p.[Day]
, p.House
, (SELECT COUNT(DISTINCT([Name]))
FROM #Bing
WHERE [Day]<= p.[Day] AND House = p.House) DistinctNames
FROM #Bing p
GROUP BY [Day], House
ORDER BY 1
There is no need for a recursive CTE. Just mark the first time a name is seen in a house and use a cumulative sum:
select day, house,
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (partition by house order by day) as num_unique_names
from (select t.*,
row_number() over (partition by house, name order by day) as seqnum
from t
) t
group by day, house
I have a dataset as shown below, wondering how I can do a rolling average with its current record followed by next two records. Example: lets consider the first record whose total is 3 followed by 4 and 7 ,Now the rolling 3 day average for first record would be 4.6 and so on.
Date Total
1 3
2 4
3 7
4 1
5 2
6 4
Expected output:
Date Total 3day_rolling_Avg
1 3 4.6
2 4 4
3 7 3.3
4 1 2.3
5 2 null
6 4 null
PS: Having "null" value isn't important. This is just a sample data where I need to look at more than 3 days(Ex: 30 days rolling)
I think that the simplest approach is a window avg(), with the poper window frame:
select
t.*,
avg(total)
over(order by date rows between current row and 2 following) as "3d_rolling_avg"
from mytable t
If you want to return a null value when there is less than 2 leading rows, as show in your expected results, then you can use row_number() on top of it:
select
t.*,
case when rank() over(order by date desc) <= 2
then avg(total)
over(order by date rows between current row and 2 following)
end as "3d_rolling_avg"
from mytable t
As I am preparing my data for predicting no-shows at a hospital, I ran into the following problem: In the query below I tried to get the number of shows/no-shows relatively shown to the number of appointments (APPTS). INDICATION_NO_SHOW means whether a patient showed up at a appointment. 0 means show, and 1 means no-show.
with t1 as
(
select
PAT_ID
,APPT_TIME
,APPT_ID
,ROW_NUMBER () over(PARTITION BY PAT_ID order by pat_id,APPT_TIME) as [TOTAL_APPTS]
,INDICATION_NO_SHOW
from appointments
)
,
t2 as
(
t1.PAT_ID
,t1.APPT_TIME
,INDICATION_NO_SHOW
,sum(INDICATION_NO_SHOW) over(order by PAT_ID, APPT_TIME ) as TOTAL_NO_SHOWS
,TOTAL_APPT
from t1
)
SELECT *
,(TOTAL_APPT- TOTAL_NO_SHOWS) AS TOTAL_SHOWS
FROM T2
order by PAT_ID, APPT_TIME
This resulted into the following dataset:
PAT ID APPT_TIME INDICATION_NO_SHOW TOTAL_SHOWS TOTAL_NO_SHOWS TOTAL_APPTS
1 1-1-2001 0 1 0 1
1 1-2-2001 0 2 0 2
1 1-3-2001 1 2 1 3
1 1-4-2001 0 3 1 4
2 1-1-2001 0 0 1 1
2 2-1-2001 0 1 1 2
2 2-2-2001 1 1 2 3
2 2-3-2001 0 2 2 4
As you can see my query only worked for patient 1, and then it also counts the no-shows for patient 1 for patient 2. So individually it worked for 1 patient, but not over the whole dataset.
The TOTAL_APPTs column worked out, because it counted the number of appts the patient had at the moment of that given appt. My question is: How do I succesfully get these shows and no-shows succesfully added up (as I did for patient 1)? I'm completely aware why this query doesn't work, I'm just completely in the blue on how to fix it..
I think that you can just use window functions. You seem to be looking for window sums of shows and no shows per patient, so:
select
pat_id,
appt_time,
indication_no_show,
sum(1 - indication_no_show)
over(partition by pat_id order by appt_time) total_shows,
sum(indication_no_show)
over(partition by pat_id order by appt_time) total_no_shows
from appointments
What I would like to do is find the number of consecutive weeks that someone is active on Sundays and assign them a value. They have to participate in at least 2 races a day to be counted as active for the week.
If they are active for 2 consecutive weeks I would like to assign a value of 100, 3 consecutive weeks a value of 200, 4 consecutive weeks a value of 300, and continuing up to 9 consecutive weeks.
My difficulty is not determining consecutive weeks, but breaks in between consecutive dates. Suppose the following dataset:
CustomerID RaceDate Races
1 2/2/2014 2
1 2/9/2014 5
1 2/16/2014 3
1 2/23/2014 3
1 3/2/2014 4
1 3/9/2014 3
1 3/16/2014 3
2 2/2/2014 2
2 2/9/2014 3
2 3/2/2014 2
2 3/9/2014 4
2 3/16/2014 3
CustomerID 1 would have 7 consecutive weeks for a value of 600.
The hard part for me is CustomerID 2. They would have 2 consecutive weeks AND 3 consecutive weeks. So their total value would be 100 + 200 = 300.
I would like to be able to do this with any different combination of consecutive weeks.
Any help please?
EDIT: I am using SQL Server 2008 R2.
When looking for sequential values, there is a simple observation that helps. If you subtract a sequence from the dates then the value is a constant. You can use this as a grouping mechanism
select CustomerId, min(RaceDate) as seqStart, max(RaceDate) as seqEnd,
count(*) as NumDaysRaced
from (select t.*,
dateadd(week, - row_number() over (partition by customerID, RaceDate),
RaceDate) as grp
from table t
where races >= 2
) t
group by CustomerId, grp;
You can then use this to get your final "points":
select CustomerId,
sum(case when NumDaysRaced > 1 then (NumDaysRaced - 1) * 100 else 0 end) as Points
from (select CustomerId, min(RaceDate) as seqStart, max(RaceDate) as seqEnd,
count(*) as NumDaysRaced
from (select t.*,
dateadd(week, - row_number() over (partition by customerID, RaceDate),
RaceDate) as grp
from table t
where races >= 2
) t
group by CustomerId, grp
) c
group by CustomerId;