Find first and last trip with same source-destination by day - sql

I have a table with the source, destination, and time of each trip. I want to list all sources whose first and last trip of the day had the same destination. The table looks like this:
Source Dest Trip_Time
1 2 2/1/2019 6:00
2 3 2/1/2019 7:00
4 2 2/1/2019 7:00
1 3 2/1/2019 8:00
2 1 2/1/2019 9:00
3 1 2/1/2019 9:00
4 1 2/1/2019 9:00
1 4 2/1/2019 15:00
2 1 2/1/2019 17:30
3 5 2/1/2019 17:30
4 5 2/1/2019 17:30
2 3 2/1/2019 19:45
3 1 2/1/2019 19:45
5 2 2/1/2019 19:45
1 4 2/2/2019 17:00
1 3 2/2/2019 21:00
I have figured out a query to get what I wanted, but I was wondering if there is more optimal way of achieving the result, especially the one that'll work with millions of rows.
select * from
(select source, max(first_trip) ft, max(last_trip) lt from
(select source, case when (a.min) = 1 then (dest) end as first_trip,
case when (a.max) = 1 then (dest) end as last_trip from (select source, dest, trip_time,
Row_Number() Over (partition by source order by trip_time desc) as max,
Row_Number() Over (partition by source order by trip_time asc) as min from trips) a
where a.max = 1 or a.min = 1) b
group by b.source) c where ft = lt
Expected result:
caller fc lc
2 3 3
3 1 1
5 2 2

One method is to use first_value() and last_value(). Date functions are notoriously dependent on the database, but you need to extract the date from the trip_time. The following illustrates the idea but the date function could differ on your database:
select distinct source, cast(trip_time as date)
from (select t.*,
first_value(dest) over (partition by source, cast(trip_time as date) order by trip_time asc) as first_dest,
first_value(dest) over (partition by source, cast(trip_time as date) order by trip_time desc) as last_dest
from trips t
) t
where first_dest = last_dest;
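To try the first_value() idea end to end, here is a minimal sketch that runs the same query against the sample data using Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer (for window functions), ISO-formatted timestamps instead of the 2/1/2019 style shown in the question, and SQLite's date() in place of cast(... as date); the variable names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (source INTEGER, dest INTEGER, trip_time TEXT)")

# Sample data from the question, rewritten as ISO timestamps (an assumption
# made so that text ordering matches chronological ordering in SQLite).
trips = [
    (1, 2, "2019-02-01 06:00"), (2, 3, "2019-02-01 07:00"),
    (4, 2, "2019-02-01 07:00"), (1, 3, "2019-02-01 08:00"),
    (2, 1, "2019-02-01 09:00"), (3, 1, "2019-02-01 09:00"),
    (4, 1, "2019-02-01 09:00"), (1, 4, "2019-02-01 15:00"),
    (2, 1, "2019-02-01 17:30"), (3, 5, "2019-02-01 17:30"),
    (4, 5, "2019-02-01 17:30"), (2, 3, "2019-02-01 19:45"),
    (3, 1, "2019-02-01 19:45"), (5, 2, "2019-02-01 19:45"),
    (1, 4, "2019-02-02 17:00"), (1, 3, "2019-02-02 21:00"),
]
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", trips)

# Same shape as the answer's query; date(trip_time) is the SQLite spelling
# of "extract the day from the timestamp".
query = """
SELECT DISTINCT source, date(trip_time) AS trip_day
FROM (SELECT t.*,
             first_value(dest) OVER (PARTITION BY source, date(trip_time)
                                     ORDER BY trip_time ASC)  AS first_dest,
             first_value(dest) OVER (PARTITION BY source, date(trip_time)
                                     ORDER BY trip_time DESC) AS last_dest
      FROM trips t) t
WHERE first_dest = last_dest
ORDER BY source, trip_day
"""
same_dest_days = conn.execute(query).fetchall()
for source, trip_day in same_dest_days:
    print(source, trip_day)
```

On this data the query returns sources 2, 3, and 5 for 2019-02-01, which lines up with the expected result in the question.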

Related

How do you get the last entry for each month in SQL?

I am looking to filter very large tables to the latest entry per user per month, and I'm not sure I have found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but part of me does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in dbt views, which means it will get run all the time.
To illustrate, my data is of this form:
mytable
userId  loginDate   year  month  value
1       2021-01-04  2021  1      41.1
1       2021-01-06  2021  1      411.1
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-06  2021  2      32
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a groupby & a join as follows:
WITH latest_entry_by_month AS (
SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
FROM mytable
GROUP BY "userId", "year", "month"
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId  loginDate   year  month  value
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY and windowed function(ROW_NUMBER):
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER(PARTITION BY userId, year, month
ORDER BY loginDate DESC) = 1
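For readers without a Snowflake account, the following is a rough sketch of the same ROW_NUMBER idea run through Python's sqlite3 module. SQLite has no QUALIFY clause, so the filter is moved into an outer WHERE on a subquery; that substitution, the rn alias, and the in-memory table are assumptions of this sketch, not part of the Snowflake answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (userId INTEGER, loginDate TEXT, "
    "year INTEGER, month INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2021-01-04", 2021, 1, 41.1),
        (1, "2021-01-06", 2021, 1, 411.1),
        (1, "2021-01-25", 2021, 1, 251.1),
        (2, "2021-01-05", 2021, 1, 4369),
        (2, "2021-02-06", 2021, 2, 32),
        (2, "2021-02-14", 2021, 2, 731),
        (3, "2021-01-20", 2021, 1, 258),
        (3, "2021-02-19", 2021, 2, 4251),
        (3, "2021-03-15", 2021, 3, 171),
    ],
)

# Number the rows of each (user, year, month) group from newest to oldest,
# then keep only the newest one. This mirrors QUALIFY ... = 1.
query = """
SELECT userId, loginDate, value
FROM (SELECT m.*,
             ROW_NUMBER() OVER (PARTITION BY userId, year, month
                                ORDER BY loginDate DESC) AS rn
      FROM mytable m) ranked
WHERE rn = 1
ORDER BY userId, loginDate
"""
latest_rows = conn.execute(query).fetchall()
for row in latest_rows:
    print(row)
```

The single pass over the table avoids the three-column join from the question; whether it is actually faster on Snowflake would need to be checked against the query profile.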

Calculate Churn by aggregating by date range in SQL

I am trying to calculate the churn rate from data that has customer_id, group, and date. The aggregation is by id, group, and date. The churn formula is (customers in previous cohort - customers in last cohort) / customers in previous cohort, where:
customers in previous cohort refers to the cohort from the 28-day period before the last 28 days
customers in last cohort refers to the cohort from the last 28 days
I am not sure how to aggregate them by date range to calculate the churn.
Here is sample data that I copied from SQL Group by Date Range:
Date Group Customer_id
2014-03-01 A 1
2014-04-02 A 2
2014-04-03 A 3
2014-05-04 A 3
2014-05-05 A 6
2015-08-06 A 1
2015-08-07 A 2
2014-08-29 XXXX 2
2014-08-09 XXXX 3
2014-08-10 BB 4
2014-08-11 CCC 3
2015-08-12 CCC 2
2015-03-13 CCC 3
2014-04-14 CCC 5
2014-04-19 CCC 4
2014-08-16 CCC 5
2014-08-17 CCC 3
2014-08-18 XXXX 2
2015-01-10 XXXX 3
2015-01-20 XXXX 4
2014-08-21 XXXX 5
2014-08-22 XXXX 2
2014-01-23 XXXX 3
2014-08-24 XXXX 2
2014-02-25 XXXX 3
2014-08-26 XXXX 2
2014-06-27 XXXX 4
2014-08-28 XXXX 1
2014-08-29 XXXX 1
2015-08-30 XXXX 2
2015-09-31 XXXX 3
The goal is to calculate the churn rate every 28 days in between 2014 and 2015 by the formula given above. So, it is going to be aggregating the data by rolling it by 28 days and calculating the churn by the formula.
Here is what I tried to aggregate the data by date range:
SELECT COUNT(distinct customer_id) AS count_ids, Group,
DATE_SUB(CAST(Date AS DATE), INTERVAL 56 DAY) AS Date_min,
DATE_SUB(CURRENT_DATE, INTERVAL 28 DAY) AS Date_max
FROM churn_agg
GROUP BY count_ids, Group, Date_min, Date_max
I hope someone can help me with the aggregation and churn calculation. I simply want to subtract each aggregated count_ids from the one for the previous 28-day period, so this is a successive difference over the same column (count_ids). I am not sure whether I need a rolling window or a simple aggregation to find the churn.
As @jarlh pointed out, the date should be 2015-09-30, not 2015-09-31.
You can use this to create a 28-day calendar:
create table daysby28 (i int, _Date date);
insert into daysby28 (i, _Date)
SELECT i, cast('01-01-2014'as date) + i*INTERVAL '28 day'
from generate_series(0,50) i
order by 1;
After creating the churn_agg table that @jarlh provided in the fiddle, this query gives you what you want:
with cte as
(
select count(Customer) as TotalCustomer, Cohort, CohortDateStart From
(
select distinct a.Customer_id as Customer, b.i as Cohort, b._Date as CohortDateStart
from churn_agg a left join daysby28 b on a._Date >= b._Date and a._Date < b._Date + INTERVAL '28 day'
) a
group by Cohort, CohortDateStart
)
select a.CohortDateStart,
1.0*(b.TotalCustomer - a.TotalCustomer)/(1.0*b.TotalCustomer) as Churn from cte a
left join cte b on a.cohort > b.cohort
and not exists(select 1 from cte c where c.cohort > b.cohort and c.cohort < a.cohort)
order by 1
The fiddle of all together is here
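As a rough illustration of the 28-day cohort idea, the sketch below buckets dates arithmetically instead of joining to a calendar table, and uses LAG instead of the self-join to reach the previous cohort. Both are substitutions made for brevity, not the answer's exact method. The data is a small made-up sample, the column name d and the start date 2014-01-01 are assumptions, and it runs on SQLite via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE churn_agg (d TEXT, customer_id INTEGER)")

# Hypothetical sample: 4 distinct customers in the first 28 days,
# 2 in the next 28 days, 1 in the 28 days after that.
conn.executemany(
    "INSERT INTO churn_agg VALUES (?, ?)",
    [
        ("2014-01-02", 1), ("2014-01-10", 2), ("2014-01-15", 3),
        ("2014-01-20", 1), ("2014-01-28", 4),   # cohort 0
        ("2014-01-29", 1), ("2014-02-10", 2),   # cohort 1
        ("2014-02-26", 1),                      # cohort 2
    ],
)

query = """
WITH cohorts AS (
    -- Whole days since the start date, integer-divided into 28-day buckets.
    SELECT CAST((julianday(d) - julianday('2014-01-01')) / 28 AS INTEGER) AS cohort,
           COUNT(DISTINCT customer_id) AS total
    FROM churn_agg
    GROUP BY cohort
)
SELECT cohort,
       total,
       -- (previous - current) / previous; NULL for the first cohort.
       1.0 * (LAG(total) OVER (ORDER BY cohort) - total)
           / LAG(total) OVER (ORDER BY cohort) AS churn
FROM cohorts
ORDER BY cohort
"""
churn_rows = conn.execute(query).fetchall()
for cohort, total, churn in churn_rows:
    print(cohort, total, churn)
```

Like the self-join with NOT EXISTS, LAG compares each cohort with the nearest earlier cohort that has data, so an empty 28-day period is skipped rather than counted as zero customers.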

Restart Row Number Based on Date and Increments of N

I have a table with the following data that I generated with a date table
date day_num (DAY_NUM % 7)
2019-07-09 0 0
2019-07-10 1 1
2019-07-11 2 2
2019-07-12 3 3
2019-07-13 4 4
2019-07-14 5 5
2019-07-15 6 6
2019-07-16 7 0
I basically want to get a week number that restarts at 0 and I need help figuring out the last part
The final output would look like this
date day_num (DAY_NUM % 7) week num
2019-07-09 0 0 1
2019-07-10 1 1 1
2019-07-11 2 2 1
2019-07-12 3 3 1
2019-07-13 4 4 1
2019-07-14 5 5 1
2019-07-15 6 6 1
2019-07-16 7 0 2
This is the sql I have so far
select
SUB.*,
DAY_NUM%7
FROM(
SELECT
DISTINCT
id_date,
row_number() over(order by id_date) -1 as day_num
FROM schema.date_tbl
WHERE Id_date BETWEEN "2019-07-09" AND date_add("2019-07-09",146)
) SUB
Building on your query:
select SUB.*, DAY_NUM%7,
DENSE_RANK() OVER (ORDER BY FLOOR(DAY_NUM / 7)) as weeknum
FROM (SELECT DISTINCT id_date,
row_number() over(order by id_date) -1 as day_num
FROM schema.date_tbl
WHERE Id_date BETWEEN "2019-07-09" AND date_add("2019-07-09", 146)
) SUB
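The week-number logic can be checked on a small date range with the sketch below, which uses Python's sqlite3 module. It assumes a plain date_tbl table (no schema prefix), relies on SQLite's integer division so FLOOR is not needed, and covers only 15 days rather than the full 146; the column aliases are illustrative.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE date_tbl (id_date TEXT)")

# 15 consecutive days starting 2019-07-09, enough to span three "weeks".
start = date(2019, 7, 9)
conn.executemany(
    "INSERT INTO date_tbl VALUES (?)",
    [((start + timedelta(days=i)).isoformat(),) for i in range(15)],
)

query = """
SELECT id_date,
       day_num,
       day_num % 7 AS day_in_week,
       -- day_num / 7 is integer division in SQLite: 0 for days 0-6, 1 for 7-13, ...
       DENSE_RANK() OVER (ORDER BY day_num / 7) AS week_num
FROM (SELECT id_date,
             ROW_NUMBER() OVER (ORDER BY id_date) - 1 AS day_num
      FROM date_tbl) sub
ORDER BY id_date
"""
week_rows = conn.execute(query).fetchall()
for row in week_rows:
    print(row)
```

Because day_num has no gaps here, day_num / 7 + 1 would give the same numbers without a second window function; DENSE_RANK only matters if some 7-day blocks are missing entirely.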

Creating a new calculated column in SQL

Is there a way to get the summary shown below? June 24 appears twice, so that day has 2 UDs; every other date appears only once.
I am showing the expected output here:
Primary key UD Date
-------------------------------------------
1 123 2015-06-24 00:00:00.000
6 456 2015-06-24 00:00:00.000
2 123 2015-06-25 00:00:00.000
3 658 2015-06-26 00:00:00.000
4 598 2015-06-27 00:00:00.000
5 156 2015-06-28 00:00:00.000
No of times Number of days
-----------------------------
4 1
2 2
The logic is that there are 4 users who used the application on 1 day and 2 users who used the application on 2 days.
You can use two levels of aggregation:
select cnt, count(*)
from (select date, count(*) as cnt
from t
group by date
) d
group by cnt
order by cnt desc;
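To see what the two levels of aggregation produce on the sample rows, here is a small sketch using Python's sqlite3 module. The table name t and the column names are assumptions based on the answer's query and the sample output; the inner query counts rows per date, and the outer query counts how many dates share each count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER, ud INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, 123, "2015-06-24"),
        (6, 456, "2015-06-24"),
        (2, 123, "2015-06-25"),
        (3, 658, "2015-06-26"),
        (4, 598, "2015-06-27"),
        (5, 156, "2015-06-28"),
    ],
)

query = """
SELECT cnt, COUNT(*) AS num_dates
FROM (SELECT date, COUNT(*) AS cnt   -- rows per date
      FROM t
      GROUP BY date) d
GROUP BY cnt                          -- dates per row count
ORDER BY cnt DESC
"""
summary = conn.execute(query).fetchall()
for cnt, num_dates in summary:
    print(cnt, num_dates)
```

On this data one date has 2 rows and four dates have 1 row each, so the column order and labels may need adjusting to match the layout wanted in the question.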

Creating a timetable with SQL (calculated start times for slots) and filtering by a person to show them their slots

I'm working in iMIS CMS (iMIS 200) and trying to create an IQA (an iMIS query, using SQL) that will give me a timetable of slots assigned to people per day (I've got this working); but then I want to be able to filter that timetable on a person's profile so they just see the slots they are assigned to.
(This is for auditions for an orchestra. So people make an application per instrument, then those applications are assigned to audition slots, of which there are several slots per day)
As the start/end times for slots are calculated using SUM OVER, when I filter this query by the person ID, I lose the correct start/end times for slots (as the other slots aren't in the data for it to SUM, I guess!)
Table structure:
tblContacts
===========
ContactID ContactName
---------------------------
1 Steve Jones
2 Clare Philips
3 Bob Smith
4 Helen Winters
5 Graham North
6 Sarah Stuart
tblApplications
===============
AppID FKContactID Instrument
-----------------------------------
1 1 Violin
2 1 Viola
3 2 Cello
4 3 Cello
5 4 Trumpet
6 5 Clarinet
7 5 Horn
8 6 Trumpet
tblAuditionDays
===============
AudDayID AudDayDate AudDayVenue AudDayStart
-------------------------------------------------
1 16-Sep-19 London 10:00
2 17-Sep-19 Manchester 10:00
3 18-Sep-19 Birmingham 13:30
4 19-Sep-19 Leeds 10:00
5 19-Sep-19 Glasgow 11:30
tblAuditionSlots
================
SlotID FKAudDayID SlotOrder SlotType SlotDuration FKAppID
-----------------------------------------------------------------
1 1 1 Audition 20 3
2 1 2 Audition 20 4
3 1 3 Chat 10 3
4 1 5 Chat 10 4
5 1 4 Audition 20
6 2 1 Audition 20 1
7 2 2 Audition 20 6
8 2 4 Chat 10 6
9 2 3 Chat 10 1
10 2 5 Audition 20
11 3 2 Chat 10 8
12 3 1 Audition 20 2
13 3 4 Chat 5 2
14 3 3 Audition 20 8
15 5 1 Audition 30 5
16 5 2 Audition 30 7
17 5 3 Chat 15 7
18 5 4 Chat 15 5
Current SQL for listing all the slots each day (in date/slot order, with the slot timings calculated correctly) is:
SELECT
[tblAuditionSlots].[SlotOrder] as [Order],
CASE
WHEN
SUM([tblAuditionSlots].[SlotDuration]) OVER (PARTITION BY [tblAuditionDays].[AudDayID] ORDER BY [tblAuditionSlots].[SlotOrder] ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) is null
THEN
CONVERT(VARCHAR(5), [tblAuditionDays].[AudDayStart], 108)
ELSE
CONVERT(VARCHAR(5), Dateadd(minute, SUM([tblAuditionSlots].[SlotDuration]) OVER (PARTITION BY [tblAuditionDays].[AudDayID] ORDER BY [tblAuditionSlots].[SlotOrder] ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), [tblAuditionDays].[AudDayStart]), 108)
END
+ ' - ' +
CASE
WHEN
SUM([tblAuditionSlots].[SlotDuration]) OVER (PARTITION BY [tblAuditionDays].[AudDayID] ORDER BY [tblAuditionSlots].[SlotOrder] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) is null
THEN
CONVERT(VARCHAR(5), [tblAuditionDays].[AudDayStart], 108)
ELSE
CONVERT(VARCHAR(5), Dateadd(minute, SUM([tblAuditionSlots].[SlotDuration]) OVER (PARTITION BY [tblAuditionDays].[AudDayID] ORDER BY [tblAuditionSlots].[SlotOrder] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), [tblAuditionDays].[AudDayStart]), 108)
END AS [Slot],
[tblAuditionSlots].[SlotType] AS [Type],
[tblContacts].[ContactName] as [Name]
FROM
tblAuditionSlots
LEFT JOIN tblAuditionDays ON tblAuditionSlots.FKAudDayID = tblAuditionDays.AudDayID
LEFT JOIN tblApplications ON tblAuditionSlots.FKAppID = tblApplications.AppID
LEFT JOIN tblContacts ON tblApplications.FKContactID = tblContacts.ContactID
GROUP BY
[tblAuditionSlots].[SlotOrder],
[tblAuditionSlots].[SlotType],
[tblAuditionSlots].[SlotDuration],
[tblAuditionDays].[AudDayStart],
[tblContacts].[ContactName],
[tblContacts].[ContactID],
[tblAuditionDays].[AudDayID],
[tblAuditionDays].[AudDayDate]
ORDER BY
[tblAuditionDays].[AudDayDate],
[tblAuditionSlots].[SlotOrder]
iMIS, the CMS we're using, is limited by what you can create in an IQA (query).
You can basically insert (some) SQL as a column and give it an alias; you can add (non-calculated) fields to the order by; you can't really control the Group By (whatever fields are added are included in the Group By).
Ultimately, I'd like to be able to filter this by a Contact ID so I can see all their audition slots, but with the times correctly calculated.
From the sample data, for example:
STEVE JONES AUDITIONS
=====================
Date Slot Venue Type Instrument
----------------------------------------------------------------
17-Sep-19 10:00 - 10:20 Manchester Audition Violin
17-Sep-19 10:40 - 10:50 Manchester Chat Violin
18-Sep-19 13:30 - 13:50 Birmingham Audition Viola
18-Sep-19 14:20 - 14:25 Birmingham Chat Viola
HELEN WINTERS AUDITIONS
=======================
Date Slot Venue Type Instrument
----------------------------------------------------------------
19-Sep-19 11:30 - 12:00 Glasgow Audition Trumpet
19-Sep-19 12:45 - 13:00 Glasgow Chat Trumpet
Hopefully that all makes sense and I've provided enough information.
(In this version of iMIS [200], you can't do subqueries, in case that comes up...)
Thanks so much in advance for whatever help/tips/advice you can offer!
Chris
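The underlying issue, that the running SUM has to see every slot of the day before the rows are filtered to one person, can be sketched outside iMIS as follows. This uses Python's sqlite3 module with a hypothetical flattened slots table (one row per slot, contact name already joined in) holding only the Manchester day from the sample data. It relies on a CTE, which the question says iMIS 200 does not allow, so it only illustrates the ordering of the two steps and is not an IQA-ready solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE slots (
    day_id INTEGER, day_start TEXT, slot_order INTEGER,
    slot_type TEXT, duration INTEGER, contact TEXT
);
-- Manchester day from the sample data, flattened (hypothetical layout).
INSERT INTO slots VALUES
    (2, '10:00', 1, 'Audition', 20, 'Steve Jones'),
    (2, '10:00', 2, 'Audition', 20, 'Graham North'),
    (2, '10:00', 3, 'Chat',     10, 'Steve Jones'),
    (2, '10:00', 4, 'Chat',     10, 'Graham North'),
    (2, '10:00', 5, 'Audition', 20, NULL);
""")

query = """
WITH timed AS (
    -- Step 1: running totals over ALL slots of the day, before any filtering.
    SELECT contact, slot_type, day_start,
           COALESCE(SUM(duration) OVER (PARTITION BY day_id ORDER BY slot_order
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS mins_before,
           SUM(duration) OVER (PARTITION BY day_id ORDER BY slot_order
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mins_after
    FROM slots
)
-- Step 2: only now restrict to one person, so the offsets stay correct.
SELECT strftime('%H:%M', day_start, '+' || mins_before || ' minutes') AS slot_start,
       strftime('%H:%M', day_start, '+' || mins_after  || ' minutes') AS slot_end,
       slot_type
FROM timed
WHERE contact = ?
ORDER BY slot_start
"""
steve_slots = conn.execute(query, ("Steve Jones",)).fetchall()
for slot_start, slot_end, slot_type in steve_slots:
    print(slot_start, "-", slot_end, slot_type)
```

If the filter were applied inside the CTE instead, the second row would start at 10:20 rather than 10:40, which is the symptom described in the question.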