SQL (Snowflake) – 30-day look-ahead / rolling average

I have two tables – purchase and activity.
The purchase table is structured like so:
| user_id | purchase_date | status    |
|---------|---------------|-----------|
| 1234    | 2020-01-01    | active    |
| 2345    | 2020-01-10    | cancelled |
The activity table is structured like so:
| user_id | date       | videos_viewed |
|---------|------------|---------------|
| 1234    | 2020-01-02 | 4             |
| 2345    | 2020-01-03 | 3             |
| 2345    | 2020-01-10 | 10            |
| 2345    | 2020-01-11 | 7             |
I am looking to query a 30-day activity average covering each user's first 30 days, based on a set purchase period.
The query I have written so far is this:
SELECT avg(t3.viewsperday)
FROM
    (SELECT
        date,
        sum(t1.videos_viewed) / count(t1.user_id) AS viewsperday
    FROM activity t1
    INNER JOIN
        (SELECT * FROM purchase
         WHERE status = 'active'
           AND purchase_date BETWEEN '2020-01-01' AND '2020-02-01') t2
        ON t1.user_id = t2.user_id
    WHERE date BETWEEN '2020-01-01' AND '2020-02-01'
    GROUP BY 1
    ORDER BY 1 ASC) AS t3;
However, the problem here is that if a user purchased on 2020-01-31, I only get their first day of activity. I need help figuring out how to get the rolling average / look ahead 30 days from each purchase date, and get the average activity from those 30 days.
I suspect a window function would be appropriate here, but I am not sure how to formulate it as it is a bit outside of my knowledge. Any help would be greatly appreciated.

The following should work. I'm assuming that you want the average over 30 days even when there may have been zero views on some of those days? You may also need to adjust it slightly depending on exactly how you define the 30-day date range, i.e. is the 30th day included, is the purchase date included, etc.
I've written it as an outer join so that even users with no views will be included.
SELECT
    P.USER_ID,
    SUM(A.VIDEOS_VIEWED) / 30
FROM PURCHASE P
LEFT OUTER JOIN ACTIVITY A
    ON P.USER_ID = A.USER_ID
    AND A.DATE >= P.PURCHASE_DATE
    AND A.DATE <= DATEADD(DAY, 30, P.PURCHASE_DATE)
GROUP BY P.USER_ID;
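One caveat worth noting (my addition, not part of the original answer): SUM returns NULL for a user with no activity rows at all, so if you want those users to show as zero rather than NULL, wrap the aggregate in COALESCE:
SELECT
    P.USER_ID,
    COALESCE(SUM(A.VIDEOS_VIEWED), 0) / 30 AS AVG_VIEWS_PER_DAY  -- NULL sum becomes 0
FROM PURCHASE P
LEFT OUTER JOIN ACTIVITY A
    ON P.USER_ID = A.USER_ID
    AND A.DATE >= P.PURCHASE_DATE
    AND A.DATE <= DATEADD(DAY, 30, P.PURCHASE_DATE)
GROUP BY P.USER_ID;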
Update...
To get daily averages, try this (views on the purchase date show as day 0; add 1 to the Day_after_Purchase formula if that should be day 1):
SELECT
    (A.DATE - P.PURCHASE_DATE) AS Day_after_Purchase,
    AVG(A.VIDEOS_VIEWED)
FROM PURCHASE P
LEFT OUTER JOIN ACTIVITY A
    ON P.USER_ID = A.USER_ID
    AND A.DATE >= P.PURCHASE_DATE
    AND A.DATE <= DATEADD(DAY, 30, P.PURCHASE_DATE)
GROUP BY 1;
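A caveat on the daily averages (my addition): AVG only averages over users who actually have an activity row at a given day offset, so a user's zero-view days are not counted as zeros. If that matters, one sketch is to generate the 0–30 day offsets explicitly and outer-join the activity onto them; GENERATOR and SEQ4 are Snowflake features, and the status/date filters from the question are omitted for brevity:
WITH OFFSETS AS (
    -- day offsets 0..30
    SELECT SEQ4() AS D FROM TABLE(GENERATOR(ROWCOUNT => 31))
)
SELECT
    O.D AS Day_after_Purchase,
    AVG(COALESCE(A.VIDEOS_VIEWED, 0)) AS AVG_VIEWS  -- days with no views count as 0
FROM PURCHASE P
CROSS JOIN OFFSETS O
LEFT JOIN ACTIVITY A
    ON A.USER_ID = P.USER_ID
    AND A.DATE = DATEADD(DAY, O.D, P.PURCHASE_DATE)
GROUP BY O.D
ORDER BY O.D;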

Related

SQL Day-over-Day count miscalculation

I'm encountering a bug in my SQL code that calculates the day-over-day (DoD) count difference. This table (curr_day) summarizes the count of trades on any business day (i.e. excluding weekends and government-mandated holidays) and is joined to a similar table (prev_day) that is day-lagged (previous day). The joining is based on the day's rank; for example, the first day in the curr_day table is Jan-01 and its rank is 1, while the first day (rank 1) in the prev_day table is Dec-31.
My issue is that the trade count does not seem to calculate positive changes (see table below), only negative or no changes. This problem does not affect other fields that calculate the value of a trade, simply the amount of trades on a given day.
Sample of query
with curr_day as (
    select GROUP, DATE, COUNT
    from DB
    where DATE is not HOLIDAY),
prev_day as (
    select rank() over (partition by GROUP order by DATE) as RANK,
           GROUP, DATE, COUNT
    from curr_day
    where DATE is not HOLIDAY)
select curr_day.GROUP, curr_day.DATE, curr_day.COUNT,
       curr_day.COUNT - prev_day.COUNT as DoD_Cnt_Diff
from (select curr_day.*,
             rank() over (partition by curr_day.GROUP order by curr_day.DATE) as RANK
      from curr_day
      where curr_day.DATE >= (select min(curr_day.DATE) + 1 from curr_day)) curr_day
left join prev_day
    on curr_day.RANK = prev_day.RANK
   and curr_day.GROUP = prev_day.GROUP;
Output table
Date       | Group | Count | DoD_Cnt_Diff
-----------|-------|-------|-------------
2020-12-31 | A     | 1     | 0
2021-01-01 | A     | 1     | 0
2021-01-02 | A     | 0     | -1
2021-01-03 | A     | 1     | (null)
2021-01-04 | A     | 0     | -1
2021-01-05 | A     | 0     | 0
2021-12-31 | B     | 0     | 0
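For what it's worth, the rank-based self-join can usually be replaced by LAG, which pairs each business day with the previous one inside the window function itself. A minimal sketch, assuming a curr_day relation with columns GRP, TRADE_DATE, CNT (hypothetical names, since GROUP, DATE, and COUNT are reserved words):
select TRADE_DATE,
       GRP,
       CNT,
       -- difference vs. the previous business day; the first day per group yields null
       CNT - lag(CNT) over (partition by GRP order by TRADE_DATE) as DoD_Cnt_Diff
from curr_day;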

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So, for example, for all users with one or more check-ins, I want the percentage of users who checked in on their 2nd day, on their 3rd day, and so on.
My SQL skills are pretty basic, as it's not a tool I use often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this, but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date                |
|---------|---------------------|
| 1       | 2020-09-02 13:00:00 |
| 4       | 2020-09-04 12:00:00 |
| 1       | 2020-09-04 13:00:00 |
| 4       | 2020-09-04 11:00:00 |
| ...     | ...                 |
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
|-------|-------|-------|-------|
| 70%   | 67%   | 44%   | 32%   |
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum((ci.date = ci.min_date)::int)::numeric / u.num_users as day_0,
       sum((ci.date = ci.min_date + interval '1 day')::int)::numeric / u.num_users as day_1,
       sum((ci.date = ci.min_date + interval '2 day')::int)::numeric / u.num_users as day_2
from (select u.*, count(*) over () as num_users
      from users u
     ) u left join
     (select ci.user_id, ci.date::date as date,
             min(min(ci.date::date)) over (partition by ci.user_id) as min_date
      from check_in ci
      group by ci.user_id, ci.date::date
     ) ci
     on ci.user_id = u.user_id
group by u.num_users;
Note that this aggregates the checkins table by user id and date. This ensures that there is only one row per date.
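To present the fractions as percentages like the expected output, each expression can be scaled and formatted; a small variation on the day_0 column (my addition, same Postgres syntax as the answer):
round(100.0 * sum((ci.date = ci.min_date)::int) / u.num_users) || '%' as day_0  -- e.g. 70%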

SQL - BigQuery - How do I fill in dates from a calendar table?

My goal is to join a sales program table to a calendar table so that the result contains the full trailing 52 weeks by day, with the sales data joined onto it. The idea is that there would be nulls I could COALESCE after the fact. However, my problem is that I only get the rows where my sales data matched, without the nulls.
The questions I've consulted so far are:
Join to Calendar Table - 5 Business Days
Joining missing dates from calendar table, which points to
MySQL how to fill missing dates in range?
My Calendar table is all 364 days previous to today (today being day 0). And the sales data has a program field, a store field, and then a start date and an end date for the program.
Here's what I have coded:
SELECT
    CAL.DATE,
    CAL.DAY,
    SALES.ITEM,
    SALES.PROGRAM,
    SALES.SALE_DT,
    SALES.EFF_BGN_DT,
    SALES.EFF_END_DT
FROM CALENDAR_TABLE AS CAL
LEFT JOIN SALES_TABLE AS SALES
    ON CAL.DATE = SALES.SALE_DT
WHERE 1=1
    AND SALES.ITEM = 1 OR SALES.ITEM IS NULL
ORDER BY DATE ASC
What I expected was 365 records with dates where there were nulls and dates where there were filled in records. My query resulted in a few dates with null values but otherwise just the dates where a program exists.
DATE      | ITEM | PROGRAM | SALE_DT  | PRGM_BGN  | PRGM_END
----------|------|---------|----------|-----------|----------
8/27/2020 |      |         |          |           |
8/26/2020 |      |         |          |           |
8/25/2020 |      |         |          |           |
8/24/2020 |      |         |          |           |
6/7/2020  | 1    | 5       | 6/7/2020 | 2/13/2016 | 6/7/2020
6/6/2020  | 1    | 5       | 6/6/2020 | 2/13/2016 | 6/7/2020
6/5/2020  | 1    | 5       | 6/5/2020 | 2/13/2016 | 6/7/2020
6/4/2020  | 1    | 5       | 6/4/2020 | 2/13/2016 | 6/7/2020
Date = Calendar day.
Item = Item number being sold.
Program = Unique numeric ID of program.
Sale_Dt = Field populated if at least one item was sold under this program.
Prgm_bgn = First day when item was eligible to be sold under this program.
Prgm_end = Last day when item was eligible to be sold under this program.
What I would have expected is records between June 7 and August 24 with just the DATE column populated for each day and nulls elsewhere, as happens in the most recent four records.
I'm trying to understand why a calendar table and what I've written are not providing the in-between dates.
EDIT: I've removed the request for feedback to shorten the question as well as an example I don't think added value. But please continue to give feedback as you see necessary.
I'd be more than happy to delete this whole question or have someone else give a better answer, but after staring at the logic in some of the answers in this thread (MySQL how to fill missing dates in range?) long enough, I came up with this:
SELECT
    CAL.DATE,
    t.* EXCEPT (DATE)
FROM CALENDAR_TABLE AS CAL
LEFT JOIN
    (SELECT
        CAL.DATE,
        CAL.DAY,
        SALES.ITEM,
        SALES.PROGRAM,
        SALES.SALE_DT,
        SALES.EFF_BGN_DT,
        SALES.EFF_END_DT
    FROM CALENDAR_TABLE AS CAL
    LEFT JOIN SALES_TABLE AS SALES
        ON CAL.DATE = SALES.SALE_DT
    WHERE 1=1
        AND SALES.ITEM = 1 OR SALES.ITEM IS NULL
    ORDER BY DATE ASC) t
    ON CAL.DATE = t.DATE
From what I can tell, it seems to be what I needed. It allows for the subquery to connect a date to all those records, then just joins on the calendar table again solely on date to allow for those nulls to be created.
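For what it's worth, a single LEFT JOIN may do the same job if the item filter moves from WHERE into the ON clause, so unmatched calendar rows survive the join instead of being filtered out afterwards (a sketch, not tested against your data):
SELECT
    CAL.DATE,
    CAL.DAY,
    SALES.ITEM,
    SALES.PROGRAM,
    SALES.SALE_DT,
    SALES.EFF_BGN_DT,
    SALES.EFF_END_DT
FROM CALENDAR_TABLE AS CAL
LEFT JOIN SALES_TABLE AS SALES
    ON CAL.DATE = SALES.SALE_DT
    AND SALES.ITEM = 1  -- filtering in ON keeps calendar rows with no matching sale
ORDER BY CAL.DATE ASC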

Retrieving 52 weeks after the result of a subquery

From a table that contains sales, I retrieved the last week of that table. That gives me the last week in which sales were made. 'Date' is always the first day of the month, but it doesn't matter; the really important data is week and partial_week.
The result is simple :
+------------+---------+--------------+
| Date | Week | Partial_week |
+------------+---------+--------------+
| 2020-02-01 | 2020-09 | 2020M02W09 |
+------------+---------+--------------+
Let's call it t1
I have a table with the first day of each month, and every week and partial week from 2015 to 2025 (when a week spans two months, it is split into two partial weeks that have the same number but different months). It looks like this:
+------------+---------+--------------+
| Date | Week | Partial_week |
+------------+---------+--------------+
| 2020-02-01 | 2020-05 | 2020M02W05 |
| 2020-02-01 | 2020-06 | 2020M02W06 |
| 2020-02-01 | 2020-07 | 2020M02W07 |
| 2020-02-01 | 2020-08 | 2020M02W08 |
| 2020-02-01 | 2020-09 | 2020M02W09 |
| 2020-03-01 | 2020-09 | 2020M03W09 |
+------------+---------+--------------+
Let's call it t2
I now need to retrieve everything in t2 that is between 1 and 52 weeks after my week retrieved in t1 (this should get me every week and partial week until 2021-09 or so).
I thought about having a 'select top 52 distinct week from t2', joining on t1 with a where clause 'where t1.week < t2.week', then joining everything on t2 again to get every partial week too, but that doesn't work because on every week t1.week is equal to null (I wish t1.week could just be a variable, since it only has one row...).
Any ideas would be appreciated.
Your logic seems to be close. Put the initial query in a Scalar Subquery to handle it like a variable:
select *
from t2
where t2.week >=
( select week from t1 -- i.e. your existing query to return the latest week
)
qualify
dense_rank()
over (order by week) <= 52
You can also switch to a join:
select *
from t2
join
( select week from t1 -- i.e. your existing query to return the latest week
) as t1
on t2.week >= t1.week
qualify
dense_rank() -- next 52 week & partial weeks
over (order by t2.week) <= 52
The explain plan of the Scalar Subquery version might be better.
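Note that QUALIFY is Teradata/Snowflake syntax; if your DBMS lacks it, the same filter can be pushed into a derived table (a sketch). Either way, dense_rank rather than row_number is what keeps all partial weeks of the same week together within the 52-week window:
select Date, Week, Partial_week
from (
    select t2.*,
           dense_rank() over (order by t2.week) as rnk
    from t2
    where t2.week >= (select week from t1)
) x
where rnk <= 52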

Finding correlated values from second table without resorting to PL/SQL

I have the following two tables in my database:
a) A table containing values acquired at a certain date (you may think of these as, say, temperature readings):
sensor_id | acquired            | value
----------+---------------------+------
1         | 2009-04-01 10:00:00 | 20
1         | 2009-04-01 10:01:00 | 21
1         | 2009-04-01 10:02:00 | 20
1         | 2009-04-01 10:09:00 | 20
1         | 2009-04-01 10:11:00 | 25
1         | 2009-04-01 10:15:00 | 30
...
The interval between the readings may differ, but the combination of (sensor_id, acquired) is unique.
b) A second table containing time periods and a description (you may think of these as, say, periods when someone turned on the radiator):
sensor_id | start_date | end_date | description
----------+---------------------+---------------------+------------------
1 | 2009-04-01 10:00:00 | 2009-04-01 10:02:00 | some description
1 | 2009-04-01 10:10:00 | 2009-04-01 10:14:00 | something else
Again, the length of the period may differ, but there will never be overlapping time periods for any given sensor.
I want to get a result that looks like this for any sensor and any date range:
sensor_id | start_date          | v1 | end_date            | v2 | description
----------+---------------------+----+---------------------+----+-----------------
1         | 2009-04-01 10:00:00 | 20 | 2009-04-01 10:02:00 | 20 | some description
1         | 2009-04-01 10:10:00 | 25 | 2009-04-01 10:14:00 | 30 | something else
Or in text from: given a sensor_id and a date range of range_start and range_end,
find me all time periods which overlap the date range (that is, start_date < range_end and end_date > range_start), and for each of these rows, find the corresponding values from the value table at the time period's start_date and end_date (the first row with acquired > start_date, and likewise the first row with acquired > end_date).
If it wasn't for the start_value and end_value columns, this would be a textbook trivial example of how to join two tables.
Can I somehow get the output I need in one SQL statement without resorting to writing a PL/SQL function to find these values?
Unless I have overlooked something blatantly obvious, this can't be done with simple subselects.
Database is Oracle 11g, so any Oracle-specific features are acceptable.
Edit: yes, looping is possible, but I want to know if this can be done with a single SQL select.
You can give this a try. Note the caveats at the end though.
SELECT
    RNG.sensor_id,
    RNG.start_date,
    RDG1.value AS v1,
    RNG.end_date,
    RDG2.value AS v2,
    RNG.description
FROM
    Ranges RNG
    INNER JOIN Readings RDG1 ON
        RDG1.sensor_id = RNG.sensor_id AND
        RDG1.acquired >= RNG.start_date
    LEFT OUTER JOIN Readings RDG1_NE ON
        RDG1_NE.sensor_id = RDG1.sensor_id AND
        RDG1_NE.acquired >= RNG.start_date AND
        RDG1_NE.acquired < RDG1.acquired
    INNER JOIN Readings RDG2 ON
        RDG2.sensor_id = RNG.sensor_id AND
        RDG2.acquired >= RNG.end_date
    LEFT OUTER JOIN Readings RDG2_NE ON
        RDG2_NE.sensor_id = RDG2.sensor_id AND
        RDG2_NE.acquired >= RNG.end_date AND
        RDG2_NE.acquired < RDG2.acquired
WHERE
    RDG1_NE.sensor_id IS NULL AND
    RDG2_NE.sensor_id IS NULL
This uses the first reading after the start date of the range and the first reading after the end date (personally, I'd think using the last date before the start and end would make more sense or the closest value, but I don't know your application). If there is no such reading then you won't get anything at all. You can change the INNER JOINs to OUTER and put additional logic in to handle those situations based on your own business rules.
It seems pretty straightforward.
Find the sensor values for each range. Find a row - I will call its acquired value just X - where X > start_date and there does not exist any other row with acquired > start_date and acquired < X. Do the same for the end date.
Select only the ranges that overlap the date range supplied by the query - start_date before the range's end and end_date after the range's start.
In SQL this would be something like this:
SELECT R1.*, SV1.value AS v1, SV2.value AS v2
FROM ranges R1
INNER JOIN sensor_values SV1 ON SV1.sensor_id = R1.sensor_id
INNER JOIN sensor_values SV2 ON SV2.sensor_id = R1.sensor_id
WHERE SV1.acquired > R1.start_date
  AND NOT EXISTS (
      SELECT *
      FROM sensor_values SV3
      WHERE SV3.sensor_id = R1.sensor_id
        AND SV3.acquired > R1.start_date
        AND SV3.acquired < SV1.acquired)
  AND SV2.acquired > R1.end_date
  AND NOT EXISTS (
      SELECT *
      FROM sensor_values SV4
      WHERE SV4.sensor_id = R1.sensor_id
        AND SV4.acquired > R1.end_date
        AND SV4.acquired < SV2.acquired)
  AND R1.start_date < #range_end
  AND R1.end_date > #range_start
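Since Oracle-specific features are acceptable, another option (my sketch, reusing the Ranges/Readings names from the first answer) is a scalar subquery with KEEP (DENSE_RANK FIRST), which picks the value of the earliest qualifying reading per boundary without the anti-join:
SELECT RNG.sensor_id,
       RNG.start_date,
       (SELECT MIN(R.value) KEEP (DENSE_RANK FIRST ORDER BY R.acquired)
          FROM Readings R
         WHERE R.sensor_id = RNG.sensor_id
           AND R.acquired > RNG.start_date) AS v1,
       RNG.end_date,
       (SELECT MIN(R.value) KEEP (DENSE_RANK FIRST ORDER BY R.acquired)
          FROM Readings R
         WHERE R.sensor_id = RNG.sensor_id
           AND R.acquired > RNG.end_date) AS v2,
       RNG.description
  FROM Ranges RNG
 WHERE RNG.start_date < #range_end   -- same placeholder parameters as above
   AND RNG.end_date > #range_start;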