Create query returning field containing dates of last 365 days - sql

I'm using AWS QuickSight to build a dashboard with analytics and metrics related to the usage of a system. I'm trying to visualize user registrations over time. I've created a parameter and control on my dashboard that lets the dashboard user select 'Last N days' (7, 30, 60, 90, 180, 365 days), and I have an associated line chart that plots the related data.
However, on some days no users registered, and that leaves gaps of seemingly unreported data in the line chart. What I would like to do is JOIN my current query on day with a query that returns one row per day for the last 365 days, each row holding a single date field.
Select count (DISTINCT id), date_trunc('day', created_at) as day
FROM users
GROUP BY day
ORDER BY day desc

To get dates instead of numbers, you can use the query below:
Query:
with recursive date_range(day,daycount) as
(
SELECT '1 Jan 2020'::date as DAY, 1 as daycount
UNION ALL
SELECT day+1, daycount+1 from date_range WHERE daycount<365
)select day from date_range
Output:
| day |
| :--------- |
| 2020-01-01 |
| 2020-01-02 |
| 2020-01-03 |
| ... |
| 2020-12-28 |
| 2020-12-29 |
| 2020-12-30 |
You can use a recursive common table expression to generate the dates, then join that CTE with your table. Please check out the code below.
with recursive date_range(day) as
(
SELECT 1 as day
UNION ALL
SELECT day+1
from date_range
WHERE day < 365
)select DATE_TRUNC('day', NOW() - concat(day,' days')::interval ) as date from date_range
Output:
|date |
------------------------
|2021-06-10 00:00:00+01|
|2021-06-09 00:00:00+01|
|2021-06-08 00:00:00+01|
|2021-06-07 00:00:00+01|
|2021-06-06 00:00:00+01|
|2021-06-05 00:00:00+01|
|2021-06-04 00:00:00+01|
|2021-06-03 00:00:00+01|
|2021-06-02 00:00:00+01|
|2021-06-01 00:00:00+01|
...
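The join itself is not shown above, so here is a minimal sketch of how it might look. It assumes PostgreSQL (the question already uses date_trunc) and the users table with id and created_at from the question; generate_series is swapped in for the recursive CTE purely for brevity, and the days alias is made up for illustration.

```sql
-- Sketch: one row per day for the last 365 days (today included),
-- with a zero count on days where nobody registered.
WITH days AS (
    SELECT generate_series(
               current_date - 364,      -- 364 days back + today = 365 rows
               current_date,
               interval '1 day'
           )::date AS day
)
SELECT d.day,
       count(DISTINCT u.id) AS registrations   -- 0 when the LEFT JOIN finds no user
FROM days d
LEFT JOIN users u
       ON u.created_at >= d.day
      AND u.created_at <  d.day + 1            -- half-open range covering one calendar day
GROUP BY d.day
ORDER BY d.day DESC;
```

Because the date series is on the left side of the LEFT JOIN, every day appears in the result, and count() ignores the NULL ids produced for days without registrations.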

Related

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So for example, for all users with one or more check-ins, I want the percentage of users who checked in on their 2nd day, on their 3rd day and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
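-- Fraction of all users who checked in N days after their first check-in.
-- Assumes users.id is the key that checkins.user_id refers to.
select sum( (ci.date = ci.min_date)::int )::numeric / u.num_users as day_0,
       sum( (ci.date = ci.min_date + interval '1 day')::int )::numeric / u.num_users as day_1,
       sum( (ci.date = ci.min_date + interval '2 day')::int )::numeric / u.num_users as day_2
from (select u.*, count(*) over () as num_users
      from users u
     ) u left join
     (select ci.user_id, ci.date::date as date,
             -- first check-in date per user, repeated on each of that user's rows
             min(min(ci.date::date)) over (partition by ci.user_id) as min_date
      from checkins ci
      group by ci.user_id, ci.date::date
     ) ci
     on ci.user_id = u.id
group by u.num_users;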
Note that this aggregates the checkins table by user id and date. This ensures that there is only one row per date.

SQL: How to construct a time series from irregular data and then subsequently calculate a rolling average over it

I am trying to calculate a rolling average of data from incident reports. The exact quantity I'm looking for is the 30-day-mean-time-to-resolution (mttr) which means the average of the time it takes to resolve incidents in the last 30 days.
My incidents table looks something like this:
| incident_id | start_datetime | end_datetime |
|-------------|-----------------------|-----------------------|
| 1 | '2020-02-01T10:13:00' | '2020-02-01T10:59:33' |
| 2 | '2020-02-01T17:55:13' | '2020-02-02T00:35:28' |
| 3 | '2020-02-03T13:33:01' | '2020-02-03T15:54:01' |
What I want is something like this (the numbers are made up, so don't try to calculate them; just note that the datetime intervals are every hour):
| datetime | mttr_last30days_in_hours |
|-----------------------|--------------------------|
| '2020-02-01T10:00:00' | 5.7 |
| '2020-02-01T11:00:00' | 5.6 |
| '2020-02-02T12:00:00' | 5.8 |
I can calculate the mttr in the last 30 days really easily if I'm doing it just for one point in time:
SELECT avg(end_datetime - start_datetime) mttr_last30days_in_hours
FROM incidents
WHERE datetime_diff(current_datetime(), start_datetime, DAY) <= 30
The problem is this just gives me ONE number. How do I create a time series spanning the range of say, the first incident's start_datetime (min(start_datetime)) to the current time, and then get a rolling 30 day average with evenly spaced, hourly intervals (like the example table above)?
If you have a unique field in your table, you can try the following:
WITH
t_filter AS(
SELECT
*
FROM
incidents
WHERE datetime_diff(current_datetime(), start_datetime, DAY) <= 30
),
t_dates AS (
SELECT
unique_key,
GENERATE_DATE_ARRAY(DATE(start_datetime), CURRENT_DATE(), INTERVAL 1 DAY) AS date_array
FROM
t_filter
),
t_hour AS (
SELECT *
FROM
UNNEST(["00:00:00",
"01:00:00",
"02:00:00",
"03:00:00",
"04:00:00",
"05:00:00",
"06:00:00",
"07:00:00",
"08:00:00",
"09:00:00",
"10:00:00",
"11:00:00",
"12:00:00",
"13:00:00",
"14:00:00",
"15:00:00",
"16:00:00",
"17:00:00",
"18:00:00",
"19:00:00",
"20:00:00",
"21:00:00",
"22:00:00",
"23:00:00"]) h
),
sequence AS(
SELECT
unique_key,
CONCAT(CAST(arr AS string),"T", h) date_hour
FROM
t_dates,
UNNEST(date_array) arr,
t_hour
)
SELECT
date_hour,
AVG(end_datetime - start_datetime)
FROM
sequence
LEFT JOIN
t_filter
ON
t_filter.unique_key = sequence.unique_key
GROUP BY
date_hour
I hope it helps
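For the rolling part of the question, a different approach is to build the hourly grid first and then match each hour against the incidents that started in the 30 days before it. The following is only a sketch in BigQuery standard SQL: the inline incidents CTE just reproduces the three sample rows from the question, and the names bounds, hours and hour_start are made up for illustration.

```sql
-- Sketch: hourly time series with a rolling 30-day mean time to resolution.
WITH incidents AS (
  -- sample rows from the question; replace with the real table
  SELECT 1 AS incident_id,
         DATETIME '2020-02-01T10:13:00' AS start_datetime,
         DATETIME '2020-02-01T10:59:33' AS end_datetime
  UNION ALL SELECT 2, DATETIME '2020-02-01T17:55:13', DATETIME '2020-02-02T00:35:28'
  UNION ALL SELECT 3, DATETIME '2020-02-03T13:33:01', DATETIME '2020-02-03T15:54:01'
),
bounds AS (
  -- hour in which the first incident started
  SELECT DATETIME_TRUNC(MIN(start_datetime), HOUR) AS first_hour
  FROM incidents
),
hours AS (
  -- one row per hour from the first incident until now
  SELECT DATETIME(ts) AS hour_start
  FROM bounds,
       UNNEST(GENERATE_TIMESTAMP_ARRAY(
         TIMESTAMP(first_hour), CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS ts
)
SELECT
  h.hour_start,
  AVG(DATETIME_DIFF(i.end_datetime, i.start_datetime, MINUTE)) / 60
    AS mttr_last30days_in_hours
FROM hours AS h
CROSS JOIN incidents AS i
-- keep only incidents that started in the 30 days up to this hour
WHERE i.start_datetime > DATETIME_SUB(h.hour_start, INTERVAL 30 DAY)
  AND i.start_datetime <= h.hour_start
GROUP BY h.hour_start
ORDER BY h.hour_start
```

Hours with no incident in the preceding 30 days drop out of the result, because the filter removes every pairing for them. The cross join also grows with hours times incidents, so on a large table it may be worth restricting the hour range.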

Moving average last 30 days

I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset contains user ids, dates and events triggered by the user saved in BigQuery. A user is active by opening a mobile app triggering the event session_start. Example of the unnested dataset.
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for.
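To spell that out: ROWS BETWEEN 2592000 PRECEDING counts rows, not seconds, so with one row per day the frame covers every row seen so far. A frame based on the ordering value needs RANGE. Below is a sketch with the three sample rows from the question inlined; the output column name is made up. Keep in mind that even with the frame fixed this only sums daily uniques, so a device active on several days is counted several times.

```sql
#standardSQL
WITH daily_aggregation AS (
  -- sample rows from the question's result table
  SELECT DATE '2018-06-05' AS day, 1807 AS unique_resettable_device_ids
  UNION ALL SELECT DATE '2018-06-06', 711
  UNION ALL SELECT DATE '2018-06-07', 96
)
SELECT
  day,
  unique_resettable_device_ids,
  SUM(unique_resettable_device_ids) OVER (
    ORDER BY UNIX_DATE(day)                     -- days since 1970-01-01, as INT64
    RANGE BETWEEN 29 PRECEDING AND CURRENT ROW  -- the current day plus the 29 before it
  ) AS sum_of_daily_uniques_30_days
FROM daily_aggregation
ORDER BY day
```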
Check out https://stackoverflow.com/a/49866033/132438, where the question asks specifically about counting uniques in a rolling window. It turns out to be a very slow operation, given how much memory it requires.
The solution when you want a rolling count of uniques: go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Working solution for a weekly calculation of the number of active users in the last 30 days.
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC

In a table containing rows of date ranges, from each row, generate one row per day containing hours of utilization

Given a table with rows like:
+----+-------------------------+------------------------+
| ID | StartDate | EndDate |
+----+-------------------------+------------------------+
| 1 | 2016-02-05 20:00:00.000 | 2016-02-07 5:00:00.000 |
+----+-------------------------+------------------------+
I want to produce a table like this:
+----+------------+----------+
| ID | Date | Duration |
+----+------------+----------+
| 1 | 2016-02-05 | 4 |
| 1 | 2016-02-06 | 24 |
| 1 | 2016-02-07 | 5 |
+----+------------+----------+
This is an interview-style question. I am wondering how I can go about tackling this. Is it possible to do this with just standard SQL query syntax? Or is a procedural language like pl/pgSQL required to do a query like this?
The basic idea is this:
SELECT date_trunc('day', dayhour) as dd,count(*)
FROM (VALUES (1, '2016-02-05 20:00:00.000'::timestamp, '2016-02-07 5:00:00.000'::timestamp)
) v(ID, StartDate, EndDate), lateral
generate_series(StartDate, EndDate, interval '1 hour') g(dayhour)
GROUP BY dd
ORDER BY dd;
That adds an extra hour, so this is more accurate:
SELECT date_trunc('day', dayhour) as dd,count(*)
FROM (VALUES (1, '2016-02-05 20:00:00.000'::timestamp, '2016-02-07 5:00:00.000'::timestamp)
) v(ID, StartDate, EndDate), lateral
generate_series(StartDate, EndDate - interval '1 hour', interval '1 hour') g(dayhour)
GROUP BY dd
ORDER BY dd;
Technically, the lateral is not needed (and in that case, I would replace the comma with cross join). However, this is an example of a lateral join, so being explicit is good.
I should also note that the above is the simplest method. However, the group by does slow down the query. There are other methods that don't require generating a series for every hour.
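One such method, as a rough sketch: generate one row per calendar day instead of one per hour, and compute how much of the range overlaps each day with least() and greatest(). This assumes PostgreSQL, like the queries above; the aliases d and day are made up for illustration.

```sql
-- Sketch: one generated row per day; duration is the overlap between the
-- [StartDate, EndDate) range and that calendar day, in hours.
SELECT v.id,
       g.d::date AS day,
       extract(epoch FROM least(v.enddate, g.d + interval '1 day')
                        - greatest(v.startdate, g.d)) / 3600 AS duration
FROM (VALUES (1, '2016-02-05 20:00:00.000'::timestamp, '2016-02-07 5:00:00.000'::timestamp)
     ) v(id, startdate, enddate)
CROSS JOIN LATERAL
     generate_series(date_trunc('day', v.startdate),  -- midnight of the first day
                     v.enddate,
                     interval '1 day') AS g(d)
ORDER BY v.id, day;
```

For the sample row this produces three rows with 4, 24 and 5 hours. If EndDate falls exactly on midnight, the last generated day gets a zero-hour row, which can be filtered out if it is not wanted.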

Populating a table with all dates in a given range in Google BigQuery

Is there any convenient way to populate a table with all dates in a given range in Google BigQuery? What I need are all dates from 2015-06-01 till CURRENT_DATE(), so something like this:
+------------+
| date |
+------------+
| 2015-06-01 |
| 2015-06-02 |
| 2015-06-03 |
| ... |
| 2016-07-11 |
+------------+
Optimally, the next step would be to also get all weeks between the two dates, i.e.:
+---------+
| week |
+---------+
| 2015-23 |
| 2015-24 |
| 2015-25 |
| ... |
| 2016-28 |
+---------+
I've been fiddling around with the following answers I found, but I can't get them to work, mostly because core functions aren't supported and I can't find proper ways to replace them.
Easiest way to populate a temp table with dates between and including 2 date parameters
Generate Dates between date ranges
Your help is very much appreciated!
Best,
Max
Mikhail's answer works perfectly for BigQuery's legacy SQL syntax. This solution is slightly simpler if you're using the standard SQL syntax.
BigQuery standard SQL actually has a built-in function, GENERATE_DATE_ARRAY, for creating an array from a date range. It takes a start date, an end date and an INTERVAL. For example:
SELECT day
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS day
If you wanted the week and year you could use
SELECT EXTRACT(YEAR FROM day), EXTRACT(WEEK FROM day)
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 WEEK)
) AS day
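If the weeks should come out formatted like 2015-23, as in the question, a possible sketch is to format each day and keep the distinct values. It assumes ISO 8601 week numbering is what is wanted (%G is the ISO year, %V the ISO week); stepping by day rather than by week makes sure the final, partial week is included.

```sql
#standardSQL
-- Sketch: every ISO week between 2015-06-01 and today, formatted as YYYY-WW.
SELECT DISTINCT FORMAT_DATE('%G-%V', day) AS week
FROM UNNEST(
  GENERATE_DATE_ARRAY(DATE '2015-06-01', CURRENT_DATE(), INTERVAL 1 DAY)
) AS day
ORDER BY week
```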
All dates from 2015-06-01 till CURRENT_DATE() (legacy SQL):
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
All weeks between the two dates (legacy SQL):
SELECT YEAR(DAY) AS y, WEEK(DAY) AS w
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
)
GROUP BY y, w