How to generate date series to occupy absent dates in google BiqQuery? - sql

I am trying to get daily sum of sales from a google big-query table. I used following code for that.
select Day(InvoiceDate) date, Sum(InvoiceAmount) sales from test_gmail_com.sales
where year(InvoiceDate) = Year(current_date()) and
Month(InvoiceDate) = Month(current_date())
group by date order by date
From the above query it gives only the sum of sales daily which were in the table. There is a chance that some days do not have any sales. For those kind of situations, I need to get the date and sum should be 0. As an example, in every month should 30 0r 31 rows with sum of sales. Examples show below. 4th day of the month does not have a sales. So its sum should be 0.
date | sales
-----+------
1 | 259
-----+------
2 | 359
-----+------
3 | 45
-----+------
4 | 0
-----+------
5 | 156
Is it possible to do in Big-query? Basically date column should be a series from 1 - 28/29/30 or 31st depending on the month of the year

Generting a list of dates and then joining whatever table you need on top seems the easiest. I used the generate_date_array + unnest and it looks quite clean.
To generate a list of days (one day per row):
SELECT
*
FROM
UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example

You can use below to generate on fly all dates in given range (in below example it is all dates from 2015-06-01 till CURRENT_DATE() - by changing those you can control which dates range to generate)
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
so, now - you can use it with LEFT JOIN with your table to have all dates accounted. See potential example below
SELECT
calendar_day,
IFNULL(sales, 0) AS sales
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
) AS all_dates
LEFT JOIN (
SELECT DAY(InvoiceDate) DATE, SUM(InvoiceAmount) sales
FROM test_gmail_com.sales
WHERE YEAR(InvoiceDate) = YEAR(CURRENT_DATE()) AND
MONTH(InvoiceDate) = MONTH(CURRENT_DATE())
GROUP BY DATE
)
ON DATE = calendar_day
I wanna need to get previous months sales
Below gives all days of previous month
SELECT DATE(DATE_ADD(DATE_ADD(DATE_ADD(CURRENT_DATE(), -1, "MONTH"), 1 - DAY(CURRENT_DATE()), "DAY"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(DATE_ADD(CURRENT_DATE(), - DAY(CURRENT_DATE()), "DAY"), DATE_ADD(DATE_ADD(CURRENT_DATE(), -1, "MONTH"), 1 - DAY(CURRENT_DATE()), "DAY")), '.'),'') AS h
FROM (SELECT NULL)),h
)))

Using the Standard SQL dialect and the generate_array function to simplify the code:
WITH serialnum AS (
SELECT
sn
FROM
UNNEST(GENERATE_ARRAY(0,
DATE_DIFF(DATE_ADD(DATE_TRUNC(CURRENT_DATE()
, MONTH)
, INTERVAL 1 MONTH)
, DATE_TRUNC(CURRENT_DATE(), MONTH)
, DAY) - 1)
) AS sn
), date_seq AS (
SELECT
DATE_ADD(DATE_TRUNC(CURRENT_DATE(), MONTH),
INTERVAL(sn) DAY) AS this_day
FROM
serialnum
)
SELECT
Day(InvoiceDate) date
, Sum(IFNULL(InvoiceAmount, 0)) sales
FROM
date_seq
LEFT JOIN
test_gmail_com.sales
ON
date_seq.this_day = DAY(test_gmail_com.sales.InvoiceDate)
WHERE
year(InvoiceDate) = Year(current_date())
and
Month(InvoiceDate) = Month(current_date())
GROUP BY
date
ORDER BY
date
;
UPDATE
Or, simpler still using the generate_date_array function:
WITH date_seq AS (
SELECT
GENERATE_DATE_ARRAY(DATE_TRUNC(CURRENT_DATE(), MONTH),
DATE_ADD(DATE_ADD(DATE_TRUNC(CURRENT_DATE(), MONTH)
, INTERVAL 1 MONTH)
, INTERVAL -1 DAY)
, INTERVAL 1 DAY)
AS this_day
)
SELECT
Day(InvoiceDate) date
, Sum(IFNULL(InvoiceAmount, 0)) sales
FROM
date_seq
LEFT JOIN
test_gmail_com.sales
ON
date_seq.this_day = DAY(test_gmail_com.sales.InvoiceDate)
WHERE
year(InvoiceDate) = Year(current_date())
and
Month(InvoiceDate) = Month(current_date())
GROUP BY
date
ORDER BY
date
;

For these purposes it is practical to have a 'calendar' table, a table that just lists all the days within a certain range. For your specific question, it would suffice to have a table with the numbers 1 to 31. A quick way to get this table is to make a spreadsheet with these numbers, save it as a csv file and import this file into BigQuery as a table.
You then left outer join your result set onto this table, with ifnull(sales,0) as sales.
If you want the number of days per month (28--31) to be right, you basically have two options. Either you create a proper calendar table that covers several years and that you join on using year, month and day. Or you use the simple table with numbers 1--31 and remove numbers based on the month and the year.

For Standard SQL
WITH
splitted AS (
SELECT
*
FROM
UNNEST( SPLIT(RPAD('',
1 + DATE_DIFF(CURRENT_DATE(), DATE("2015-06-01"), DAY),
'.'),''))),
with_row_numbers AS (
SELECT
ROW_NUMBER() OVER() AS pos,
*
FROM
splitted),
calendar_day AS (
SELECT
DATE_ADD(DATE("2015-06-01"), INTERVAL (pos - 1) DAY) AS day
FROM
with_row_numbers)
SELECT
*
FROM
calendar_day
ORDER BY
day DESC

Related

Taking Count Based On Year and Month from Date Columns

I want to take count based on from and to date. using from and to date I am trying to take year and month then based on month and year taking count. can someone suggest me how can i implement this.
Database : Snowflake
You want to do more less the solution to this other question
but here let me do all the work for you:
WITH data_table(start_date, end_date) as (
SELECT * from values
('2022-01-15'::date, '2022-02-12'::date),
('2021-12-25'::date, '2022-03-18'::date),
('2022-02-25'::date, '2022-03-06'::date),
('2021-10-20'::date, '2022-01-07'::date)
), large_range as (
SELECT row_number() over (order by null)-1 as rn
FROM table(generator(ROWCOUNT => 1000))
), pre_condition as (
SELECT
date_trunc('month', start_date) as month_start
,datediff('month', month_start, date_trunc('month', end_date)) as m
FROM data_table
)
SELECT
to_char(dateadd('month', r.rn, d.month_start),'MON-YY') as month_yr
,count(*) as count
FROM pre_condition as d
JOIN large_range as r ON r.rn <= d.m
GROUP BY 1;
MONTH_YR
COUNT
Jan-22
3
Dec-21
2
Feb-22
3
Oct-21
1
Nov-21
1
Mar-22
2

How to get total amount per previous weeks

I have this table for example:
Date
amount
2021-02-16T21:06:38
10
2021-02-16T21:07:01
5
2021-02-17T01:10:12
-1
2021-02-19T12:00:00
3
2021-02-24T12:00:00
20
2021-02-25T12:00:00
-1
I want the total amount of all previous weeks, per week. So the result in this case would be:
Date
amount
2021-02-15
0
2021-02-22
17
2021-03-01
36
Note: The dates are now the start of each week (Monday).
Any help would this would be greatly appreciated.
Try This:
select week_date, sum(amount) over (order by week_date )
from (
SELECT date(date_) + cast(abs(extract(dow FROM date_) -7 ) + 1 as int) "week_date",
sum(amount) "amount"
from example group by 1) t
DEMO
Above Query will cover only the week in which transaction records are there. If you want to cover all missing week then try below query:
with cte as (
SELECT date(date_) + cast(abs(extract(dow FROM date_) -7 ) + 1 as int) "week_date",
sum(amount) "amount"
from example group by 1
)
select
t1."Date",coalesce(sum(cte.amount) over (order by t1."Date"),0)
from cte right join
(select generate_series(min(week_date)- interval '1 week', max(week_date),interval '1 week') "Date" from cte) t1 on cte.week_date=t1."Date"
DEMO
Use generate_series() to generate the dates you want. Then use left join to bring in the data and aggregate with a cumulative sum:
select gs.week,
coalesce(sum(e.amount), 0) as week_amount,
sum(coalesce(sum(e.amount), 0)) over (order by gs.week) as running_amount
from generate_series('2021-02-15'::date, '2021-03-01'::date, interval '1 week') gs(week) left join
example e
on e.date < gs.week and
e.date >= gs.week - interval '1 week'
group by gs.week
order by gs.week;
Here is a db<>fiddle.

sql user retention calculation

I have a table records like this in Athena, one user one row in a month:
month, id
2020-05 1
2020-05 2
2020-05 5
2020-06 1
2020-06 5
2020-06 6
Need to calculate the percentage=( users come both prior month and current month )/(prior month total users).
Like in the above example, users come both in May and June 1,5 , May total user 3, this should calculate a percentage of 2/3*100
with monthly_mau AS
(SELECT month as mauMonth,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth,
count(distinct userid) AS monthly_mau
FROM records
GROUP BY month
ORDER BY month),
retention_mau AS
(SELECT
month,
count(distinct useridLeft) AS retention_mau
FROM (
(SELECT
userid as useridLeft,month as monthLeft,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth
FROM records ) AS prior
INNER JOIN
(SELECT
month ,
userid
FROM records ) AS current
ON
prior.useridLeft = current.userid
AND prior.nextMonth = current.month )
WHERE userid is not null
GROUP BY month
ORDER BY month )
SELECT *, cast(retention_mau AS double)/cast(monthly_mau AS double)*100 AS retention_mau_percentage
FROM monthly_mau as m
INNER JOIN monthly_retention_mau AS r
ON m.nextMonth = r.month
order by r.month
This gives me percentage as 100 which is not right. Any idea?
Hmmm . . . assuming you have one row per user per month, you can use window functions and conditional aggregation:
select month, count(*) as num_users,
sum(case when prev_month = dateadd('month', -1, month) then 1 else 0 end) as both_months
from (select r.*,
cast(concat(month, '-01') AS date) as month_date,
lag(cast(concat(month, '-01') AS date)) over (partition by id order by month) as prev_month_date
from records r
) r
group by month;

Check any missing days record in Bigquery PartitionTable

I have a Bigquery table with a date partition key.
I get daily records in that table and I try to find if there's any missing day for like 3 years of historical data.
So I tried to use the following query:
SELECT KeyPartitionDate
FROM (
SELECT KeyPartitionDate, DATE(KeyPartitionDate) as day, DATE_ADD(date(KeyPartitionDate), INTERVAL 1 DAY) AS dayplusone
FROM `project.dataset.table`
)
WHERE DATE_DIFF(day, dayplusone , DAY) > 1
GROUP BY KeyPartitionDate
ORDER BY KeyPartitionDate
The query is valid but returns no results while I know there are some...
My guess is that I'm messing with the DATE_ADD function but cant tell how
Below is for BigQuery Standard SQL and just gives you the list of missing days
#standardSQL
SELECT day AS missing_days
FROM (
SELECT MIN(KeyPartitionDate) min_day, MAX(KeyPartitionDate) max_day
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day)) day
LEFT JOIN (
SELECT DISTINCT KeyPartitionDate AS day
FROM `project.dataset.table`
) t
USING(day)
WHERE t.day IS NULL
You went about this the wrong way:
day = DATE(KeyPartitionDate)
then you did
dayplusone = DATE_ADD(date(KeyPartitionDate), INTERVAL 1 DAY)
which is basically saying dayplusone = day +(1 day)
Then you do :
WHERE DATE_DIFF(day, dayplusone , DAY) > 1
which is like saying : dayplusone - day > (1 day) which would mean
day + (1 day) - day > (1 day)
You can clearly see why that is wrong.
What you needed to do instead is compare the current row date with the preivous row date. That is achieved using window functions:
SELECT KeyPartitionDate FROM (
SELECT DISTINCT KeyPartitionDate,
LAG(KeyPartitionDate)
OVER (ORDER BY KeyPartitionDate ASC) AS PreviousKeyPartitionDate
FROM `project.dataset.table`)
WHERE DATE_DIFF(DATE(PreviousKeyPartitionDate),DATE(KeyPartitionDate), DAY ) > 1
ORDER BY KeyPartitionDate

Google BigQuery: Rolling Count Distinct

I have a table with is simply a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct number of IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (count distinct user from 45 days ago until today).
I have experimented with window functions, but can't figure out how to define a range based on the date values in a column. Instead, I believe the following query would work in a database like MySQL, but does not in BigQuery.
SELECT
day,
(SELECT
COUNT(DISTINCT visid)
FROM daily_users
WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How to do this in BigQuery?
BigQuery documentation claims that count(distinct) works as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method would adds a record for each date after a visit:
select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
from daily_users u cross join
(select 1 as n union all select 2 union all . . .
select 45
) n
) u
group by theday;
Note: there may be simpler ways to generate a series of 45 integers in BigQuery.
Below should work with BigQuery
#legacySQL
SELECT day, active_users FROM (
SELECT
day,
COUNT(DISTINCT id)
OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
FROM daily_users
)
) GROUP BY 1, 2 ORDER BY 1
Above assumes that day field is represented as '2016-01-10' format.
If it is not a case , you should adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in most inner select
Also please take a look at COUNT(DISTINC) specifics in BigQuery
Update for BigQuery Standard SQL
#standardSQL
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
You can test / play with it using below dummy sample
#standardSQL
WITH daily_users AS (
SELECT 1 AS id, '2016-01-10' AS day UNION ALL
SELECT 2 AS id, '2016-01-10' AS day UNION ALL
SELECT 1 AS id, '2016-01-11' AS day UNION ALL
SELECT 3 AS id, '2016-01-11' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1