How to lag over month and over year at the same time? - sql

I am attempting to find percentage change per each month per company.
Table
Year Mon company_id Revenue
2018 2018-06-01 42 2000
2018 2018-07-01 42 3000
2019 2019-06-01 42 4000
2019 2019-07-01 42 9000
I attempted this and failed.
select *, lag(Revenue) over(partition by company_id order by Year) from table
working to get the bellow result ( the table has multiple company_ids)
Year Mon company_id Revenue percentage change
2018 2018-06-01 42 2000
2018 2018-07-01 42 3000
2019 2019-06-01 42 4000 100
2019 2019-07-01 42 9000 200

Woof!
You need to tickle out the month so that you can partition by it:
WITH subq AS (
SELECT year, mon, company_id, revenue,
lag(revenue) OVER (PARTITION BY company_id, extract(month FROM mon)
ORDER BY year) AS prev_revenue
FROM "table"
)
SELECT year, mon, company_id, revenue,
(revenue - prev_revenue) * 100.0 / prev_revenue AS percent_change
FROM subq;
This assumes that the date stored in mon only serves to identify the month, and that the "year" part of it (which is different from year) is irrelevant.

You may use a window function to get exact the row with offset one year in the past.
max(revenue) over (partition by company_id order by mon range between '1 year' PRECEDING and '1 year' PRECEDING ) revenue_lag
This will work even if you have a missing data for some year (you will not match a two or more years old revenue).
Unfortunately this does not work for me in 14.6 with LAG (not sure if per design), so I'm using MAX.
Example
with tab as (
select * from (values
(2018,date'2018-06-01', 42, 2000),
(2018,date'2018-07-01', 42, 3000),
(2018,date'2018-07-15', 42, 3020),
(2019,date'2019-06-01', 42, 4000),
(2019,date'2019-07-01', 42, 9000)
) tab(year, mon, company_id, revenue)
)
select tab.*,
max(revenue) over (partition by company_id order by mon range between '1 year' PRECEDING and '1 year' PRECEDING ) revenue_lag,
lag(revenue) over (partition by company_id order by mon range between '1 year' PRECEDING and '1 year' PRECEDING ) lag
from tab
;
year|mon |company_id|revenue|revenue_lag|lag |
----+----------+----------+-------+-----------+----+
2018|2018-06-01| 42| 2000| | |
2018|2018-07-01| 42| 3000| |2000|
2018|2018-07-15| 42| 3020| |3000|
2019|2019-06-01| 42| 4000| 2000|3020|
2019|2019-07-01| 42| 9000| 3000|4000|

Related

SQL Bigquery Counting repeated customers from transaction table

I have a transaction table that looks something like this.
userid
orderDate
amount
111
2021-11-01
20
112
2021-09-07
17
111
2021-11-21
17
I want to count how many distinct customers (userid) that bought from our store this month also bought from our store in the previous month. For example, in February 2020, we had 20 customers and out of these 20 customers 7 of them also bought from our store in the previous month, January 2020. I want to do this for all the previous months so ending up with something like.
year
month
repeated customers
2020
01
11
2020
02
7
2020
03
9
I have written this but this only works for only the current month. How would I iterate or rewrite it to get the table as shown above.
WITH CURRENT_PERIOD AS (
SELECT DISTINCT userid
FROM table1
WHERE DATE(orderDate) BETWEEN DATE_TRUNC(CURRENT_DATE(),MONTH) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
),
PREVIOUS_PERIOD AS (
SELECT DISTINCT userid
FROM table1
WHERE DATE(orderDate) BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH),MONTH) AND LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
)
SELECT count(1)
FROM CURRENT_PERIOD RC
WHERE RC.userid IN (SELECT DISTINCT userid FROM PREVIOUS_PERIOD)
You can summarize to get one record per month, use lag(), and then aggregate:
select yyyymm,
countif(prev_yyyymm = date_add(yyyymm, interval -1 month)
from (select userid, date_trunc(order_date, month) as yyyymm,
lag(date_trunc(order_date, month)) over (partition by userid order by date_trunc(order_date, month)) as prev_yyyymm
from table1
group by 1, 2
) t
group by yyyymm
order by yyyymm;

Get last known record per month in BigQuery

Account balance collection, that shows the account balance of a customer at a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice, that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
if to apply to sample data from your question - as it is in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
SELECT 1, -300, '2019-10-11' UNION ALL
SELECT 1, -200, '2019-10-10' UNION ALL
SELECT 2, 200, '2019-09-10' UNION ALL
SELECT 2, 100, '2019-08-11' UNION ALL
SELECT 2, 50, '2019-07-12' UNION ALL
SELECT 1, 600, '2019-09-02'
), customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
result is
Row customer_id value timestamp
1 1 -500 2019-10-12
2 2 200 2019-10-10
3 1 600 2019-09-02
4 2 200 2019-09-10
5 1 null null
6 2 100 2019-08-11
7 1 null null
8 2 50 2019-07-12
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
select
customer_id,
month_end_date
from `project.dataset.table`
cross join month_ends
group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
select
customer_id,
value,
timestamp,
month_end_date
from `project.dataset.table`
inner join user_month_ends using(customer_id)
where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
select
*,
row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
select
* except(my_row)
from ordered
where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
I did not order results to match your desired result set, and I also kept a month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
In the month_begins CTE, I set a range of months into the future, so your result set will contain the most recent balance of 'future months'. To make this a bit prettier, consider changing '2019-12-01' to 'current_date()' and your query will always return to the end of the current month.
Your timestamp field looks to be dates, so I used date logic, but you should be able to apply the same principles to use timestamp logic if your underlying fields are actual timestamps.
In your result set, I'm not sure why your 2nd row (customer 2) would have a timestamp of '2019-10-10', that seems arbitrary as customer 2 has no 2nd balance record.
I purposefully split the logic into several CTEs so I could comment on each step easier, you could definitely perform several steps in the same code block for a more condensed query.

Max value by ID, date and last x days

Supposed I have a table :
---------------
id | date | value
------------------
1 | Jan 1 | 10
1 | Jan 2 | 12
1 | Jan 3 | 11
2 | Jan 4 | 11
I need to get the max and median value of each id, each date, each for the past 90 days. Im using query :
select id, date, value
max(value) over (partition by id, date) as max_date,
median(value) over (partition by id, date) as med_date
from table
where date > date - interval '90 days'
I tried to export the data and check manually but the result is not correct. Any thing I missed? thanks
expected output is to get maximum value of since the last 90 days. for example the date is April 5th, then it will find the maximum value from Jan 5th (the last 90 days) until April 5th. and then the date moves to April 6th, then it will do again for jan 6th until April 6h and so on for each ID
So im assuming u can get several values for same ID and Date and right ? otherwise partitioning for both id and date makes no sense
SELECT id, date, max(value), avg(value) from table where date > date - interval '90 days'
group by id, value
'group by' does the partitioning
Why are you using window functions? This seems to do what you describe:
select id,
max(value) as max_date,
percentile_disc(0.5) within group (order by value) as median_value
from table
where date > date - interval '90 days';
If you want this per date, use window functions:
select t.*
from (select t.*,
max(value) over (order by date range between '89 day' preceding and current row) as running_max_value,
percentile_disc(0.5) within group (order by value) range between '89 day' preceding and current row) as running_median_value
from t
) t
where date > date - interval '90 days';
The filter is in the outer query so the preceding period can go back further in time.

postgreSQL- Count for value between previous month start date and end date

I have a table as follows
user_id date month year visiting_id
123 11-04-2017 APRIL 2017 4500
123 12-05-2017 MAY 2017 4567
123 13-05-2017 MAY 2017 4568
123 17-05-2017 MAY 2017 4569
123 22-05-2017 MAY 2017 4570
123 11-06-2017 JUNE 2017 4571
123 12-06-2017 JUNE 2017 4572
I want to calculate the visiting count for the current month and last month at the monthly level as follows:
user_id month year visit_count_this_month visit_count_last_month
123 APRIL 2017 1 0
123 MAY 2017 4 1
123 JUNE 2017 2 4
I was able to calculate visit_count_this_month using the following query
SELECT v.user_id, v.month, v.year,
SUM(is_visit_this_month) as visit_count_this_month
FROM
(SELECT user_id, date, month, year,
CASE WHEN TO_CHAR(date, 'MM/YYYY') = TO_CHAR(date, 'MM/YYYY')
THEN 1 ELSE 0
END as is_visit_this_month
FROM visits
GROUP BY user_id, date, month, year
HAVING user_id = 123) v
GROUP BY v.user_id, v.month, v.year
However, I'm stuck with calculating visit_count_last_month. Similar to this, I also want to calculate visit_count_last_2months.
Can somebody help?
You can use a LATERAL JOIN like this:
SELECT user_id, month, year, COUNT(*) as visit_count_this_month, visit_count_last_month
FROM visits v
CROSS JOIN LATERAL (
SELECT COUNT(*) as visit_count_last_month
FROM visits
WHERE user_id = v.user_id
AND date = (CAST(v.date AS date) - interval '1 month')
) l
GROUP BY user_id, month, year, visit_count_last_month;
SQLFiddle - http://sqlfiddle.com/#!15/393c8/2
Assuming there are values for every month, you can get the counts per month first and use lag to get the previous month's values per user.
SELECT T.*
,COALESCE(LAG(visits,1) OVER(PARTITION BY USER_ID ORDER BY year,mth),0) as last_month_visits
,COALESCE(LAG(visits,2) OVER(PARTITION BY USER_ID ORDER BY year,mth),0) as last_2_month_visits
FROM (
SELECT user_id, extract(month from date) as mth, year, COUNT(*) as visits
FROM visits
GROUP BY user_id, extract(month from date), year
) T
If there can be missing months, it is best to generate all months within a specified timeframe and left join ing the table on to that. (This example shows it for all the months in 2017).
select user_id,yr,mth,visits
,coalesce(lag(visits,1) over(PARTITION BY USER_ID ORDER BY yr,mth),0) as last_month_visits
,coalesce(lag(visits,2) OVER(PARTITION BY USER_ID ORDER BY yr,mth),0) as last_2_month_visits
from (select u.user_id,extract(year from d.dt) as yr, extract(month from d.dt) as mth,count(v.visiting_id) as visits
from generate_series(date '2017-01-01', date '2017-12-31',interval '1 month') d(dt)
cross join (select distinct user_id from visits) u
left join visits v on extract(month from v.dt)=extract(month from d.dt) and extract(year from v.dt)=extract(year from d.dt) and u.user_id=v.user_id
group by u.user_id,extract(year from d.dt), extract(month from d.dt)
) t

Cumulative sum of values by month, filling in for missing months

I have this data table and I'm wondering if is possible create a query that get a cumulative sum by month considering all months until the current month.
date_added | qty
------------------------------------
2015-08-04 22:28:24.633784-03 | 1
2015-05-20 20:22:29.458541-03 | 1
2015-04-08 14:16:09.844229-03 | 1
2015-04-07 23:10:42.325081-03 | 1
2015-07-06 18:50:30.164932-03 | 1
2015-08-22 15:01:54.03697-03 | 1
2015-08-06 18:25:07.57763-03 | 1
2015-04-07 23:12:20.850783-03 | 1
2015-07-23 17:45:29.456034-03 | 1
2015-04-28 20:12:48.110922-03 | 1
2015-04-28 13:26:04.770365-03 | 1
2015-05-19 13:30:08.186289-03 | 1
2015-08-06 18:26:46.448608-03 | 1
2015-08-27 16:43:06.561005-03 | 1
2015-08-07 12:15:29.242067-03 | 1
I need a result like that:
Jan|0
Feb|0
Mar|0
Apr|5
May|7
Jun|7
Jul|9
Aug|15
This is very similar to other questions, but the best query is still tricky.
Basic query to get the running sum quickly:
SELECT to_char(date_trunc('month', date_added), 'Mon YYYY') AS mon_text
, sum(sum(qty)) OVER (ORDER BY date_trunc('month', date_added)) AS running_sum
FROM tbl
GROUP BY date_trunc('month', date_added)
ORDER BY date_trunc('month', date_added);
The tricky part is to fill in for missing months:
WITH cte AS (
SELECT date_trunc('month', date_added) AS mon, sum(qty) AS mon_sum
FROM tbl
GROUP BY 1
)
SELECT to_char(mon, 'Mon YYYY') AS mon_text
, sum(c.mon_sum) OVER (ORDER BY mon) AS running_sum
FROM (SELECT min(mon) AS min_mon FROM cte) init
, generate_series(init.min_mon, now(), interval '1 month') mon
LEFT JOIN cte c USING (mon)
ORDER BY mon;
The implicit CROSS JOIN LATERAL requires Postgres 9.3+. This starts with the first month in the table.
To start with a given month:
WITH cte AS (
SELECT date_trunc('month', date_added) AS mon, sum(qty) AS mon_sum
FROM tbl
GROUP BY 1
)
SELECT to_char(mon, 'Mon YYYY') AS mon_text
, COALESCE(sum(c.mon_sum) OVER (ORDER BY mon), 0) AS running_sum
FROM generate_series('2015-01-01'::date, now(), interval '1 month') mon
LEFT JOIN cte c USING (mon)
ORDER BY mon;
db<>fiddle here
Old sqlfiddle
Keeping months from different years apart. You did not ask for that, but you'll most likely want it.
Note that the "month" to some degree depends on the time zone setting of the current session! Details:
Ignoring time zones altogether in Rails and PostgreSQL
Related:
Calculating Cumulative Sum in PostgreSQL
PostgreSQL: running count of rows for a query 'by minute'
Postgres window function and group by exception