SQL sum over partition for preceding period

I have the following table, which represents customers per day:
+----------+-----------+
| Date     | Customers |
+----------+-----------+
| 1/1/2014 | 4         |
| 1/2/2014 | 7         |
| 1/3/2014 | 5         |
| 1/4/2014 | 5         |
| 1/5/2014 | 10        |
| 2/1/2014 | 7         |
| 2/2/2014 | 4         |
| 2/3/2014 | 1         |
| 2/4/2014 | 5         |
+----------+-----------+
I would like to add 2 additional columns:
- the sum of the customers for the current month
- the sum of the customers for the preceding month
Here's the desired outcome:
+----------+-----------+----------------------+------------------------+
| Date     | Customers | Sum_of_Current_month | Sum_of_Preceding_month |
+----------+-----------+----------------------+------------------------+
| 1/1/2014 | 4         | 31                   | 0                      |
| 1/2/2014 | 7         | 31                   | 0                      |
| 1/3/2014 | 5         | 31                   | 0                      |
| 1/4/2014 | 5         | 31                   | 0                      |
| 1/5/2014 | 10        | 31                   | 0                      |
| 2/1/2014 | 7         | 17                   | 31                     |
| 2/2/2014 | 4         | 17                   | 31                     |
| 2/3/2014 | 1         | 17                   | 31                     |
| 2/4/2014 | 5         | 17                   | 31                     |
+----------+-----------+----------------------+------------------------+
I have managed to calculate the 3rd column with a simple SUM OVER (PARTITION BY ...) window function:
Select
    Date,
    Customers,
    Sum(Customers) Over (Partition By Year(Date), Month(Date)) As Sum_of_Current_month
From table
However, I can't find a way to calculate the Sum_of_preceding_month column.
Appreciate your support.
Asaf

The previous month is a bit tricky. What's your Teradata release? TD14.10 supports LAST_VALUE:
SELECT
   dt,
   customers,
   Sum_of_Current_month,
   -- return the previous sum
   COALESCE(LAST_VALUE(x IGNORE NULLS)
            OVER (ORDER BY dt
                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
           , 0) AS Sum_of_Preceding_month
FROM
 (
   SELECT
      dt,
      Customers,
      SUM(Customers) OVER (PARTITION BY TRUNC(dt,'mon')) AS Sum_of_Current_month,
      CASE -- keep the number only for the last day in month
         WHEN ROW_NUMBER()
              OVER (PARTITION BY TRUNC(dt,'mon')
                    ORDER BY dt)
            = COUNT(*)
              OVER (PARTITION BY TRUNC(dt,'mon'))
         THEN Sum_of_Current_month
      END AS x
   FROM tab
 ) AS dt
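For the sample data, x is NULL everywhere except on the last day of each month (31 on 1/5/2014, 17 on 2/4/2014), so the LAST_VALUE ... IGNORE NULLS window carries the previous month's total into every row of the following month, and the COALESCE turns January's missing value into 0.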

I think this might be easier by using lag() and an aggregation sub-query. The ANSI Standard syntax is:
Select t.*, tt.sumCustomers, tt.prev_sumCustomers
From table t join
     (select extract(year from date) as yyyy, extract(month from date) as mm,
             sum(Customers) as sumCustomers,
             lag(sum(Customers)) over (order by extract(year from date), extract(month from date)
                                      ) as prev_sumCustomers
      from table t
      group by extract(year from date), extract(month from date)
     ) tt
     on extract(year from t.date) = tt.yyyy and extract(month from t.date) = tt.mm;
In Teradata, which historically has not supported lag(), the same thing can be written with min() over a frame of exactly one preceding row (any aggregate over that single row simply returns its value):
Select t.*, tt.sumCustomers, tt.prev_sumCustomers
From table t join
     (select extract(year from date) as yyyy, extract(month from date) as mm,
             sum(Customers) as sumCustomers,
             min(sum(Customers)) over (order by extract(year from date), extract(month from date)
                                       rows between 1 preceding and 1 preceding
                                      ) as prev_sumCustomers
      from table t
      group by extract(year from date), extract(month from date)
     ) tt
     on extract(year from t.date) = tt.yyyy and extract(month from t.date) = tt.mm;

Try this:
SELECT
    [Date],
    [Customers],
    (SELECT SUM(Customers) FROM table
     WHERE YEAR([Date]) = YEAR(tbl.[Date]) AND MONTH([Date]) = MONTH(tbl.[Date])),
    ISNULL((SELECT SUM(Customers) FROM table
            WHERE YEAR([Date]) = YEAR(DATEADD(MONTH, -1, tbl.[Date]))
              AND MONTH([Date]) = MONTH(DATEADD(MONTH, -1, tbl.[Date]))), 0)
FROM table tbl


SQL query grouping by range

I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 | 1     | N  | 100    |
| 2020 | 2     | N  | 100    |
| 2020 | 3     | O  | 100    |
| 2020 | 4     | N  | 100    |
| 2020 | 5     | N  | 100    |
| 2020 | 6     | O  | 100    |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM    | TO      | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 | 200    |
| 2020-03 | 2020-03 | 100    |
| 2020-04 | 2020-05 | 200    |
| 2020-06 | 2020-06 | 100    |
+---------+---------+--------+
My DB is DB2/400.
I have tried ROW_NUMBER with partitioning and subqueries, but I can't figure out how to solve this.
I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*, year * 100 + month year_month,
        row_number() over(order by year, month) rn1,
        row_number() over(partition by pa order by year, month) rn2
    from a
) a
group by pa, rn1 - rn2
order by year_month_start
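To see why the difference identifies the islands, here is what the inner query yields for the sample data (hand-computed; rn1 is the overall row number, rn2 the row number within each PA):
+------+-------+----+-----+-----+-----------+
| YEAR | MONTH | PA | rn1 | rn2 | rn1 - rn2 |
+------+-------+----+-----+-----+-----------+
| 2020 | 1     | N  | 1   | 1   | 0         |
| 2020 | 2     | N  | 2   | 2   | 0         |
| 2020 | 3     | O  | 3   | 1   | 2         |
| 2020 | 4     | N  | 4   | 3   | 1         |
| 2020 | 5     | N  | 5   | 4   | 1         |
| 2020 | 6     | O  | 6   | 2   | 4         |
+------+-------+----+-----+-----+-----------+
Within a run of adjacent rows with the same PA, both row numbers advance in step, so their difference stays constant; grouping by PA together with that difference therefore isolates each island.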
You can try the below -
select min(year)||'-'||min(month) as from_date, max(year)||'-'||max(month) as to_date, sum(amount) as amount
from
(
    select *, row_number() over(order by year, month) -
              row_number() over(partition by pa order by year, month) as grprn
    from t1
) A
group by grprn, pa
order by grprn
This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
    , MAX(Dte) [To]
    -- , PA
    , SUM(Amount)
FROM (
    SELECT year * 100 + month Dte
        , Pa
        , Amount
        , ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month) +
          10000 - (Year * 100 + Month) rn
    FROM tabA a
) b
GROUP BY Pa
    , rn
ORDER BY [From]
    , [To]
The trick is the ROW_NUMBER function, partitioned by PA and ordered by date. It counts up by one for each month; when added to a value that descends by one for each month (10000 - (year*100 + month)), it yields the same number for consecutive months with the same PA. You then group by PA and the grouping key you made, rn, to get the groups, and then Bob's your uncle.

Group by month, day, hour + gaps and islands problem

I need to calculate (as a percentage) how long the status was true during a day, hour, or month (working_time).
I have simplified my table to this one:
| date                     | status |
|--------------------------|--------|
| 2018-11-05T19:04:21.125Z | true   |
| 2018-11-05T19:04:22.125Z | true   |
| 2018-11-05T19:04:23.125Z | true   |
| 2018-11-05T19:04:24.125Z | false  |
| 2018-11-05T19:04:25.125Z | true   |
...
I need to get this as a result (depending on a parameter):
for hours:
| date                     | working_time |
|--------------------------|--------------|
| 2018-11-05T00:00:00.000Z | 14           |
| 2018-11-05T01:00:00.000Z | 15           |
| 2018-11-05T02:00:00.000Z | 32           |
| ...                      | ...          |
| 2018-11-05T23:00:00.000Z | 13           |
for months:
| date                     | working_time |
|--------------------------|--------------|
| 2018-01-01T00:00:00.000Z | 14           |
| 2018-02-01T00:00:00.000Z | 15           |
| 2018-03-01T00:00:00.000Z | 32           |
| ...                      | ...          |
| 2018-12-01T00:00:00.000Z | 13           |
My SQL query looks like this:
SELECT date_trunc('month', date) as date,
round((EXTRACT(epoch from sum(time_diff)) / 25920) :: numeric, 2) as working_time
FROM (SELECT date,
status as current_status,
(lag(status, 1) OVER (ORDER BY date)) AS previous_status,
(date -(lag(date, 1) OVER (ORDER BY date))) AS time_diff
FROM table
) as raw_data
WHERE current_status = TRUE AND previous_status = TRUE
GROUP BY date_trunc('month', date)
ORDER BY date;
and it works OK, but it is really slow. Any ideas about optimisation? Maybe using the ROW_NUMBER() function?
Try this:
SELECT t.month_reference AS date,
       round((sum(CASE WHEN t_aux.status THEN 1 ELSE 0 END) / 25920.0)::numeric, 2) AS working_time
       -- I assume you use 25920 because it is the uptime of the system (60*18*24);
       -- I would use this if I wanted the total seconds in the month:
       -- 60*60*24*extract(day from (t.month_reference + interval '1 month' - interval '1 day'))
FROM (SELECT DISTINCT date_trunc('month', date) AS month_reference
      FROM table
     ) AS t
LEFT JOIN table t_aux
       ON t.month_reference = date_trunc('month', t_aux.date)
      -- so when we group by month, the sum() will only find the rows
      -- that are true and have the referenced month
      AND t_aux.date <
          (SELECT t1.date
           FROM table t1
           WHERE t.month_reference = date_trunc('month', t1.date)
             AND t1.status = false
           ORDER BY t1.date ASC LIMIT 1)
      -- I add this so it only selects the rows that are true until it
      -- finds a row with status false in the same month reference
GROUP BY t.month_reference
ORDER BY t.month_reference;
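On the performance question itself: most of the cost is usually the full-table sort behind the window function, so narrowing the scan before the window runs tends to help more than rewriting the window logic. A minimal sketch, assuming the table is named events and one day is queried at a time (the table name and date range are assumptions); the divisor 36 plays the same role for an hour that 25920 plays for a 30-day month (3600 seconds per hour / 100):
-- An index on date lets the range scan return rows already ordered.
CREATE INDEX IF NOT EXISTS idx_events_date ON events (date);

SELECT date_trunc('hour', date) AS date,
       round((EXTRACT(epoch from sum(time_diff)) / 36)::numeric, 2) AS working_time
FROM (SELECT date,
             status AS current_status,
             lag(status, 1) OVER (ORDER BY date) AS previous_status,
             date - lag(date, 1) OVER (ORDER BY date) AS time_diff
      FROM events
      WHERE date >= '2018-11-05' AND date < '2018-11-06'  -- narrow the scan first
     ) AS raw_data
WHERE current_status = TRUE AND previous_status = TRUE
GROUP BY date_trunc('hour', date)
ORDER BY date;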

postgresql - cumul. sum active customers by month (removing churn)

I want to create a query to get the cumulative sum by month of our active customers. The tricky thing here is that (unfortunately) some customers churn, so I need to remove them from the cumulative sum in the month they leave us.
Here is a sample of my customers table:
customer_id | begin_date | end_date
-----------------------------------------
          1 | 15/09/2017 |
          2 | 15/09/2017 |
          3 | 19/09/2017 |
          4 | 23/09/2017 |
          5 | 27/09/2017 |
          6 | 28/09/2017 | 15/10/2017
          7 | 29/09/2017 | 16/10/2017
          8 | 04/10/2017 |
          9 | 04/10/2017 |
         10 | 05/10/2017 |
         11 | 07/10/2017 |
         12 | 09/10/2017 |
         13 | 11/10/2017 |
         14 | 12/10/2017 |
         15 | 14/10/2017 |
Here is what I am looking to achieve:
month   | active customers
-----------------------------------------
2017-09 | 7
2017-10 | 6
I've managed to achieve it with the following query. However, I'd like to know if there is a better way.
select
    "begin_date" as "date",
    sum(new_customers.new_customers - COALESCE(churn_customers.churn_customers, 0)) OVER (ORDER BY new_customers."begin_date") as active_customers
FROM (
    select
        date_trunc('month', begin_date)::date as "begin_date",
        count(customer_id) as new_customers
    from customers
    group by 1
) as new_customers
LEFT JOIN (
    select
        date_trunc('month', end_date)::date as "end_date",
        count(customer_id) as churn_customers
    from customers
    where end_date is not null
    group by 1
) as churn_customers on new_customers."begin_date" = churn_customers."end_date"
order by 1;
You may use a CTE to compute the totals of end_dates and then subtract them from the counts of begin dates by using a left join:
WITH edt
AS (
SELECT to_char(end_date, 'yyyy-mm') AS mon
,count(*) AS ct
FROM customers
WHERE end_date IS NOT NULL
GROUP BY to_char(end_date, 'yyyy-mm')
)
SELECT to_char(c.begin_date, 'yyyy-mm') as month
,COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON to_char(c.begin_date, 'yyyy-mm') = edt.mon
GROUP BY to_char(begin_date, 'yyyy-mm')
ORDER BY month;
Results:
| month   | active_customers |
|---------|------------------|
| 2017-09 | 7                |
| 2017-10 | 6                |
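If a true running total is wanted (as the title's "cumulative sum" suggests), a common alternative is to turn each begin_date into a +1 event and each end_date into a -1 event and take a running sum over the monthly net change. A minimal sketch against the same customers table; note that for the sample it returns 7 and then 13 (a cumulative active count), not the per-month figures shown above:
SELECT month,
       sum(sum(net_change)) OVER (ORDER BY month) AS active_customers
FROM (
    SELECT date_trunc('month', begin_date)::date AS month, +1 AS net_change
    FROM customers
    UNION ALL
    SELECT date_trunc('month', end_date)::date, -1
    FROM customers
    WHERE end_date IS NOT NULL
) events
GROUP BY month
ORDER BY month;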

Rolling 90 days active users in BigQuery, improving performance (DAU/MAU/WAU)

I'm trying to get the number of unique events on a specific date, rolling 90/30/7 days back. I've got this working on a limited number of rows with the query below, but for large data sets I get memory errors because the aggregated string becomes massive.
I'm looking for a more effective way of achieving the same result.
Table looks something like this:
+----+------------+--------+
|    | date       | userid |
+----+------------+--------+
| 1  | 2013-05-14 | xxxxx  |
| 2  | 2017-03-14 | xxxxx  |
| 3  | 2018-01-24 | xxxxx  |
| 4  | 2013-03-21 | xxxxx  |
| 5  | 2014-03-19 | xxxxx  |
| 6  | 2015-09-03 | xxxxx  |
| 7  | 2014-02-06 | xxxxx  |
| 8  | 2014-10-30 | xxxxx  |
| .. | ...        | ...    |
+----+------------+--------+
Format of the desired result:
+----+------------+---------------------+----------------------+
|    | date       | active_users_7_days | active_users_90_days |
+----+------------+---------------------+----------------------+
| 1  | 2013-05-14 | 1240                | 34339                |
| 2  | 2017-03-14 | 4334                | 54343                |
| 3  | 2018-01-24 | .....               | .....                |
| 4  | 2013-03-21 | .....               | .....                |
| 5  | 2014-03-19 | .....               | .....                |
| 6  | 2015-09-03 | .....               | .....                |
| 7  | 2014-02-06 | .....               | .....                |
| 8  | 2014-10-30 | .....               | .....                |
| .. | ...        | .....               | .....                |
+----+------------+---------------------+----------------------+
My query looks like this:
#standardSQL
WITH
T1 AS(
SELECT
date,
STRING_AGG(DISTINCT userid) AS IDs
FROM
`consumer.events`
GROUP BY
date ),
T2 AS(
SELECT
date,
STRING_AGG(IDs) OVER(ORDER BY UNIX_DATE(date) RANGE BETWEEN 90 PRECEDING
AND CURRENT ROW) AS IDs
FROM
T1 )
SELECT
date,
(
SELECT
COUNT(DISTINCT (userid))
FROM
UNNEST(SPLIT(IDs)) AS userid) AS NinetyDays
FROM
T2
Counting unique users requires a lot of resources, even more if you want results over a rolling window. For a scalable solution, look into approximate algorithms like HLL++:
https://medium.freecodecamp.org/counting-uniques-faster-in-bigquery-with-hyperloglog-5d3764493a5a
For an exact count, this would work, but it gets slower as the window grows. The trick is to cross join each row with the offsets 1 to 90 (UNNEST(GENERATE_ARRAY(1, 90))), so each user-day is counted in 90 different date groups and every group therefore sees all users active within its 90-day window:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT owner_user_id) unique_90_day_users
, COUNT(DISTINCT IF(i<31,owner_user_id,null)) unique_30_day_users
, COUNT(DISTINCT IF(i<8,owner_user_id,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
The approximate solution produces results much faster (14s vs 366s), but the results are approximate:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
Updated query that gives correct results by removing rows with fewer than 90 days of data (works when no dates are missing):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
You can aggregate per user and then do the sums. What is the aggregation? Take the user's most recent date:
select count(*) as num_users,
       sum(case when date > date_sub(current_date, interval 30 day) then 1 else 0 end) as num_users_30days,
       sum(case when date > date_sub(current_date, interval 60 day) then 1 else 0 end) as num_users_60days,
       sum(case when date > date_sub(current_date, interval 90 day) then 1 else 0 end) as num_users_90days
from (select userid, max(date) as date
      from `consumer.events` e
      group by userid
     ) e;
If the most recent date for the user is in the period, then the user should be counted.
You can get this "as-of" a particular date by using a where clause in the subquery.
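A minimal sketch of that as-of variant, assuming the same `consumer.events` table (the literal date is an assumption taken from the sample data):
select count(*) as num_users,
       sum(case when date > date_sub(date '2018-01-24', interval 30 day) then 1 else 0 end) as num_users_30days
from (select userid, max(date) as date
      from `consumer.events`
      where date <= '2018-01-24'  -- only count activity up to the as-of date
      group by userid
     ) e;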

Redshift count with variable

Imagine I have a table on Redshift with a structure similar to this. Product_Bill_ID is the primary key of this table.
| Store_ID | Product_Bill_ID | Payment_Date
| 1        | 1               | 01/10/2016 11:49:33
| 1        | 2               | 01/10/2016 12:38:56
| 1        | 3               | 01/10/2016 12:55:02
| 2        | 4               | 01/10/2016 16:25:05
| 2        | 5               | 02/10/2016 08:02:28
| 3        | 6               | 03/10/2016 02:32:09
If I want to query the number of Product_Bill_ID that a store sold in the first hour after it sold its first Product_Bill_ID, how could I do this?
This example should produce this outcome:
| Store_ID | First_Payment_Date  | Sold_First_Hour
| 1        | 01/10/2016 11:49:33 | 2
| 2        | 01/10/2016 16:25:05 | 1
| 3        | 03/10/2016 02:32:09 | 1
You need to get the first hour. That is easy enough using window functions:
select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
Then, you need to do the date filtering and aggregation:
select store_id, count(*)
from (select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
) s
where payment_date <= first_payment_date + interval '1 hour'
group by store_id;
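For the sample data this returns 2, 1 and 1: the first bill falls inside its own one-hour window, so it is counted as well.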
SELECT
store_id,
first_payment_date,
SUM(
CASE WHEN payment_date < DATEADD(hour, 1, first_payment_date) THEN 1 END
) AS sold_first_hour
FROM
(
SELECT
*,
MIN(payment_date) OVER (PARTITION BY store_id) AS first_payment_date
FROM
yourtable
)
parsed_table
GROUP BY
store_id,
first_payment_date