Calculate standard deviation over time - SQL

I have information about sales per day. For example:
Date - Product - Amount
01-07-2020 - A - 10
01-03-2020 - A - 20
01-02-2020 - B - 10
Now I would like to know the average sales per day and the standard deviation for the last year. For the average I can just count the number of entries per product, take 365 minus that count, and pad with that many 0's, but I wonder what the best way is to calculate the standard deviation while incorporating the 0's for the days with no sales.
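As a tiny worked check of that padding idea (assuming a flat 365-day year, as in the question; product A has 2 entries, so 363 padded zeroes):
-- Product A: 10 + 20 = 30 units spread over 365 days (363 of them zero),
-- so the zero-padded daily average is 30/365, roughly 0.0822.
SELECT (10 + 20) / 365 AS avg_sales_per_day FROM DUAL;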

Use a hierarchical (or recursive) query to generate the daily dates for the year, then use a PARTITION OUTER JOIN to join it to your product data. You can then find the average and standard deviation with the AVG and STDDEV aggregation functions, using COALESCE to fill in NULL values with zeroes:
WITH start_date ( dt ) AS (
  SELECT DATE '2020-01-01' FROM DUAL
),
calendar ( dt ) AS (
  SELECT dt + LEVEL - 1
  FROM   start_date
  CONNECT BY dt + LEVEL - 1 < ADD_MONTHS( dt, 12 )
)
SELECT product,
       AVG( COALESCE( amount, 0 ) ) AS average_sales_per_day,
       STDDEV( COALESCE( amount, 0 ) ) AS stddev_sales_per_day
FROM   calendar c
       LEFT OUTER JOIN (
         SELECT t.*
         FROM   test_data t
                INNER JOIN start_date s
                ON (  s.dt <= t."DATE"
                  AND t."DATE" < ADD_MONTHS( s.dt, 12 ) )
       ) t
       PARTITION BY ( t.product )
       ON ( c.dt = t."DATE" )
GROUP BY product
So, for your sample data:
CREATE TABLE test_data ( "DATE", Product, Amount ) AS
SELECT DATE '2020-07-01', 'A', 10 FROM DUAL UNION ALL
SELECT DATE '2020-03-01', 'A', 20 FROM DUAL UNION ALL
SELECT DATE '2020-02-01', 'B', 10 FROM DUAL;
This outputs:
PRODUCT | AVERAGE_SALES_PER_DAY                      | STDDEV_SALES_PER_DAY
:------ | -----------------------------------------: | -----------------------------------------:
A       | .0819672131147540983606557377049180327869  | 1.16752986363678031669548047505759328696
B       | .027322404371584699453551912568306010929   | .5227083734893166933219264686616717636897
db<>fiddle here
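Two footnotes on that output: the averages divide by 366 because the generated 2020 calendar is a leap year (30/366 is the .08196... shown for A), and Oracle's STDDEV is the sample (n - 1) standard deviation; STDDEV_POP is the drop-in swap if you want the population figure. A minimal illustration of the difference on toy data:
-- STDDEV uses the sample (n - 1) formula; STDDEV_POP divides by n.
SELECT STDDEV(n) AS sample_sd,        -- 1
       STDDEV_POP(n) AS population_sd -- ~0.8165
FROM ( SELECT 1 AS n FROM DUAL UNION ALL
       SELECT 2 FROM DUAL UNION ALL
       SELECT 3 FROM DUAL );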

Use SQL to get monthly churn count and churn rate

Currently using Postgres 9.5
I want to calculate monthly churn_count and churn_rate of the search function.
churn_count: number of users who used the search function last month but not this month
churn_rate: churn_count/total_users_last_month
My dummy data is:
CREATE TABLE yammer_events (
occurred_at TIMESTAMP,
user_id INT,
event_name VARCHAR(50)
);
INSERT INTO yammer_events (occurred_at, user_id, event_name) VALUES
('2014-06-01 00:00:01', 1, 'search_autocomplete'),
('2014-06-01 00:00:01', 2, 'search_autocomplete'),
('2014-07-01 00:00:01', 1, 'search_run'),
('2014-07-01 00:00:02', 1, 'search_run'),
('2014-07-01 00:00:01', 2, 'search_run'),
('2014-07-01 00:00:01', 3, 'search_run'),
('2014-08-01 00:00:01', 1, 'search_run'),
('2014-08-01 00:00:01', 4, 'search_run');
Ideal output should be:
| month      | churn_count | churn_rate_percentage |
| ---        | ---         | ---                   |
| 2014-07-01 | 0           | 0                     |
| 2014-08-01 | 2           | 66.6                  |
In June: user 1, 2 (2 users)
In July: user 1, 2, 3 (3 users)
In August: user 1, 4 (2 users)
In July, we didn't lose any customers. In August, we lost customers 2 and 3, so the churn_count is 2, and the rate is 2/3*100 = 66.6.
I tried the following query to calculate churn_count, but the result is really weird.
WITH monthly_activity AS (
SELECT distinct DATE_TRUNC('month', occurred_at) AS month,
user_id
FROM yammer_events
WHERE event_name LIKE 'search%'
)
SELECT last_month.month+INTERVAL '1 month', COUNT(DISTINCT last_month.user_id)
FROM monthly_activity last_month
LEFT JOIN monthly_activity this_month
ON last_month.user_id = this_month.user_id
AND this_month.month = last_month.month + INTERVAL '1 month'
AND this_month.user_id IS NULL
GROUP BY 1
db<>fiddle
Thank you in advance!
An easy way to do it would be to aggregate the users into an array and then, using the window function LAG() and the intarray difference operator, count the users that appear in the previous month but not in the current one, e.g.
WITH j AS (
SELECT date_trunc('month',occurred_at::date) AS month,
array_agg(distinct user_id) AS users,
count(distinct user_id) AS total_users
FROM yammer_events
GROUP BY 1
ORDER BY 1
)
SELECT month::date,
cardinality(LAG(users) OVER w - users) AS churn_count,
(cardinality(LAG(users) OVER w - users)::numeric /
LAG(total_users) OVER w::numeric) * 100 AS churn_rate_percentage
FROM j
WINDOW w AS (ORDER BY month
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW);
month | churn_count | churn_rate_percentage
------------+-------------+-------------------------
2014-06-01 | |
2014-07-01 | 0 | 0.00000000000000000000
2014-08-01 | 2 | 66.66666666666666666700
(3 rows)
Note: this query relies on the extension intarray. In case you don't have it in your system, just hit:
CREATE EXTENSION intarray;
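If you're curious what the intarray - operator does in the query above, a quick standalone check (assuming the extension is installed):
-- intarray's int[] - int[] removes the right array's elements from the
-- left, i.e. it yields the users seen last month but absent this month.
SELECT ARRAY[1,2,3] - ARRAY[1,4] AS churned;  -- {2,3}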
Demo: db<>fiddle
Alternatively, without relying on an extension, you can left join each month's activity to the following month's and count the users with no match:
WITH monthly_activity AS (
SELECT distinct DATE_TRUNC('month', occurred_at) AS month,
user_id
FROM yammer_events
WHERE event_name LIKE 'search%'
)
SELECT
last_month.month+INTERVAL '1 month',
SUM(CASE WHEN this_month.month IS NULL THEN 1 ELSE 0 END) AS churn_count,
SUM(CASE WHEN this_month.month IS NULL THEN 1 ELSE 0 END)*1.00/COUNT(DISTINCT last_month.user_id)*100 AS churn_rate_percentage
FROM monthly_activity last_month
LEFT JOIN monthly_activity this_month
ON last_month.month + INTERVAL '1 month' = this_month.month
AND last_month.user_id = this_month.user_id
GROUP BY 1
ORDER BY 1
LIMIT 2
I think my way is more circuitous but easier for beginners to understand (the LIMIT 2 drops the artificial extra month that the last month's data would otherwise project forward). Just for your reference.

Oracle SQL calculate average/opening/closing balances from discrete data

I have account balances like this
acc_no balance balance_date
account1 5000 2020-01-01
account1 6000 2020-01-05
account2 3000 2020-01-01
account1 3500 2020-01-08
account2 7500 2020-01-15
The effective balance for any day without a balance entry is equal to the last balance, e.g. account1's balance on 2, 3 and 4 Jan is 5000, etc.
I would like to produce the daily average, opening and closing balance from this data for any period. I came up with the following query and it works, but it takes half an hour when I run it against the full data set. Is my approach correct, or is there a more efficient method?
WITH cte_period
AS (
SELECT '2020-01-01' date_from
,'2020-01-31' date_to
FROM dual
)
,cte_calendar
AS (
SELECT rownum
,(
SELECT to_date(date_from, 'YYYY-MM-DD')
FROM cte_period
) + rownum - 1 AS balance_day
FROM dual connect BY rownum <= (
SELECT to_date(date_to, 'YYYY-MM-DD')
FROM cte_period
) - (
SELECT to_date(date_from, 'YYYY-MM-DD')
FROM cte_period
) + 1
)
,cte_balances
AS (
SELECT 'account1' acc_no
,5000 balance
,to_date('2020-01-01', 'YYYY-MM-DD') sys_date
FROM dual
UNION ALL
SELECT 'account1'
,6000
,to_date('2020-01-05', 'YYYY-MM-DD')
FROM dual
UNION ALL
SELECT 'account2'
,3000
,to_date('2020-01-01', 'YYYY-MM-DD')
FROM dual
UNION ALL
SELECT 'account1'
,3500
,to_date('2020-01-08', 'YYYY-MM-DD')
FROM dual
UNION ALL
SELECT 'account2'
,7500
,to_date('2020-01-15', 'YYYY-MM-DD')
FROM dual
)
,cte_accounts
AS (
SELECT DISTINCT acc_no
FROM cte_balances
)
SELECT t.acc_no
,(
SELECT eff_bal
FROM (
SELECT cal.balance_day
,acc_nos.acc_no
,(
SELECT balance
FROM cte_balances bal
WHERE bal.sys_date <= cal.balance_day
AND acc_nos.acc_no = bal.acc_no
ORDER BY bal.sys_date DESC FETCH first 1 row ONLY
) eff_bal
FROM cte_calendar cal
CROSS JOIN cte_accounts acc_nos
) t1
WHERE balance_day = (
SELECT to_date(date_from, 'YYYY-MM-DD')
FROM cte_period
)
AND t.acc_no = t1.acc_no
) opening_bal
,(
SELECT eff_bal
FROM (
SELECT cal.balance_day
,acc_nos.acc_no
,(
SELECT balance
FROM cte_balances bal
WHERE bal.sys_date <= cal.balance_day
AND acc_nos.acc_no = bal.acc_no
ORDER BY bal.sys_date DESC FETCH first 1 row ONLY
) eff_bal
FROM cte_calendar cal
CROSS JOIN cte_accounts acc_nos
) t1
WHERE balance_day = (
SELECT to_date(date_to, 'YYYY-MM-DD')
FROM cte_period
)
AND t.acc_no = t1.acc_no
) closing_bal
,round(avg(eff_bal), 2) avg_bal
FROM (
SELECT cal.balance_day
,acc_nos.acc_no
,(
SELECT balance
FROM cte_balances bal
WHERE bal.sys_date <= cal.balance_day
AND acc_nos.acc_no = bal.acc_no
ORDER BY bal.sys_date DESC FETCH first 1 row ONLY
) eff_bal
FROM cte_calendar cal
CROSS JOIN cte_accounts acc_nos
) t
GROUP BY acc_no
order by acc_no
The expected result
ACC_NO OPENING_BAL CLOSING_BAL AVG_BAL
account1 5000 3500 3935.48
account2 3000 7500 5467.74
Yes. You are unnecessarily selecting from the same table many times. Produce the calendar as you did, join it with your data partitioned by account, and use analytic functions for the computations:
select acc_no, round(avg(bal), 2) av_bal,
max(bal) keep (dense_rank first order by day) op_bal,
max(bal) keep (dense_rank last order by day) cl_bal
from (
select acc_no, day,
nvl(balance, lag(balance) ignore nulls over (partition by acc_no order by day)) bal
from (
select date_from + level - 1 as day
from (select date '2020-01-01' date_from, date '2020-01-31' date_to from dual)
connect by date_from + level - 1 <= date_to)
left join cte_balances partition by (acc_no) on day = sys_date)
group by acc_no
dbfiddle
Edit:
sometimes the first day of the month has no balance entry, it should
take it from the last available
We have to treat the first row in a special way. This is done in the subquery data: when the first row has a null balance, I run a correlated subquery that looks up the balance from the latest previous date.
with
cte_calendar as (
select level lvl, date_from + level - 1 as day
from (select date '2020-01-01' date_from, date '2020-01-31' date_to from dual)
connect by date_from + level - 1 <= date_to),
data as (
select lvl, day, acc_no,
case when balance is null and lvl = 1
then (select max(balance) keep (dense_rank last order by sys_date)
from cte_balances a
where a.acc_no = b.acc_no and a.sys_date <= day)
else balance
end bal
from cte_calendar
left join cte_balances b partition by (acc_no) on day = sys_date)
select acc_no,
max(bal) keep (dense_rank first order by day) op_bal,
max(bal) keep (dense_rank last order by day) cl_bal,
round(avg(bal), 2)
from (
select acc_no, day,
nvl(bal, lag(bal) ignore nulls over (partition by acc_no order by day)) bal
from data)
group by acc_no
dbfiddle
although I don't understand it yet
There are three things which are not obvious here and which you should know to understand the query (a sketch follows the list):
partitioned outer join - the main part of the solution, which produces the whole period for each account. You can read about them here, for instance;
lag() ignore nulls - fills null balance values, taking them from the previous non-null row;
max(bal) keep (dense_rank first order by day) - takes the balance value from the first date, for the opening balance; last takes it from the last row, for the closing balance.
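A minimal sketch of the last two constructs in isolation, on assumed toy data rather than the question's table:
-- lag() ignore nulls carries the last non-null value forward;
-- keep (dense_rank first/last ...) picks the value from the first/last day.
with t as (
  select date '2020-01-01' as day, 10 as val from dual union all
  select date '2020-01-02', null from dual union all
  select date '2020-01-03', 30 from dual
)
select day,
       nvl(val, lag(val) ignore nulls over (order by day)) as filled,   -- 10, 10, 30
       max(val) keep (dense_rank first order by day) over () as opening, -- 10
       max(val) keep (dense_rank last order by day) over () as closing   -- 30
from t;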
If you can afford to use the first_value and last_value analytic functions, then this, based on my understanding of your description, may help:
with data as (
select 'account1' as acc, 5000 as balance, to_date('2020-01-01', 'YYYY-MM-DD') as d from dual
union all select 'account1' as acc, 6000 as balance, to_date('2020-01-05', 'YYYY-MM-DD') as d from dual
union all select 'account2' as acc, 3000 as balance, to_date('2020-01-01', 'YYYY-MM-DD') as d from dual
union all select 'account1' as acc, 3500 as balance, to_date('2020-01-08', 'YYYY-MM-DD') as d from dual
union all select 'account1' as acc, 7500 as balance, to_date('2020-01-15', 'YYYY-MM-DD') as d from dual
)
select acc, avg(balance) over (partition by acc order by balance) as average,
first_value(balance) over(partition by acc order by balance asc rows unbounded preceding) as first,
last_value(balance) over(partition by acc order by balance asc rows unbounded preceding) as last
from data
where d between to_date('2020-01-01', 'YYYY-MM-DD') and to_date('2020-01-06', 'YYYY-MM-DD')
order by acc
ACC | AVERAGE | FIRST | LAST
:------- | ------: | ----: | ---:
account1 | 5000 | 5000 | 5000
account1 | 5500 | 5000 | 6000
account2 | 3000 | 3000 | 3000
db<>fiddle here
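One caveat on last_value above: with rows unbounded preceding the window frame ends at the current row, so LAST is a running value rather than the partition's final balance. A standalone sketch of the difference, on toy data:
-- With the default/running frame, last_value returns the current row's
-- value; extending the frame to the whole partition gives the true last.
with t as (
  select 1 as n from dual union all
  select 2 from dual union all
  select 3 from dual
)
select n,
       last_value(n) over (order by n) as running_last,  -- 1, 2, 3
       last_value(n) over (order by n
         rows between unbounded preceding and unbounded following) as true_last  -- 3, 3, 3
from t;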

Get last known record per month in BigQuery

An account balance collection that shows the account balance of a customer on a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice, that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
If applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
SELECT 1, -300, '2019-10-11' UNION ALL
SELECT 1, -200, '2019-10-10' UNION ALL
SELECT 2, 200, '2019-09-10' UNION ALL
SELECT 2, 100, '2019-08-11' UNION ALL
SELECT 2, 50, '2019-07-12' UNION ALL
SELECT 1, 600, '2019-09-02'
), customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
result is
Row customer_id value timestamp
1 1 -500 2019-10-12
2 2 200 2019-10-10
3 1 600 2019-09-02
4 2 200 2019-09-10
5 1 null null
6 2 100 2019-08-11
7 1 null null
8 2 50 2019-07-12
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
select
customer_id,
month_end_date
from `project.dataset.table`
cross join month_ends
group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
select
customer_id,
value,
timestamp,
month_end_date
from `project.dataset.table`
inner join user_month_ends using(customer_id)
where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
select
*,
row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
select
* except(my_row)
from ordered
where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
I did not order results to match your desired result set, and I also kept a month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
In the month_begins CTE, I set a range of months into the future, so your result set will contain the most recent balance of 'future months'. To make this a bit prettier, consider changing '2019-12-01' to current_date() and the query will always run through the end of the current month (see the sketch after these caveats).
Your timestamp field looks to be dates, so I used date logic, but you should be able to apply the same principles to use timestamp logic if your underlying fields are actual timestamps.
In your result set, I'm not sure why your 2nd row (customer 2) would have a timestamp of '2019-10-10', that seems arbitrary as customer 2 has no 2nd balance record.
I purposefully split the logic into several CTEs so I could comment on each step easier, you could definitely perform several steps in the same code block for a more condensed query.
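A sketch of that month_begins variant, assuming you always want the series to extend through the current month:
-- month_begins that ends at the first day of the current month; the
-- month_ends CTE built on top of it then lands on the current month end.
with month_begins as (
  select dt
  from unnest(generate_date_array('2019-01-01',
                                  current_date(),
                                  interval 1 month)) dt
)
select * from month_begins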

SQL query needed - Counting 365 days backwards

I have searched the forum many times but couldn't find a solution for my situation. I am working with an Oracle database.
I have a table with all Order Numbers and Customer Numbers by Day. It looks like this:
Day | Customer Nbr | Order Nbr
2018-01-05 | 25687459 | 256
2018-01-09 | 36478592 | 398
2018-03-07 | 25687459 | 1547
and so on....
Now I need an SQL query which gives me a table by day and Customer Nbr, counting the number of unique Order Numbers within the 365 days up to the day in column 1.
For the example above the resulting table should look like:
Day | Customer Nbr | Order Cnt
2019-01-01 | 25687459 | 2
2019-01-02 | 25687459 | 2
...
2019-03-01 | 25687459 | 1
One method is to generate values for all days of interest for each customer and then use a correlated subquery:
with dates as (
      select date '2019-01-01' + rownum as dte from dual
      connect by date '2019-01-01' + rownum < sysdate
     )
select d.dte, t.customer_nbr,
       (select count(*)
        from t t2
        where t2.customer_nbr = t.customer_nbr and
              t2.day <= d.dte and
              t2.day > d.dte - 365
       ) as order_cnt
from dates d cross join
     (select distinct customer_nbr from t) t;
Edit:
I've just seen you clarify the question, which I've interpreted to mean:
For every day in the last year, show how many orders there were for each customer between that date and 1 year previously. Working on an answer now...
Updated Answer:
For each customer, we count the number of records between the order day and 365 days before it...
WITH yourTable AS
(
    SELECT SYSDATE - 1 Day, 'Alex' CustomerNbr FROM DUAL
    UNION ALL
    SELECT SYSDATE - 2, 'Alex' FROM DUAL
    UNION ALL
    SELECT SYSDATE - 366, 'Alex' FROM DUAL
    UNION ALL
    SELECT SYSDATE - 400, 'Alex' FROM DUAL
    UNION ALL
    SELECT SYSDATE - 500, 'Alex' FROM DUAL
    UNION ALL
    SELECT SYSDATE - 1, 'Joe' FROM DUAL
    UNION ALL
    SELECT SYSDATE - 300, 'Chris' FROM DUAL
    UNION ALL
    SELECT SYSDATE - 1, 'Chris' FROM DUAL
)
SELECT Day, CustomerNbr, OrdersLast365Days
FROM yourTable t
OUTER APPLY
(
    SELECT COUNT(1) OrdersLast365Days
    FROM yourTable t2
    WHERE t.CustomerNbr = t2.CustomerNbr
    AND TRUNC(t2.Day) >= TRUNC(t.Day) - 364
    AND TRUNC(t2.Day) <= TRUNC(t.Day)
)
ORDER BY t.Day DESC, t.CustomerNbr;
If you want to report on just the days you have orders for, then a simple WHERE clause should be enough:
SELECT Day, CustomerNbr, COUNT(1) OrderCount
FROM <yourTable>
WHERE TRUNC(Day) >= TRUNC(SYSDATE - 364)
GROUP BY Day, CustomerNbr
ORDER BY Day Desc;
If you want to report on every day, you'll need to generate the days first. This can be done with a hierarchical CONNECT BY query, which you then join to your table:
WITH last365Days AS
(
    SELECT TRUNC(SYSDATE - ROWNUM + 1) dt
    FROM DUAL CONNECT BY ROWNUM <= 365
)
SELECT d.dt, COALESCE(t.CustomerNbr, 'None') CustomerNbr,
       SUM(CASE WHEN t.CustomerNbr IS NULL THEN 0 ELSE 1 END) OrderCount
FROM last365Days d
LEFT OUTER JOIN <yourTable> t
ON d.dt = TRUNC(t.Day)
GROUP BY d.dt, t.CustomerNbr
ORDER BY d.dt DESC;
I would probably have done it with an analytic function. In the windowing clause you can specify a number of rows before, or a range; in this case I will use a range.
This will give you, for each customer and each day, the number of orders during the rolling year before that date:
WITH DATES AS (
SELECT * FROM
(SELECT TRUNC(SYSDATE)-(LEVEL-1) AS DAY FROM DUAL CONNECT BY TRUNC(SYSDATE)-(LEVEL-1) >= ( SELECT MIN(TRUNC(DAY)) FROM MY_TABLE ))
CROSS JOIN
(SELECT DISTINCT CUST_ID FROM MY_TABLE))
SELECT DISTINCT
DATES.DAY,
DATES.CUST_ID,
COUNT(ORDER_ID) OVER (PARTITION BY DATES.CUST_ID ORDER BY DATES.DAY RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND INTERVAL '1' SECOND PRECEDING)
FROM
DATES
LEFT JOIN
MY_TABLE
ON DATES.DAY=TRUNC(MY_TABLE.DAY) AND DATES.CUST_ID=MY_TABLE.CUST_ID
ORDER BY DATES.CUST_ID,DATES.DAY;
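A small aside on the windowing clause above, on assumed toy data, showing why RANGE (time-based) rather than ROWS (row-count-based) is the right tool when days can be missing:
-- RANGE measures the window in ORDER BY units (time here), so data gaps
-- matter; ROWS just counts back a fixed number of rows.
WITH T AS (
  SELECT DATE '2018-01-05' AS DAY, 256 AS ORDER_ID FROM DUAL UNION ALL
  SELECT DATE '2018-03-07', 1547 FROM DUAL UNION ALL
  SELECT DATE '2019-06-01', 2000 FROM DUAL
)
SELECT DAY,
       COUNT(ORDER_ID) OVER (ORDER BY DAY
         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING
                   AND INTERVAL '1' SECOND PRECEDING) AS CNT_LAST_YEAR, -- 0, 1, 0
       COUNT(ORDER_ID) OVER (ORDER BY DAY
         ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS CNT_PREV_ROW      -- 0, 1, 1
FROM T;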

Add Missing monthly dates in a timeseries data in Postgresql

I have monthly time series data in a table where the dates are the last day of each month. Some of the dates are missing from the data. I want to insert those dates with a zero value for the other attributes.
Table is as follows:
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-04-30 34
2 2014-05-31 45
2 2014-08-31 47
I want to convert this table to
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-03-31 0
1 2015-04-30 34
2 2014-05-31 45
2 2014-06-30 0
2 2014-07-31 0
2 2014-08-31 47
Is there any way we can do this in Postgresql?
Currently we are doing this in Python, but as our data grows day by day it's not efficient to handle the I/O just for this one task.
Thank you
You can do this using generate_series() to generate the dates and then left join to bring in the values:
with m as (
select id, min(report_date) as minrd, max(report_date) as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) as report_date
from m
) m left join
t
on m.id = t.id and m.report_date = t.report_date;
EDIT:
Turns out that the above doesn't quite work, because adding months to the end of a month doesn't keep the last day of the month.
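You can see the drift with a standalone call (generate_series adds the step iteratively, so once the running value falls back to the 28th in February it stays there):
-- Once the series passes through Feb 28, adding '1 month' keeps day 28:
select d::date
from generate_series(date '2015-01-31',
                     date '2015-04-30',
                     interval '1 month') as d;
-- 2015-01-31, 2015-02-28, 2015-03-28, 2015-04-28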
This is easily fixed:
with t as (
select 1 as id, date '2012-01-31' as report_date, 10 as price union all
select 1 as id, date '2012-04-30', 20
), m as (
select id, min(report_date) - interval '1 day' as minrd, max(report_date) - interval '1 day' as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) + interval '1 day' as report_date
from m
) m left join
t
on m.id = t.id and m.report_date = t.report_date;
The first CTE is just to generate sample data.
This is a slight improvement over Gordon's query which fails to get the last date of a month in some cases.
Essentially you generate all the month-end dates between the min and max date for each id (using generate_series) and left join to this generated table to show the missing dates with 0 price.
with minmax as (
select id, min(report_date) as mindt, max(report_date) as maxdt
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select *,
generate_series(date_trunc('MONTH',mindt+interval '1' day),
date_trunc('MONTH',maxdt+interval '1' day),
interval '1' month) - interval '1 day' as report_date
from minmax
) m
left join t on m.id = t.id and m.report_date = t.report_date
Sample Demo
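To see why this version holds the month ends, here is a standalone check of the generator for id 1's range (2015-01-31 to 2015-04-30):
-- Truncate to the first of the following month, step by whole months
-- (always the 1st, so no drift), then step back one day to the month end:
select (d - interval '1 day')::date as month_end
from generate_series(date_trunc('MONTH', date '2015-01-31' + interval '1 day'),
                     date_trunc('MONTH', date '2015-04-30' + interval '1 day'),
                     interval '1 month') as d;
-- 2015-01-31, 2015-02-28, 2015-03-31, 2015-04-30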