Conditionally grouping by date - sql

I'm having a bit of trouble figuring this one out.
I have two tables, items and stocks:
items
id | name
---+--------
 1 | item_1
 2 | item_2
stocks
id | item_id | quantity | expired_on
---+---------+----------+------------
 1 |       1 |        5 | 2015-11-12
 2 |       1 |        5 | 2015-11-13
 3 |       2 |        5 | 2015-11-12
 4 |       2 |        5 | 2015-11-14
I want to be able to retrieve a big table grouped by date, and for each date, group by item_id and show the sum of the quantity that's not expired.
result
date       | item_id | unexpired
-----------+---------+-----------
2015-11-11 |       1 |        10
2015-11-11 |       2 |        10
2015-11-12 |       1 |         5
2015-11-12 |       2 |         5
2015-11-13 |       1 |         0
2015-11-13 |       2 |         5
2015-11-14 |       1 |         0
2015-11-14 |       2 |         0
I'm able to retrieve the result if it's just one day:
SELECT
    items.id, SUM(stocks.quantity) as unexpired
FROM
    items LEFT OUTER JOIN stocks
    ON items.id = stocks.item_id
WHERE
    stocks.expired_on > '2015-11-11'
GROUP BY
    items.id, stocks.quantity
I searched around and found something called DatePart, but it doesn't seem to be what I need.

Use the convenient cast from boolean to integer, which yields 0, 1 or null, to sum only the unexpired quantity:
select
    to_char(d, 'YYYY-MM-DD') as date,
    item_id,
    sum(quantity * (expired_on > d)::int) as unexpired
from
    stocks
    cross join
    generate_series(
        '2015-11-11'::date, '2015-11-14', '1 day'
    ) d(d)
group by 1, 2
order by 1, 2
;
    date    | item_id | unexpired
------------+---------+-----------
 2015-11-11 |       1 |        10
 2015-11-11 |       2 |        10
 2015-11-12 |       1 |         5
 2015-11-12 |       2 |         5
 2015-11-13 |       1 |         0
 2015-11-13 |       2 |         5
 2015-11-14 |       1 |         0
 2015-11-14 |       2 |         0
The cross join to the generate_series supplies all dates in the given range.
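For reference, here is the series generator on its own (a minimal sketch of what the cross join is fed); the alias d(d) names both the derived table and its column:
-- One row per calendar day in the closed range 2015-11-11 .. 2015-11-14.
select d::date
from generate_series('2015-11-11'::date, '2015-11-14', '1 day') as d(d);
-- returns 2015-11-11, 2015-11-12, 2015-11-13, 2015-11-14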
The data used above:
create table stocks (
    id         int,
    item_id    int,
    quantity   int,
    expired_on date
);
insert into stocks (id, item_id, quantity, expired_on) values
    (1, 1, 5, '2015-11-12'),
    (2, 1, 5, '2015-11-13'),
    (3, 2, 5, '2015-11-12'),
    (4, 2, 5, '2015-11-14');

You need to generate the list of dates and then use a cross join to get the full combinations of dates and items. Then, a left join to the stocks table gives the quantity that expires on each date. A cumulative sum -- in reverse -- calculates the unexpired quantity:
select d.dte, i.item_id,
       sum(quantity) over (partition by i.item_id
                           order by d.dte desc
                           rows between unbounded preceding and 1 preceding
                          ) as unexpired
from (select generate_series(min(expired_on) - interval '1 day', max(expired_on), interval '1 day') as dte
      from stocks
     ) d cross join
     items i left join
     stocks s
     on d.dte = s.expired_on and i.item_id = s.item_id;

Related

How to generate a date array and forward fill missing data using BigQuery?

I have a table with weeks of missing data (shown below):
week       | customer_id | score
-----------|-------------|-------
2019-10-27 |           1 |     3
2019-11-10 |           1 |     4
2019-10-20 |           2 |     5
2019-10-27 |           2 |     8
Therefore I've used BigQuery's GENERATE_DATE_ARRAY function to fill in the missing weeks for each customer (in the range 2019-10-20 to 2019-11-10), which results in a NULL customer_id and score value for those weeks that were missing (shown below).
week       | customer_id | score
-----------|-------------|-------
2019-10-20 |        NULL |  NULL
2019-10-27 |           1 |     3
2019-11-03 |        NULL |  NULL
2019-11-10 |           1 |     4
2019-10-20 |           2 |     5
2019-10-27 |           2 |     8
2019-11-03 |        NULL |  NULL
2019-11-10 |        NULL |  NULL
I want to forward fill the customer_id and score for each customer using the last non-null value so that the table looks like this:
week       | customer_id | score
-----------|-------------|-------
2019-10-20 |        NULL |  NULL
2019-10-27 |           1 |     3
2019-11-03 |           1 |     3
2019-11-10 |           1 |     4
2019-10-20 |           2 |     5
2019-10-27 |           2 |     8
2019-11-03 |           2 |     8
2019-11-10 |           2 |     8
I wrote this query; however, since the customer_id value is NULL in some rows, I am unable to partition by this field, and it instead returns NULL values. If I filter with WHERE customer_id = 1 and remove the PARTITION BY clause, I get the desired result, but I cannot get it to work for multiple customers.
WITH weeks AS (
    SELECT week
    FROM UNNEST(GENERATE_DATE_ARRAY('2019-10-20', '2019-11-10', INTERVAL 1 WEEK)) week
),
table AS (
    SELECT *, DATE_TRUNC(EXTRACT(DATE FROM created_at), WEEK) AS week
    FROM score
)
SELECT weeks.week,
       COALESCE(table.customer_id, LAST_VALUE(table.customer_id IGNORE NULLS) OVER (PARTITION BY table.customer_id ORDER BY weeks.week)) AS customer_id,
       COALESCE(table.score, LAST_VALUE(table.score IGNORE NULLS) OVER (PARTITION BY table.customer_id ORDER BY weeks.week)) AS score
FROM weeks
LEFT JOIN table
    ON weeks.week = table.week
I am wondering how I can generate this date array for each customer and then somehow forward fill any missing data using the last customer_id and score value for that customer. Any help would be greatly appreciated!
The most efficient way is just to generate the data as you need it:
select the_week, t.customerid, t.score
from (select DATE_TRUNC(EXTRACT(DATE FROM created_at), WEEK) AS week,
             customerid, score,
             lead(DATE_TRUNC(EXTRACT(DATE FROM created_at), WEEK)) over (partition by customerid order by created_at) as next_week
      from t
     ) t cross join
     unnest(generate_date_array(t.week,
                                date_add(t.next_week, interval -1 week),
                                interval 1 week
                               )) the_week;
By generating only the dates you need for each week, you don't need to "fill" anything in. The only downside is that you don't get data before the first week. You can fill that in if you really want, but it doesn't seem very useful.
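If you do want the full week-by-week grid per customer, as described in the question, one sketch (not the answer's approach) is to cross join the generated weeks with the distinct customers first, so that LAST_VALUE ... IGNORE NULLS can be partitioned by a customer_id that is never NULL; the table and column names below are assumed from the question:
WITH weeks AS (
    SELECT week
    FROM UNNEST(GENERATE_DATE_ARRAY('2019-10-20', '2019-11-10', INTERVAL 1 WEEK)) AS week
),
scores AS (
    SELECT customer_id,
           DATE_TRUNC(EXTRACT(DATE FROM created_at), WEEK) AS week,
           score
    FROM score
)
SELECT w.week,
       c.customer_id,
       -- carry the most recent non-null score forward within each customer
       LAST_VALUE(s.score IGNORE NULLS) OVER (PARTITION BY c.customer_id ORDER BY w.week) AS score
FROM weeks w
CROSS JOIN (SELECT DISTINCT customer_id FROM score) c
LEFT JOIN scores s
       ON s.customer_id = c.customer_id AND s.week = w.week
ORDER BY c.customer_id, w.week;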

SQL interpolating missing values for a specific date range - with some conditions

There are some similar questions on the site, but I believe mine warrants a new post because there are specific conditions that need to be incorporated.
I have a table with monthly intervals, structured like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
|  1 |     10 | 12/17/2017   | 1/17/2018    |
|  1 |     10 | 1/18/2018    | 2/18/2018    |
|  1 |     10 | 2/19/2018    | 3/19/2018    |
|  1 |     10 | 3/20/2018    | 4/20/2018    |
|  1 |     10 | 4/21/2018    | 5/21/2018    |
+----+--------+--------------+--------------+
I've found that sometimes there is a month of data missing around the end/beginning of the year where I know it should exist, like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
|  2 |     10 | 10/14/2018   | 11/14/2018   |
|  2 |     10 | 11/15/2018   | 12/15/2018   |
|  2 |     10 | 1/17/2019    | 2/17/2019    |
|  2 |     10 | 2/18/2019    | 3/18/2019    |
|  2 |     10 | 3/19/2019    | 4/19/2019    |
+----+--------+--------------+--------------+
What I need is a statement that will:
1. Identify where this year-end period is missing (but not find missing months that aren't at the beginning/end of the year).
2. Create this interval by using the length of an existing interval for that ID (maybe using the mean interval length for the ID to do it?). I could create the interval from the "gap" between the previous and next interval, except that won't work if I'm missing an interval at the beginning or end of the ID's record (i.e. if the record starts at, say, 1/16/2015, I need the amount for 12/15/2014-1/15/2015).
3. Interpolate an 'amount' for this interval using the mean daily 'amount' from the closest existing interval.
The end result for the sample above should look like:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
|  2 |     10 | 10/14/2018   | 11/14/2018   |
|  2 |     10 | 11/15/2018   | 12/15/2018   |
|  2 |     10 | 12/16/2018   | 1/16/2019    |
|  2 |     10 | 1/17/2019    | 2/17/2019    |
|  2 |     10 | 2/18/2019    | 3/18/2019    |
+----+--------+--------------+--------------+
A 'nice to have' would be a flag indicating that this value is interpolated.
Is there a way to do this efficiently in SQL? I have written a solution in SAS, but have a need to move it to SQL, and my SAS solution is very inefficient (optimization isn't a goal, so any statement that does what I need is fantastic).
EDIT: I've made an SQLFiddle with my example table here:
http://sqlfiddle.com/#!18/8b16d
You can use a sequence of CTEs to build up the data for the missing periods. In this query, the first CTE (EOYS) generates all the end-of-year dates (YYYY-12-31) relevant to the table; the second (INTERVALS) computes the average interval length for each ID; and the third (MISSING) attempts to find the start (from t2) and end (from t3) dates of the intervals adjoining any missing end-of-year interval (a missing interval is indicated by t1.ID IS NULL). The output of this CTE is then used in an INSERT ... SELECT query to add the missing interval records to the table, generating missing dates by adding/subtracting the interval length to the end/start date of the adjacent interval as necessary.
First though we add the interp column to indicate if a row was interpolated:
ALTER TABLE Table1 ADD interp TINYINT NOT NULL DEFAULT 0;
This sets interp to 0 for all existing rows. Then we can do the INSERT, setting interp for all those rows to 1:
WITH EOYS AS (
    SELECT DISTINCT DATEFROMPARTS(DATEPART(YEAR, interval_beg), 12, 31) AS eoy
    FROM Table1
),
INTERVALS AS (
    SELECT ID, AVG(DATEDIFF(DAY, interval_beg, interval_end)) AS interval_len
    FROM Table1
    GROUP BY ID
),
MISSING AS (
    SELECT e.eoy,
           ids.ID,
           i.interval_len,
           COALESCE(t2.amount, t3.amount) AS amount,
           DATEADD(DAY, 1, t2.interval_end) AS interval_beg,
           DATEADD(DAY, -1, t3.interval_beg) AS interval_end
    FROM EOYS e
    CROSS JOIN (SELECT DISTINCT ID FROM Table1) ids
    JOIN INTERVALS i ON i.ID = ids.ID
    LEFT JOIN Table1 t1 ON ids.ID = t1.ID
                       AND e.eoy BETWEEN t1.interval_beg AND t1.interval_end
    LEFT JOIN Table1 t2 ON ids.ID = t2.ID
                       AND DATEADD(MONTH, -1, e.eoy) BETWEEN t2.interval_beg AND t2.interval_end
    LEFT JOIN Table1 t3 ON ids.ID = t3.ID
                       AND DATEADD(MONTH, 1, e.eoy) BETWEEN t3.interval_beg AND t3.interval_end
    WHERE t1.ID IS NULL
)
INSERT INTO Table1 (ID, amount, interval_beg, interval_end, interp)
SELECT ID,
       amount,
       COALESCE(interval_beg, DATEADD(DAY, -interval_len, interval_end)) AS interval_beg,
       COALESCE(interval_end, DATEADD(DAY, interval_len, interval_beg)) AS interval_end,
       1 AS interp
FROM MISSING
This adds the following rows to the table:
ID  amount  interval_beg  interval_end  interp
 2      10  2017-12-05    2018-01-04         1
 2      10  2018-12-16    2019-01-16         1
 2      10  2019-12-28    2020-01-27         1
Demo on SQLFiddle
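Since the question asked for a flag marking interpolated rows, the interp column added above also makes the generated rows easy to inspect afterwards, for example:
-- Show only the rows that were produced by the interpolation step.
SELECT ID, amount, interval_beg, interval_end
FROM Table1
WHERE interp = 1
ORDER BY ID, interval_beg;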

Showing dates even with zero values - SQL

I have this SQL query:
SELECT Date, Hours, Counts FROM TRANSACTION_DATE
Example Output:
Date        | Hours | Counts
------------+-------+--------
01-Feb-2018 |    20 |      5
03-Feb-2018 |    25 |      3
04-Feb-2018 |    22 |      3
05-Feb-2018 |    21 |      2
07-Feb-2018 |    28 |      1
10-Feb-2018 |    23 |      1
As you can see, some days are missing because there is no data for them, but I want the missing days to be shown with a value of zero:
Date        | Hours | Counts
------------+-------+--------
01-Feb-2018 |    20 |      5
02-Feb-2018 |     0 |      0
03-Feb-2018 |    25 |      3
04-Feb-2018 |    22 |      3
05-Feb-2018 |    21 |      2
06-Feb-2018 |     0 |      0
07-Feb-2018 |    28 |      1
08-Feb-2018 |     0 |      0
09-Feb-2018 |     0 |      0
10-Feb-2018 |    23 |      1
Thank you in advance.
You need to generate a sequence of dates. If there are not too many, a recursive CTE is an easy method:
with dates as (
    select min(date) as dte, max(date) as last_date
    from transaction_date td
    union all
    select dateadd(day, 1, dte), last_date
    from dates
    where dte < last_date
)
select d.dte as date,
       coalesce(td.hours, 0) as hours,
       coalesce(td.counts, 0) as counts
from dates d left join
     transaction_date td
     on d.dte = td.date;
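One caveat, assuming this is SQL Server (the query uses DATEADD): a recursive CTE stops after 100 recursions by default, so for a date range longer than roughly 100 days you would need to lift that limit with a query hint, for example:
with dates as (
    select min(date) as dte, max(date) as last_date
    from transaction_date
    union all
    select dateadd(day, 1, dte), last_date
    from dates
    where dte < last_date
)
select d.dte as date,
       coalesce(td.hours, 0) as hours,
       coalesce(td.counts, 0) as counts
from dates d
left join transaction_date td on d.dte = td.date
option (maxrecursion 0);  -- 0 removes the default limit of 100 recursions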

postgresql - cumulative sum of active customers by month (removing churn)

I want to create a query to get the cumulative sum by month of our active customers. The tricky thing here is that (unfortunately) some customers churn and so I need to remove them from the cumulative sum on the month they leave us.
Here is a sample of my customers table:
customer_id | begin_date | end_date
------------+------------+------------
          1 | 15/09/2017 |
          2 | 15/09/2017 |
          3 | 19/09/2017 |
          4 | 23/09/2017 |
          5 | 27/09/2017 |
          6 | 28/09/2017 | 15/10/2017
          7 | 29/09/2017 | 16/10/2017
          8 | 04/10/2017 |
          9 | 04/10/2017 |
         10 | 05/10/2017 |
         11 | 07/10/2017 |
         12 | 09/10/2017 |
         13 | 11/10/2017 |
         14 | 12/10/2017 |
         15 | 14/10/2017 |
Here is what I am looking to achieve:
month   | active customers
--------+------------------
2017-09 | 7
2017-10 | 6
I've managed to achieve it with the following query ... However, I'd like to know if there is a better way.
select
    "begin_date" as "date",
    sum((new_customers.new_customers - COALESCE(churn_customers.churn_customers, 0))) OVER (ORDER BY new_customers."begin_date") as active_customers
FROM (
    select
        date_trunc('month', begin_date)::date as "begin_date",
        count(id) as new_customers
    from customers
    group by 1
) as new_customers
LEFT JOIN (
    select
        date_trunc('month', end_date)::date as "end_date",
        count(id) as churn_customers
    from customers
    where end_date is not null
    group by 1
) as churn_customers on new_customers."begin_date" = churn_customers."end_date"
order by 1
;
You may use a CTE to compute the number of end_dates per month and then subtract it from the count of begin_dates using a left join.
SQL Fiddle
Query 1:
WITH edt AS (
    SELECT to_char(end_date, 'yyyy-mm') AS mon,
           count(*) AS ct
    FROM customers
    WHERE end_date IS NOT NULL
    GROUP BY to_char(end_date, 'yyyy-mm')
)
SELECT to_char(c.begin_date, 'yyyy-mm') AS month,
       COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON to_char(c.begin_date, 'yyyy-mm') = edt.mon
GROUP BY to_char(begin_date, 'yyyy-mm')
ORDER BY month;
Results:
| month   | active_customers |
|---------|------------------|
| 2017-09 |                7 |
| 2017-10 |                6 |

Get post status for each day from status change history

There is a table post_status_changes, which is a history of post status changes:
post_id | created_at          | status
--------+---------------------+--------
      3 | 2016-09-02 04:00:00 |      1
      3 | 2016-09-04 19:59:21 |      2
      6 | 2016-09-03 15:00:00 |      5
      6 | 2016-09-03 19:52:46 |      1
      6 | 2016-09-04 20:53:22 |      2
What I want to get is, for each day from DayA to DayB, the status of each post as of the end of that day.
DayA = 2016-09-01
DayB = 2016-09-05
post_id | date       | status
--------+------------+--------
      3 | 2016-09-01 |   null
      3 | 2016-09-02 |      1
      3 | 2016-09-03 |      1
      3 | 2016-09-04 |      2
      3 | 2016-09-05 |      2
      6 | 2016-09-01 |   null
      6 | 2016-09-02 |   null
      6 | 2016-09-03 |      1
      6 | 2016-09-04 |      2
      6 | 2016-09-05 |      2
Any solutions?
A solution was found here: PHP: Return all dates between two dates in an array
$period = new DatePeriod(
    new DateTime('2010-10-01'),
    new DateInterval('P1D'),
    new DateTime('2010-10-05')
);
foreach ($period as $each) {
    // .. QUERY here, where "created_at" = $each
}
with a as (
    select convert(varchar(10), created_at, 102) [date], [status], post_id,
           rank() over (partition by convert(varchar(10), created_at), post_id
                        order by created_at desc) as r
    from post_status_changes
)
select post_id, [date], [status]
from a
where r = (select top 1 r
           from a as a2
           where a.[date] = a2.[date] and a.[post_id] = a2.[post_id])
  and @DayA <= [date] and @DayB >= [date]
order by post_id, [date];
For each post_id you want as many rows as there are days between the start and end date. This can be done by cross joining the list of dates with the post_ids and then joining that result back to the table to get the status for each day:
select x.post_id, t.created, p.status
from generate_series(date '2016-09-01', date '2016-09-05', interval '1' day) as t(created)
cross join (
    select distinct post_id
    from post_status_changes
) x
left join post_status_changes p
       on p.created_at::date = t.created
      and p.post_id = x.post_id
order by 1, 2;
Running example: http://rextester.com/CSX38222
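Note that the query above only shows a status on the days when a change actually happened. To carry the status forward and get the value as of the end of each day, as in the desired output, one option (a sketch, assuming PostgreSQL and the table from the question) is a lateral join that picks the latest change on or before each day:
select x.post_id, t.created::date as date, p.status
from generate_series(date '2016-09-01', date '2016-09-05', interval '1' day) as t(created)
cross join (select distinct post_id from post_status_changes) x
left join lateral (
    -- latest status change for this post up to the end of day t.created
    select s.status
    from post_status_changes s
    where s.post_id = x.post_id
      and s.created_at < t.created + interval '1 day'
    order by s.created_at desc
    limit 1
) p on true
order by 1, 2;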