How to identify and aggregate a sequence from start and end dates - SQL

I'm trying to identify consecutive sequences of dates, per person, as well as sum the amount for each sequence. My records table looks like this:
person start_date end_date amount
1 2015-09-10 2015-09-11 500
1 2015-09-11 2015-09-12 100
1 2015-09-13 2015-09-14 200
1 2015-10-05 2015-10-07 2000
2 2015-10-05 2015-10-05 300
2 2015-10-06 2015-10-06 1000
3 2015-04-23 2015-04-23 900
The resulting output should be this:
person sequence_start_date sequence_end_date amount
1 2015-09-10 2015-09-14 800
1 2015-10-05 2015-10-07 2000
2 2015-10-05 2015-10-06 1300
3 2015-04-23 2015-04-23 900
Below, I can use LAG and LEAD to identify the sequence start_date and end_date, but I don't have a way to aggregate the amount. I'm assuming the answer will involve some sort of ROW_NUMBER() window function that partitions by sequence; I just can't figure out how to make the sequence identifiable to the function.
SELECT
person
,COALESCE(sequence_start_date, LAG(sequence_start_date, 1) OVER (ORDER BY person, start_date)) AS "sequence_start_date"
,COALESCE(sequence_end_date, LEAD(sequence_end_date, 1) OVER (ORDER BY person, start_date)) AS "sequence_end_date"
FROM
(
SELECT
person
,start_date
,end_date
,CASE WHEN LAG(end_date, 1) OVER (PARTITION BY person ORDER BY start_date) + interval '1 day' = start_date
THEN NULL
ELSE start_date
END AS "sequence_start_date"
,CASE WHEN LEAD(start_date, 1) OVER (PARTITION BY person ORDER BY start_date) - interval '1 day' = end_date
THEN NULL
ELSE end_date
END AS "sequence_end_date"
,amount
FROM records
) sq

Even your updated (sub)query still isn't quite right for the data you've presented, which is inconsistent about whether the start date of the second and subsequent rows in a sequence should equal the previous row's end date or fall one day later. The query can be updated fairly easily to accommodate both, if that's needed.
In any case, you cannot use COALESCE as a window function. Aggregate functions may be used as window functions by attaching an OVER clause, but ordinary functions may not. There are nevertheless ways to apply window functions to this task. Here's a way to identify the sequences in your data (as presented):
SELECT
person
,MAX(sequence_start_date)
OVER (
PARTITION BY person
ORDER BY start_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "sequence_start_date"
,MIN(sequence_end_date)
OVER (
PARTITION BY person
ORDER BY start_date
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
AS "sequence_end_date"
,amount
FROM
(
SELECT
person
,start_date
,end_date
,CASE WHEN LAG(end_date, 1) OVER (PARTITION BY person ORDER BY start_date) + interval '1 day' >= start_date
THEN date '0001-01-01'
ELSE start_date
END AS "sequence_start_date"
,CASE WHEN LEAD(start_date, 1) OVER (PARTITION BY person ORDER BY start_date) - interval '1 day' <= end_date
THEN NULL
ELSE end_date
END AS "sequence_end_date"
,amount
FROM records
ORDER BY person, start_date
) sq_part
ORDER BY person, sequence_start_date
That relies on MAX() and MIN() instead of COALESCE(), and it applies window framing to get the appropriate scope for each of those within each partition. Results:
person sequence_start_date sequence_end_date amount
1 September, 10 2015 00:00:00 September, 14 2015 00:00:00 500
1 September, 10 2015 00:00:00 September, 14 2015 00:00:00 100
1 September, 10 2015 00:00:00 September, 14 2015 00:00:00 200
1 October, 05 2015 00:00:00 October, 07 2015 00:00:00 2000
2 October, 05 2015 00:00:00 October, 06 2015 00:00:00 300
2 October, 05 2015 00:00:00 October, 06 2015 00:00:00 1000
3 April, 23 2015 00:00:00 April, 23 2015 00:00:00 900
Note that this does not require an exact match of end date with the subsequent start date; all rows for a given person that abut or overlap will be assigned to the same sequence. If (person, start_date) cannot be relied upon to be unique, however, then you probably need to order the partitions by end date as well.
And now you have a way to identify the sequences: they are characterized by the triple (person, sequence_start_date, sequence_end_date). (Actually, you need only the person and one of those dates for identification purposes, but read on.) You can wrap the above query in an inline view of an outer aggregate query to produce your desired result:
SELECT
person,
sequence_start_date,
sequence_end_date,
SUM(amount) AS "amount"
FROM ( <above query> ) sq
GROUP BY person, sequence_start_date, sequence_end_date
Of course you need both dates as grouping columns if you're going to select them.
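As a sanity check, the whole pipeline (identify islands, then SUM per island) can be run locally. The sketch below uses Python's sqlite3 as a stand-in engine (SQLite 3.25+ for window functions) and a slight variant of the technique above: a running SUM of "new sequence" flags as the group key instead of the MAX/MIN framing. The table and column names come from the question.

```python
# Gaps-and-islands with aggregation, sketched in SQLite. A row continues a
# sequence when its start_date is on or before the previous end_date + 1 day;
# the running SUM of "new sequence" flags becomes the island id to group by.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (person INT, start_date TEXT, end_date TEXT, amount INT);
INSERT INTO records VALUES
  (1, '2015-09-10', '2015-09-11', 500),
  (1, '2015-09-11', '2015-09-12', 100),
  (1, '2015-09-13', '2015-09-14', 200),
  (1, '2015-10-05', '2015-10-07', 2000),
  (2, '2015-10-05', '2015-10-05', 300),
  (2, '2015-10-06', '2015-10-06', 1000),
  (3, '2015-04-23', '2015-04-23', 900);
""")

rows = conn.execute("""
WITH flagged AS (
  SELECT *,
         -- 1 when this row does NOT abut/overlap the previous row's end date
         CASE WHEN start_date <= date(LAG(end_date) OVER w, '+1 day')
              THEN 0 ELSE 1 END AS new_seq
  FROM records
  WINDOW w AS (PARTITION BY person ORDER BY start_date)
),
grouped AS (
  SELECT *,
         SUM(new_seq) OVER (PARTITION BY person ORDER BY start_date) AS seq_id
  FROM flagged
)
SELECT person, MIN(start_date), MAX(end_date), SUM(amount)
FROM grouped
GROUP BY person, seq_id
ORDER BY person, MIN(start_date)
""").fetchall()

for r in rows:
    print(r)
```

Run as-is this prints one row per sequence, e.g. (1, '2015-09-10', '2015-09-14', 800); note the person-2 total is 1300 (300 + 1000).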

Why not:
select a1.person, a1.sequence_start_date, a1.sequence_end_date,
sum(rx.amount) as amount
from (select distinct person, sequence_start_date, sequence_end_date
      from (EXISTING_QUERY) q) a1
left join records rx
on rx.person = a1.person
and rx.start_date >= a1.sequence_start_date
and rx.end_date <= a1.sequence_end_date
group by a1.person, a1.sequence_start_date, a1.sequence_end_date

Related

Oracle SQL: How to fill Null value with data from most recent previous date that is not null?

Essentially, the date field is updated every month along with other fields; however, one field is only updated ~6 times throughout the year. For months where that field is not updated, I'm looking to show the most recent previous value.
Date Emp_no Sales Group
Jan 1234 100 Med
Feb 1234 200 ---
Mar 1234 170 ---
Apr 1234 150 Low
May 1234 180 ---
Jun 1234 90 High
Jul 1234 100 ---
Need it to show:
Date Emp_no Sales Group
Jan 1234 100 Med
Feb 1234 200 Med
Mar 1234 170 Med
Apr 1234 150 Low
May 1234 180 Low
Jun 1234 90 High
Jul 1234 100 High
This field is not updated at set intervals; there could be 1-4 months of NULLs in a row.
I tried something like this to get the most recent previous value, but I'm unsure how to deal with the fact that I could need between 1-4 months prior:
LAG(Group)
OVER(PARTITION BY emp_no
ORDER BY date)
Thanks!
This is the traditional "gaps and islands" problem.
There are various ways to solve it; a simple version will work for you.
First, create a new identifier that splits the rows into "groups", where only the first row in each group has a NOT NULL "group" value.
SUM(CASE WHEN "group" IS NOT NULL THEN 1 ELSE 0 END) OVER (PARTITION BY emp_no ORDER BY "date") AS emp_group_id
Then you can use MAX() in another window function, as all "groups" will only have one NOT NULL value.
WITH
gaps
AS
(
SELECT
t.*,
SUM(
CASE WHEN "group" IS NOT NULL
THEN 1
ELSE 0
END
)
OVER (
PARTITION BY emp_no
ORDER BY "date"
)
AS emp_group_id
FROM
your_table t
)
SELECT
"date",
emp_no,
sales,
MAX("group")
OVER (
PARTITION BY emp_no, emp_group_id
)
AS "group"
FROM
gaps
Edit
Ignore all that.
Oracle has IGNORE NULLS.
LAST_VALUE("group" IGNORE NULLS)
OVER (
PARTITION BY emp_no
ORDER BY "date"
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
)
AS "group"
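On engines without IGNORE NULLS, the SUM-of-flags grouping above is still the way to go. Here is a hedged, runnable sketch of it in Python's sqlite3 with the question's sample data; "date" and "group" are renamed to month and grp because they are SQL keywords, and the month is stored as a number so ORDER BY works.

```python
# Fill NULL groups with the most recent non-NULL value: a running SUM of
# NOT-NULL flags assigns each row a group id, and MAX() over that group
# recovers the single non-NULL value in it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table (month INT, emp_no INT, sales INT, grp TEXT);
INSERT INTO your_table VALUES
  (1, 1234, 100, 'Med'), (2, 1234, 200, NULL), (3, 1234, 170, NULL),
  (4, 1234, 150, 'Low'), (5, 1234, 180, NULL), (6, 1234, 90, 'High'),
  (7, 1234, 100, NULL);
""")

rows = conn.execute("""
WITH gaps AS (
  SELECT *,
         SUM(CASE WHEN grp IS NOT NULL THEN 1 ELSE 0 END)
           OVER (PARTITION BY emp_no ORDER BY month) AS emp_group_id
  FROM your_table
)
SELECT month, emp_no, sales,
       MAX(grp) OVER (PARTITION BY emp_no, emp_group_id) AS grp
FROM gaps
ORDER BY month
""").fetchall()

filled = [r[3] for r in rows]
print(filled)  # the filled-in group value for each month
```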

how to aggregate one record multiple times based on condition

I have a bunch of records in the table below.
product_id produced_date expired_date
123 2010-02-01 2012-05-31
234 2013-03-01 2014-08-04
345 2012-05-01 2018-02-25
... ... ...
I want the output to display how many unexpired products we currently have at the monthly level. (Say, if a product expires on August 04, we still count it in the August stock.)
Month n_products
2010-02-01 10
2010-03-01 12
...
2022-07-01 25
2022-08-01 15
How should I do this in Presto or Hive? Thank you!
You can use the SQL below.
Here we use CASE WHEN to check whether a product is expired (produced_date >= expired_date); if it is expired, we sum it to get the count of products that have expired, and then group that data by expiry month.
select
TRUNC(expired_date, 'MM') expired_month,
SUM( case when produced_date >= expired_date then 1 else 0 end) n_products
from mytable
group by 1
We can use the unnest and sequence functions to create a derived table; joining our table with this derived table should give us the desired result.
select m.month, count(t.product_id) as n_products
from (
  select x as month
  from (select min(date_trunc('month', produced_date)) as mn,
               max(date_trunc('month', expired_date)) as mx
        from mytable)
  cross join unnest(sequence(mn, mx, interval '1' month)) t(x)
) m
left join mytable t
  on m.month between date_trunc('month', t.produced_date)
                 and date_trunc('month', t.expired_date)
group by 1
order by 1
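To make the month-expansion logic concrete without a Presto cluster, here is a plain-Python sketch of the same idea: build a month spine from the earliest produced month to the latest expired month, then count each product in every month its interval covers (so a product expiring mid-month still counts for that month).

```python
# Count unexpired products per month by expanding a month spine and testing
# each product's [produced month, expired month] interval against it.
from datetime import date

products = [                      # (product_id, produced_date, expired_date)
    (123, date(2010, 2, 1), date(2012, 5, 31)),
    (234, date(2013, 3, 1), date(2014, 8, 4)),
    (345, date(2012, 5, 1), date(2018, 2, 25)),
]

def month_spine(start, end):
    """Yield the first day of every month from start's month to end's month."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

lo = min(p[1] for p in products)
hi = max(p[2] for p in products)
counts = {
    m: sum(1 for _, prod, exp in products
           # in stock from its produced month through its expired month
           if date(prod.year, prod.month, 1) <= m <= date(exp.year, exp.month, 1))
    for m in month_spine(lo, hi)
}
print(counts[date(2012, 5, 1)])  # products 123 and 345 are both active
```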

Count distinct customers who bought in previous period and not in next period Bigquery

I have a dataset in bigquery which contains order_date: DATE and customer_id.
order_date | CustomerID
2019-01-01 | 111
2019-02-01 | 112
2020-01-01 | 111
2020-02-01 | 113
2021-01-01 | 115
2021-02-01 | 119
I'm trying to count distinct customer_id values between the months of the previous year and the same months of the current year - for example, from 2019-01-01 to 2020-01-01, then from 2019-02-01 to 2020-02-01 - and then those who did not buy in the same period of the next year: 2020-01-01 to 2021-01-01, then 2020-02-01 to 2021-02-01.
The output I expect:
order_date| count distinct CustomerID|who not buy in the next period
2020-01-01| 5191 |250
2020-02-01| 4859 |500
2020-03-01| 3567 |349
..........| .... |......
and the next periods shouldn't include the previous.
I tried the code below, but it works differently:
with customers as (
select distinct date_trunc(date(order_date),month) as dates,
CUSTOMER_WID
from t
where date(order_date) between '2018-01-01' and current_date()-1
)
select
dates,
customers_previous,
customers_next_period
from
(
select dates,
count(CUSTOMER_WID) as customers_previous,
count(case when customer_wid_next is null then 1 end) as customers_next_period,
from (
select prev.dates,
prev.CUSTOMER_WID,
next.dates as next_dates,
next.CUSTOMER_WID as customer_wid_next
from customers as prev
left join customers
as next on next.dates=date_add(prev.dates,interval 1 year)
and prev.CUSTOMER_WID=next.CUSTOMER_WID
) as t2
group by dates
)
order by 1,2
Thanks in advance.
If I understand correctly, you are trying to count values over a window of time, and for that I recommend using window functions - docs here, and here is a great article explaining how they work.
That said, my recommendation would be:
SELECT DISTINCT
periods,
COUNT(DISTINCT customer_id) OVER last_12mos AS count_customers_last_12_mos
FROM (
SELECT
order_date,
FORMAT_DATE('%Y%m', order_date) AS periods,
customer_id
FROM dataset
)
WINDOW last_12mos AS ( # window of the last 12 months, excluding the current month
PARTITION BY periods ORDER BY periods DESC
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
)
I believe you can build on this to customize the aggregations to what you want.
You can generate the periods using unnest(generate_date_array()). Then use joins to bring in the customers from the previous 12 months and the next 12 months. Finally, aggregate and count the customers:
select period,
count(distinct c_prev.customer_wid),
count(distinct c_next.customer_wid)
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period join
customers c_prev
on c_prev.order_date <= period and
c_prev.order_date > date_add(period, interval -12 month) left join
customers c_next
on c_next.customer_wid = c_prev.customer_wid and
c_next.order_date > period and
c_next.order_date <= date_add(period, interval 12 month)
group by period;
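The counting rule behind those joins can be sketched in plain Python to check it against the sample data. This sketch works at month granularity, which matches the first-of-month sample rows (the SQL compares exact dates).

```python
# For each period: the distinct customers of the trailing 12 months, and,
# among them, those with no order in the following 12 months.
from datetime import date

orders = [                      # (order_date, customer_id) from the question
    (date(2019, 1, 1), 111), (date(2019, 2, 1), 112),
    (date(2020, 1, 1), 111), (date(2020, 2, 1), 113),
    (date(2021, 1, 1), 115), (date(2021, 2, 1), 119),
]

def months_between(a, b):
    return (b.year - a.year) * 12 + (b.month - a.month)

def report(period):
    # trailing window: strictly after period - 12 months, up to period itself
    prev = {c for d, c in orders if 0 <= months_between(d, period) < 12}
    # next window: strictly after period, up to period + 12 months
    nxt = {c for d, c in orders if 0 < months_between(period, d) <= 12}
    return len(prev), len(prev - nxt)

print(report(date(2020, 1, 1)))  # (customers in period, of whom bought nothing next year)
```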

record for last two month and their difference in oracle

I need the variance for the last two months, and I am using the query below:
with Positions as
(
select
COUNT(DISTINCT A_SALE||B_SALE) As SALES,
TO_CHAR(DATE,'YYYY-MON') As Period
from ORDERS
where DATE between date '2020-02-01' and date '2020-02-29'
group by TO_CHAR(DATE,'YYYY-MON')
union all
select
COUNT(DISTINCT A_SALE||B_SALE) As SALES,
TO_CHAR(DATE,'YYYY-MON') As Period
from ORDERS
where DATE between date '2020-03-01' and date '2020-03-31'
group by TO_CHAR(DATE,'YYYY-MON')
)
select
SALES,
period,
case when to_char(round((SALES-lag(SALES,1, SALES) over (order by period desc))/ SALES*100,2), 'FM999999990D9999') <0
then to_char(round(abs( SALES-lag(SALES,1, SALES) over (order by period desc))/ SALES*100,2),'FM999999990D9999')||'%'||' (Increase) '
when to_char(round((SALES-lag(SALES,1,SALES) over (order by period desc))/SALES*100,2),'FM999999990D9999')>0
then to_char(round(abs(SALES-lag(SALES,1, SALES) over (order by period desc ))/SALES*100,2),'FM999999990D9999')||'%'||' (Decrease) '
END as variances
from Positions
order by Period;
I am getting output like this:
SALES | Period | variances
---------|------------------|--------------------
100 | 2020-FEB | 100%(Increase)
200 | 2020-MAR | NULL
I want the variance shown against March instead of February, since we are looking at the variance for the latest month:
SALES | Period | variances
---------|------------------|--------------------
200 | 2020-MAR | 100%(Increase)
100 | 2020-FEB | NULL
I did not analyze the query in too much detail, but you have one obvious flaw: you convert your period from a date to a char. That means that when you apply your window functions, the ordering will not work as expected.
A date ordered desc will look like this (chronological ordering):
MAR - 2020
FEB - 2020
JAN - 2020
Text ordered desc will look like (based on alphabetical ordering)
MAR - 2020
JAN - 2020
FEB - 2020
That being said, you are comparing a 'good' case (FEB + MAR), where both the text ordering and the date ordering work the same way.
The implied ordering is ASCENDING. So at the end, when you do
order by Period;
it will display February first and then March. If you do
order by Period DESC;
you will get March displayed first.
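A one-liner illustrates the pitfall: month abbreviations sorted as text come out alphabetically, not chronologically.

```python
# Sorting month abbreviations as text gives alphabetical, not chronological,
# order -- exactly the flaw described above (JAN and FEB swap places).
months = ["JAN", "FEB", "MAR"]
print(sorted(months, reverse=True))  # ['MAR', 'JAN', 'FEB']
```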

Get sum of a column by grouping based on a range given in 2 columns

I have a sql command that gives me results in the following columns
start_date, end_date, count, weekday
I want to get, for each start_date, the sum of the count from that start_date to its end_date where the weekdays match.
So for example, if I have a row with start_date = 2012-01-01, end_date = 2012-08-08 and weekday = Tuesday, I want to find all the other rows that have a start_date falling within that range AND that are a Tuesday, then find the sum of their counts. How can I achieve that?
E.g. From this table
Start || End ||Count|| Weekday
2012-01-01 || 2012-12-12 || 5 || Tuesday
2012-05-05 || 2012-12-12 || 7 || Tuesday
2012-06-06 || 2012-10-10 || 2 || Wednesday
2012-07-07 || 2012-08-08 || 8 || Wednesday
2012-09-09 || 2012-10-10 || 9 || Tuesday
It should return
date | sum_count
2012-01-01 | 16 // count of 2012-05-05 + 2012-09-09 (Tuesdays only)
2012-05-05 | 9
2012-06-06 | 8
2012-07-07 | 0
2012-09-09 | 0
Without a fiddle (sqlfiddle.com) it will be hard to get this right on the first try, but what you want to do is something along these lines:
select count(*), t.*
from
(
select *
from
(
select start_date, end_date, weekday
from table
where start_date >= timestamp '2012-01-01'
and end_date <= timestamp '2012-08-08'
) dates
where weekday = 'Tuesday'
) t;
The objective is to reduce your result set each time, by keeping weekday in a separate subquery you can potentially avoid a costly join or 2.
Question
huh? I still don't understand though. 2012 08 08, 2012 01 01 and
Tuesday are from the input table, and there are multiple rows that I
need to process. Are you saying that processing each row separately is
more efficient?
You have to process each row individually, unless you know some way to avoid a full table scan when searching against dates. This hinges on comparing the explain plans, which we do not have, as we are still awaiting your fiddle.
The key is that the innermost query gives you the date range that you want, with all the days of the week. It is more efficient (most of the time) to then execute the more specific WHERE clause, in your case the day of the week. The reason for this is that the database (most modern ones do this) tries to order the data in such a way that it can return as quickly as possible.
Extra update
As a real-world example of this, I have a table with close to 1 billion entries in it that I must run an analytic function over. The first way I did this was:
select *
from
(
select *, row_number() over (partition by id order by seen desc) rn
from foo
)where rn =1
and status = 1
The above would take about 9 minutes to execute. When I modified the query to be this:
select *
from
( select *
from
(
select *, row_number() over (partition by id order by seen desc) rn
from foo
)where status = 1
) where rn = 1
it returns in just under 1 minute. This is an example where I carefully reduced the size of the driving result set so that the system would do less work and thus return more quickly.
I hope this is your requirement... this one works in Oracle with your sample data:
select TAB.START_DATE START_DATE, nvl(X1.SUM_COUNT,0) SUM_COUNT
from TABLE2 TAB,
( select A1.START_DATE,SUM(A2.COUNT) SUM_COUNT
from TABLE2 A1,TABLE2 A2
where A1.WEEKDAY=A2.WEEKDAY and A1.rowid <> A2.rowid
and A2.START_DATE between A1.START_DATE and A1.END_DATE
group by A1.START_DATE
) X1
where TAB.START_DATE=X1.START_DATE(+) order by 1
please refer this sql fiddle: http://sqlfiddle.com/#!4/2019f/4
Try this; I believe a self join is the best option:
select b.start_date, nvl(sum(a.count), 0)
from TABLE2 a right join TABLE2 b
  on a.start_date <> b.start_date
 and a.weekday = b.weekday
 and a.start_date between b.start_date and b.end_date
group by b.start_date
order by b.start_date
fiddledemo
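For a quick local check of the self-join idea, here is a sqlite3 sketch. Assumptions: the count column is renamed cnt to dodge the SQL keyword, nvl becomes ifnull, and the RIGHT JOIN is rewritten as a LEFT JOIN with the tables swapped (for older SQLite versions without RIGHT JOIN).

```python
# Self join: for each row b, sum the counts of other rows a with the same
# weekday whose start_date falls inside b's [start_date, end_date] range.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table2 (start_date TEXT, end_date TEXT, cnt INT, weekday TEXT);
INSERT INTO table2 VALUES
  ('2012-01-01', '2012-12-12', 5, 'Tuesday'),
  ('2012-05-05', '2012-12-12', 7, 'Tuesday'),
  ('2012-06-06', '2012-10-10', 2, 'Wednesday'),
  ('2012-07-07', '2012-08-08', 8, 'Wednesday'),
  ('2012-09-09', '2012-10-10', 9, 'Tuesday');
""")

rows = conn.execute("""
SELECT b.start_date, ifnull(SUM(a.cnt), 0)
FROM table2 b
LEFT JOIN table2 a
  ON a.start_date <> b.start_date
 AND a.weekday = b.weekday
 AND a.start_date BETWEEN b.start_date AND b.end_date
GROUP BY b.start_date
ORDER BY b.start_date
""").fetchall()

print(rows)  # 2012-01-01 picks up the two later Tuesdays (7 + 9 = 16)
```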