Grouping rows if their dates overlap, and ranking them - sql

My situation is I have a table of transactions, with start and end dates. The problem is that often times these transaction dates overlap with each other, and I want to group these scenarios together.
For example in the case below, transaction #1 is the "root" transaction, while #2-#4 are overlapping with #1 and/or with each other. However, transaction #5 is not overlapping with anything, hence it is a new "root" transaction.
+----------------+-----------+-----------+----------------------------------+
| Transaction ID | StartDate | EndDate | |
+----------------+-----------+-----------+----------------------------------+
| 1 | 1/1/2017 | 1/3/2017 | root transaction |
| 2 | 1/2/2017 | 1/6/2017 | overlaps with #1 |
| 3 | 1/5/2017 | 1/10/2017 | overlaps with #2 |
| 4 | 1/3/2017 | 1/13/2017 | overlaps with #2 and #3 |
| 5 | 1/15/2017 | 1/20/2017 | no overlap, new root transaction |
+----------------+-----------+-----------+----------------------------------+
Below is how I want the output to look. I want to:
1. Identify the root transaction (column 4).
2. Rank the transactions in a chain by EndDate, so that the root is always rank 1.
+----------------+-----------+-----------+------------------+------+
| Transaction ID | Start | End | Root Transaction | Rank |
+----------------+-----------+-----------+------------------+------+
| 1 | 1/1/2017 | 1/3/2017 | 1 | 1 |
| 2 | 1/2/2017 | 1/6/2017 | 1 | 2 |
| 3 | 1/5/2017 | 1/10/2017 | 1 | 3 |
| 4 | 1/3/2017 | 1/13/2017 | 1 | 4 |
| 5 | 1/15/2017 | 1/20/2017 | 5 | 1 |
+----------------+-----------+-----------+------------------+------+
How would I go about this in SQL?

Here is one method using an OUTER APPLY
Declare @YourTable table ([Transaction ID] int, StartDate date, EndDate date)
Insert Into @YourTable values
 (1,'1/1/2017','1/3/2017'),
 (2,'1/2/2017','1/6/2017'),
 (3,'1/5/2017','1/10/2017'),
 (4,'1/3/2017','1/13/2017'),
 (5,'1/15/2017','1/20/2017')
Select [Transaction ID]
      ,[Start] = StartDate
      ,[End]   = EndDate
      ,[Root Transaction] = Grp
      ,[Rank]  = Row_Number() over (Partition By Grp Order by [Transaction ID])
 From (
       -- running max of Flag*ID carries the most recent root ID forward
       Select A.*
             ,Grp = max(Flag*[Transaction ID]) over (Order By [Transaction ID])
        From (
              -- Flag = 1 starts a new group, Flag = 0 overlaps an earlier transaction
              Select A.*, Flag = IsNull(B.Flg,1)
               From @YourTable A
               Outer Apply (
                    Select Top 1 Flg=0
                     From @YourTable
                     -- note: an earlier row that fully contains A's range is not matched by this test
                     Where (StartDate between A.StartDate and A.EndDate
                         or EndDate between A.StartDate and A.EndDate)
                       and [Transaction ID] < A.[Transaction ID]
               ) B
        ) A
 ) A
Returns the output requested in the question.
EDIT - Some Commentary
In the OUTER APPLY, Flag is set to 1 or 0: 1 indicates a new group, and 0 indicates that the record overlaps with an earlier range.
In the next query "up", we use the window function to apply a Grp code (Flag * Trans ID). Remember that a new group is 1 and an existing one is 0.
The window function takes the running max of this product as it traverses the transactions, so every row carries the ID of the most recent root.
The final query just applies the Rank using the window function, partitioned by Grp and ordered by Trans ID.
If it helps with the visualization, for the sample data:
The 1st sub-query (outer apply) generates Flag = 1, 0, 0, 0, 1 for transactions 1 through 5.
The 2nd sub-query generates Grp = 1, 1, 1, 1, 5.
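To check the intermediate values by hand, the same three steps can be mirrored outside the database. This is only a sketch in Python, not part of the answer's SQL; the function name group_by_running_max is made up for illustration, and it copies the answer's overlap test as written (including its limitation when an earlier row fully contains a later one).

```python
from datetime import date

# Sample rows from the question: (transaction_id, start, end)
rows = [
    (1, date(2017, 1, 1), date(2017, 1, 3)),
    (2, date(2017, 1, 2), date(2017, 1, 6)),
    (3, date(2017, 1, 5), date(2017, 1, 10)),
    (4, date(2017, 1, 3), date(2017, 1, 13)),
    (5, date(2017, 1, 15), date(2017, 1, 20)),
]


def group_by_running_max(rows):
    """Return (transaction_id, grp, rank) tuples in transaction-id order."""
    result = []
    grp = 0            # running max of flag * transaction_id
    rank_in_grp = {}   # row counter per group, like Row_Number() per Grp
    for tid, start, end in sorted(rows):
        # Flag = 0 when an earlier row's start or end falls inside this row's range
        overlaps_earlier = any(
            start <= s <= end or start <= e <= end
            for other_id, s, e in rows
            if other_id < tid
        )
        flag = 0 if overlaps_earlier else 1
        grp = max(grp, flag * tid)
        rank_in_grp[grp] = rank_in_grp.get(grp, 0) + 1
        result.append((tid, grp, rank_in_grp[grp]))
    return result


for row in group_by_running_max(rows):
    print(row)
```

Because the rank here follows transaction-id order, it matches the requested EndDate ranking only when IDs and end dates increase together, as they do in the sample.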

This is an example of "gaps-and-islands". For your data, you can determine the "island"s by determining where each starts -- that is, where a record does not overlap with the previous record. You can then get the rank using row_number().
So, here is a method:
select t.*,
       min(transactionId) over (partition by island) as start,
       row_number() over (partition by island order by endDate) as rnk
from (select t.*,
             sum(startIslandFlag) over (order by startDate) as island
      from (select t.*,
                   (case when not exists (select 1
                                          from t t2
                                          where t2.startdate < t.startdate and
                                                t2.enddate >= t.startdate
                                         )
                         then 1 else 0
                    end) as startIslandFlag
            from t
           ) t
     ) t;
Notes:
In the event that the lowest transaction id is not the "root", then a tweak may be needed to the code to get the transaction id with the minimum start date.
If there are duplicate start dates in the data, a tweak may be needed with the cumulative sums (using an explicit range window).
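The island logic above can also be sketched procedurally, which may help when testing edge cases. This is an illustrative Python version, not a translation of the exact query: the name rank_islands is hypothetical, and it tracks the latest end date seen so far instead of using NOT EXISTS, so rows that share a start date land in the same island.

```python
from datetime import date


def rank_islands(rows):
    """rows: (transaction_id, start, end). Return (transaction_id, root_id, rank) sorted by id."""
    island_of = {}
    island = 0
    max_end = None
    for tid, start, end in sorted(rows, key=lambda r: (r[1], r[0])):
        # A new island starts when this row begins after everything seen so far has ended
        if max_end is None or start > max_end:
            island += 1
            max_end = end
        else:
            max_end = max(max_end, end)
        island_of[tid] = island

    # Collect (end, id) pairs per island so they can be ranked by end date
    members = {}
    for tid, _start, end in rows:
        members.setdefault(island_of[tid], []).append((end, tid))

    result = []
    for group in members.values():
        root = min(tid for _end, tid in group)  # lowest id, like min(transactionId)
        for rank, (_end, tid) in enumerate(sorted(group), start=1):
            result.append((tid, root, rank))
    return sorted(result)


sample = [
    (1, date(2017, 1, 1), date(2017, 1, 3)),
    (2, date(2017, 1, 2), date(2017, 1, 6)),
    (3, date(2017, 1, 5), date(2017, 1, 10)),
    (4, date(2017, 1, 3), date(2017, 1, 13)),
    (5, date(2017, 1, 15), date(2017, 1, 20)),
]
print(rank_islands(sample))
```

Ranking by end date means the root is rank 1 only if it also ends first within its island, which holds for the sample data.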

Identify the root transactions:
with roots as (
select *
from tran as t1
where not exists (
select 1
from tran as t2
where t2.Transaction_ID < t1.Transaction_ID
and (
t1.StartDate between t2.StartDate and t2.EndDate
or
t1.EndDate between t2.StartDate and t2.EndDate
)
)
)
Then pair each root with the next root to capture all the transactions that fall between them:
select t.Transaction_ID,
       t.StartDate as [Start],
       t.EndDate as [End],
       r1.Transaction_ID as Root_Transaction,
       row_number() over (partition by r1.Transaction_ID order by t.EndDate) as [Rank]
from roots as r1
-- left join so the last root, which has no following root, keeps its rows
left join roots as r2
    on r2.Transaction_ID > r1.Transaction_ID
inner join tran as t
    on t.Transaction_ID >= r1.Transaction_ID
    and (r2.Transaction_ID is null or t.Transaction_ID < r2.Transaction_ID)
where not exists ( --this "not exists" makes sure r1 and r2 are consecutive roots
    select 1
    from roots as r3
    where r3.Transaction_ID > r1.Transaction_ID
    and r3.Transaction_ID < r2.Transaction_ID
)

Related

How to write this BigQuery query for a retail dataset

I have a table of user retail transactions. It includes sales and cancels: if Qty is positive, the row is a sale; if negative, it is a cancel. I want to attach each cancel to the most appropriate sale. So, I have a table like this:
| CustomerId | StockId | Qty | Date |
|--------------+-----------+-------+------------|
| 1 | 100 | 50 | 2020-01-01 |
| 1 | 100 | -10 | 2020-01-10 |
| 1 | 100 | 60 | 2020-02-10 |
| 1 | 100 | -20 | 2020-02-10 |
| 1 | 100 | 200 | 2020-03-01 |
| 1 | 100 | 10 | 2020-03-05 |
| 1 | 100 | -90 | 2020-03-10 |
User with ID 1 has the following actions: buy 50 -> return 10 -> buy 60 -> return 20 -> buy 200 -> buy 10 -> return 90. For each cancel row (with negative Qty), I find the previous row (by Date) with a positive Qty that is greater than the canceled Qty.
So I need BigQuery queries that create a table like this:
| CustomerId | StockId | Qty | Date | CancelQty |
|--------------+-----------+-------+------------+-------------|
| 1 | 100 | 50 | 2020-01-01 | -10 |
| 1 | 100 | 60 | 2020-02-10 | -20 |
| 1 | 100 | 200 | 2020-03-01 | -90 |
| 1 | 100 | 10 | 2020-03-05 | 0 |
Can anybody help me with these queries? I have created one candidate query (split cancels and sales, join them, and do some stuff to remove rows), but it works incorrectly in the above case.
I use BigQuery, so any BQ SQL features could be applied.
Any ideas will be helpful.
You can use the following query.
;WITH result AS (
select t1.*,t2.Qty as cQty,t2.Date as Date_t2 from
(select *,ROW_NUMBER() OVER (ORDER BY qty DESC) AS [ROW NUMBER] from Test) t1
join
(select *,ROW_NUMBER() OVER (ORDER BY qty) AS [ROW NUMBER] from Test) t2
on t1.[ROW NUMBER] = t2.[ROW NUMBER]
)
select CustomerId,StockId,Qty,Date,ISNULL(cQty, 0) As CancelQty,Date_t2
from (select CustomerId,StockId,Qty,Date,case
when cQty < 0 then cQty
else NULL
end AS cQty,
case
when cQty < 0 then Date_t2
else NULL
end AS Date_t2 from result) t
where qty > 0
order by cQty desc
result: https://dbfiddle.uk
You can do this as a gaps-and-islands problem. Basically, add a grouping column to the rows based on a cumulative reverse count of negative values. Then within each group, choose the first row where the sum is positive. So:
select t.* except (cancelqty, grp),
       (case when min(case when cancelqty + qty >= 0 then date end) over (partition by customerid, grp) = date
             then cancelqty
             else 0
        end) as cancelqty
from (select t.*,
             -- assumed fix: the source column is qty; the cancel is the negative value in its group
             min(qty) over (partition by customerid, grp) as cancelqty
      from (select t.*,
                   countif(qty < 0) over (partition by customerid order by date desc) as grp
            from transactions t
           ) t
     ) t;
Note: This works for the data you have provided. However, there may be complicated scenarios where this does not work. In fact, I don't think there is a simple optimal solution assuming that the returns are not connected to the original sales. I would suggest that you fix the data model so you record where the returns come from.
The query below seems to satisfy the conditions and produce the output mentioned. The solution maps each row of the base table (t) to the corresponding canceled-qty row from the same table (t1).
First, a self join on CustomerId and StockId is done, since the rows need to correspond to the same customer and product.
Additionally, we bring in the canceled transactions from t1 that happened on or after the base row in t (t.Dt<=t1.Dt); to ensure these have a negative qty, the t1.Qty<0 clause is added.
Further, we cannot attribute a cancel to a sale whose quantity is smaller than the canceled quantity. Therefore I check that the sold quantity is at least the canceled quantity, negating the cancel qty so that the two can be compared easily: -(t1.Qty)<=t.Qty
After the join, we are interested only in the rows with positive qty, so a where clause, t.Qty>0, filters the cancel rows out of the base table t.
Now each sale is joined to every qualifying cancel row dated on or after it. For example, the Qty 50 can have all the canceled quantities mapped to it, but we are interested only in the one that came immediately after. So we group by the base row and keep the earliest cancel date with the Having clause condition HAVING IFNULL(t1.dt, '0')=MIN(IFNULL(t1.dt, '0'))
Finally, we get the rows we need, and we can exclude the last column if required using an outer select query.
SELECT t.CustomerId,t.StockId,t.Qty,t.Dt,IFNULL(t1.Qty, 0) CancelQty
,t1.dt dt_t1
FROM tbl t
LEFT JOIN tbl t1 ON t.CustomerId=t1.CustomerId AND
t.StockId=t1.StockId
AND t.Dt<=t1.Dt AND t1.Qty<0 AND -(t1.Qty)<=t.Qty
WHERE t.Qty>0
GROUP BY 1,2,3,4
HAVING IFNULL(t1.dt, '0')=MIN(IFNULL(t1.dt, '0'))
ORDER BY 1,2,4,3
Consider below approach
with sales as (
select * from `project.dataset.table` where Qty > 0
), cancels as (
select * from `project.dataset.table` where Qty < 0
)
select any_value(s).*,
ifnull(array_agg(c.Qty order by c.Date limit 1)[offset(0)], 0) as CancelQty
from sales s
left join cancels c
on s.CustomerId = c.CustomerId
and s.StockId = c.StockId
and s.Date <= c.Date
and s.Qty > abs(c.Qty)
group by format('%t', s)
If applied to the sample data in your question, the output matches the expected table shown there.
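The join-then-pick-earliest idea in the last query can be traced by hand with a small sketch. This Python version is only an illustration under the same matching rule (same customer and stock, cancel dated on or after the sale, sale quantity strictly greater than the canceled quantity); the function name attach_cancels is hypothetical.

```python
from datetime import date

# Sample rows from the question: (customer_id, stock_id, qty, date)
rows = [
    (1, 100, 50, date(2020, 1, 1)),
    (1, 100, -10, date(2020, 1, 10)),
    (1, 100, 60, date(2020, 2, 10)),
    (1, 100, -20, date(2020, 2, 10)),
    (1, 100, 200, date(2020, 3, 1)),
    (1, 100, 10, date(2020, 3, 5)),
    (1, 100, -90, date(2020, 3, 10)),
]


def attach_cancels(rows):
    """Return each sale row with the qty of its earliest qualifying cancel (0 if none)."""
    sales = [r for r in rows if r[2] > 0]
    cancels = [r for r in rows if r[2] < 0]
    out = []
    for cust, stock, qty, day in sales:
        # Same customer and stock, cancel on or after the sale, sale larger than the cancel
        candidates = [
            c for c in cancels
            if c[0] == cust and c[1] == stock and day <= c[3] and qty > abs(c[2])
        ]
        earliest = min(candidates, key=lambda c: c[3], default=None)
        out.append((cust, stock, qty, day, earliest[2] if earliest else 0))
    return out


for row in attach_cancels(rows):
    print(row)
```

Like the query, this can attach the same cancel to more than one sale when several sales qualify, so it fits the sample data but is not a general allocation of returns to sales.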

Counting current items by month

I'm trying to build a monthly tally of active equipment, grouped by service area from a database log table. I think I'm 90% of the way there; I have a list of months, along with the total number of items that existed, and grouped by region.
However, I also need to know the state of each item as they were on the first of each month, and this is the part I'm stuck on. For instance, Item 1 is in region A in January, but moves to Region B in February. Item 2 is marked as 'inactive' in February, so shouldn't be counted. My existing query will always count item 1 in region A, and item 2 as 'active'.
I can correctly show that Item 3 is deleted in March, and Item 4 doesn't show up until the April count. I realize that I'm getting the first values because my query is specifying the min date, I'm just not sure how I need to change it to get what I want.
I think I'm looking for a way to group by Max(OperationDate) for each Month.
The Table looks like this:
| EQUIPID | EQUIPNAME | EQUIPACTIVE | DISTRICT | REGION | OPERATIONDATE | OPERATION |
|---------|-----------|-------------|----------|--------|----------------------|-----------|
| 1 | Item 1 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 3 | Item 3 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 0 | 1 | A | 2015-02-10T00:00:00Z | UPD |
| 1 | Item 1 | 1 | 1 | B | 2015-02-15T00:00:00Z | UPD |
| 3 | (null) | (null) | (null) | (null) | 2015-02-21T00:00:00Z | DEL |
| 1 | Item 1 | 1 | 1 | A | 2015-03-01T00:00:00Z | UPD |
| 4 | Item 4 | 1 | 1 | B | 2015-03-10T00:00:00Z | INS |
There is also a subtable that holds attributes that I care about. Its structure is similar. Unfortunately, due to previous design decisions, there is no correlation to operations between the two tables. Any joins will need to be done using the EquipmentID, and have the overlapping states matched up for each date.
Current query:
--cte to build date list
WITH calendar (dt) AS
(SELECT &fromdate from dual
UNION ALL
SELECT Add_Months(dt,1)
FROM calendar
WHERE dt < &todate)
SELECT dt, a.district, a.region, count(*)
FROM
(SELECT EQUIPID, DISTRICT, REGION, OPERATION, MIN(OPERATIONDATE ) AS FirstOp, deleted.deldate
FROM Equipment_Log
LEFT JOIN
(SELECT EQUIPID,MAX(OPERATIONDATE) as DelDate
FROM Equipment_Log
WHERE OPERATION = 'DEL'
GROUP BY EQUIPID
) Deleted
ON Equipment_Log.EQUIPID = Deleted.EQUIPID
WHERE OPERATION <> 'DEL' --AND additional unimportant filters
GROUP BY EQUIPID,DISTRICT, REGION , OPERATION, deldate
) a
INNER JOIN calendar
ON (calendar.dt >= FirstOp AND calendar.dt < deldate)
OR (calendar.dt >= FirstOp AND deldate is null)
LEFT JOIN
( SELECT EQUIPID, MAX(OPERATIONDATE) as latestop
FROM SpecialEquip_Table_Log
--where SpecialEquip filters
group by EQUIPID
) SpecialEquip
ON a.EQUIPID = SpecialEquip.EQUIPID and calendar.dt >= SpecialEquip.latestop
GROUP BY dt, district, region
ORDER BY dt, district, region
Take only the last operation for each id in each month. This is what row_number() and where rn = 1 do.
We have the calendar and the data, so make a partitioned join.
I assumed that you need to fill in values for months where entries for an id are missing. So nvl(lag() ignore nulls) is needed, because if something appeared in January it still exists in February and March, and we need the district and region values from the last non-empty row.
Now you have everything you need to make the count. The part where you mentioned SpecialEquip_Table_Log is up to you: you left-joined this table and did not use it later, so what is it for? Join it if you need it; you have the id.
with
calendar(mth) as (
select date '2015-01-01' from dual union all
select add_months(mth, 1) from calendar where mth < date '2015-05-01'),
data as (
select id, dis, reg, dt, op, act
from (
select equipid id, district dis, region reg,
to_char(operationdate, 'yyyy-mm') dt,
row_number()
over (partition by equipid, trunc(operationdate, 'month')
order by operationdate desc) rn,
operation op, nvl(equipactive, 0) act
from t)
where rn = 1 )
select mth, dis, reg, sum(act) cnt
from (
select id, mth,
nvl(dis, lag(dis) ignore nulls over (partition by id order by mth)) dis,
nvl(reg, lag(reg) ignore nulls over (partition by id order by mth)) reg,
nvl(act, lag(act) ignore nulls over (partition by id order by mth)) act
from calendar
left join data partition by (id) on dt = to_char(mth, 'yyyy-mm') )
group by mth, dis, reg
having sum(act) > 0
order by mth, dis, reg
It may seem complicated, so please run subqueries separately at first to see what is going on. And test :) Hope this helps.
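To see what the carry-forward is doing, here is a small Python sketch of the same idea: for every month, take each item's latest log row in or before that month and count the active ones per region. It is an assumption-laden illustration, not the Oracle query itself; the function name monthly_active_counts is hypothetical, and the district column is left out for brevity.

```python
from collections import Counter

# Sample log from the question: (equip_id, active, region, operation_date, operation)
log = [
    (1, 1, "A", "2015-01-01", "INS"),
    (2, 1, "A", "2015-01-01", "INS"),
    (3, 1, "A", "2015-01-01", "INS"),
    (2, 0, "A", "2015-02-10", "UPD"),
    (1, 1, "B", "2015-02-15", "UPD"),
    (3, None, None, "2015-02-21", "DEL"),
    (1, 1, "A", "2015-03-01", "UPD"),
    (4, 1, "B", "2015-03-10", "INS"),
]


def monthly_active_counts(log, months):
    """months: list of 'YYYY-MM' strings. Return {(month, region): active item count}."""
    counts = Counter()
    for month in months:
        latest = {}
        # Walk the log in date order so later rows overwrite earlier ones per item
        for equip_id, active, region, op_date, op in sorted(log, key=lambda r: r[3]):
            if op_date[:7] <= month:
                latest[equip_id] = (active, region, op)
        for active, region, op in latest.values():
            if op != "DEL" and active == 1:
                counts[(month, region)] += 1
    return dict(counts)


print(monthly_active_counts(log, ["2015-01", "2015-02", "2015-03", "2015-04"]))
```

This uses the state at the end of each month, as the query does. For the state on the first of the month, the comparison would use the full operation date against the month's first day instead of the 'YYYY-MM' prefix.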

How to get first_value from previous window partition

I want to display the BalanceEndOfYesterday value from the day before in a query as shown below.
| Date | Amount | BalanceEndOfDay | BalanceEndOfYesterday |
|------------|-------|-----------------|-----------------------|
| 2020-04-30 | 10 | 130 | 80 |
| 2020-04-30 | 20 | 130 | 80 |
| 2020-04-30 | 30 | 130 | 80 |
| 2020-04-30 | -10 | 130 | 80 |
| 2020-04-29 | 50 | 80 | 0 |
| 2020-04-29 | -10 | 80 | 0 |
| 2020-04-29 | 40 | 80 | 0 |
My query is
SELECT
BalanceEndOfDay ,
first_value(BalanceEndOfDay) OVER (ORDER BY Date DESC) -- here is some sort of window needed
FROM AccountTransactions
You can use apply :
SELECT at.*, COALESCE(at1.BalanceEndOfDay, 0) AS BalanceEndOfYesterday
FROM AccountTransactions at OUTER APPLY
( SELECT TOP (1) at1.BalanceEndOfDay
FROM AccountTransactions at1
WHERE at1.Date < at.Date
ORDER BY at1.Date DESC
) at1;
EDIT: If you only want the balance from exactly one day earlier, then you can use dateadd():
SELECT DISTINCT at.*, COALESCE(at1.balanceendofday, 0) AS BalanceEndOfYesterday
FROM AccountTransactions at LEFT JOIN
AccountTransactions at1
ON at1.date = dateadd(day, -1, at.date);
We could use LAG here, after first aggregating by date to obtain a single end of day balance for each date. Then, we can join your table to this result to pull in the end of day balance from yesterday.
WITH cte AS (
SELECT Date, MAX(BalanceEndOfDay) AS BalanceEndOfDay,
LAG(MAX(BalanceEndOfDay), 1, 0) OVER (ORDER BY Date) As BalanceEndOfYesterday
FROM AccountTransactions
GROUP BY Date
)
SELECT
a1.Date,
a1.Amount,
a1.BalanceEndOfDay,
a2.BalanceEndOfYesterday
FROM AccountTransactions a1
INNER JOIN cte a2
ON a1.Date = a2.Date
ORDER BY
a1.Date DESC;
If you want to do this using only window functions, you can use:
select at.*,
max(case when prev_date = dateadd(day, -1, date) then prev_BalanceEndOfDay end) over (partition by date) as prev_BalanceEndOfDay
from (select at.*,
lag(BalanceEndOfDay) over (order by date) as prev_BalanceEndOfDay,
lag(date) over (order by date) as prev_date
from accounttransactions at
) at;
Note: This interprets "the day before" as being exactly one day before. If it means "the day before in the data", then the first comparison should just be max(case when prev_date <> date . . . ).
Here is a db<>fiddle.
Note that in databases that fully support the range window specification, this can be done directly with logic like this:
max(BalanceEndOfDay) over (order by datediff(day, '2000-01-01', date)
range between 1 preceding and 1 preceding
)
Alas, SQL Server does not support this (standard) functionality.
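The aggregate-then-LAG approach shown earlier (one balance per date, then look one date back) can be sketched as follows. This is an illustrative Python version that reads "the day before" as the previous date present in the data; the function name with_previous_balance is hypothetical.

```python
# Sample rows from the question: (date, amount, balance_end_of_day)
rows = [
    ("2020-04-30", 10, 130),
    ("2020-04-30", 20, 130),
    ("2020-04-30", 30, 130),
    ("2020-04-30", -10, 130),
    ("2020-04-29", 50, 80),
    ("2020-04-29", -10, 80),
    ("2020-04-29", 40, 80),
]


def with_previous_balance(rows):
    """Append to each row the end-of-day balance of the previous date in the data (0 if none)."""
    # Step 1: one balance per date (the GROUP BY in the CTE)
    balance_by_date = {}
    for day, _amount, balance in rows:
        balance_by_date[day] = balance

    # Step 2: walk the dates in order and remember the prior balance (the LAG with default 0)
    previous = {}
    prev_balance = 0
    for day in sorted(balance_by_date):
        previous[day] = prev_balance
        prev_balance = balance_by_date[day]

    # Step 3: join back to the detail rows
    return [(day, amount, balance, previous[day]) for day, amount, balance in rows]


for row in with_previous_balance(rows):
    print(row)
```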

Calculate total time spent by group and one datetime column

I have a workflow application where the workflow is written to the DB as shown below when the status changes. There is no end time as it is a sequence of events. I want to create a query that will group by the WorkFlowID and total the amount of time spent in each. I am not sure how to even begin
My table and data looks like this
+------------+---------------------+
| WorkFlowID | EventTime |
+------------+---------------------+
| 1 | 07/15/2015 12:00 AM |
| 2 | 07/15/2015 12:10 AM |
| 3 | 07/15/2015 12:20 AM |
| 2 | 07/15/2015 12:30 AM |
| 3 | 07/15/2015 12:40 AM |
| 4 | 07/15/2015 12:50 AM |
+------------+---------------------+
My end result should be like:
+------------+-----------------+
| WorkFlowID | TotalTimeInMins |
+------------+-----------------+
| 1 | 10 |
| 2 | 20 |
| 3 | 20 |
| 4 | 10 |
+------------+-----------------+
In SQL Server 2012+, you would just use lead(). There are several ways to approach this in SQL Server 2008. Here is one using outer apply:
select tt.WorkFlowId,
       sum(datediff(second, tt.EventTime, tt.nextTime)) / 60.0 as NumMinutes
from (select t.*, t2.EventTime as nextTime
      from table t outer apply
           (select top 1 t2.*
            from table t2
            where t2.EventTime > t.EventTime
            order by t2.EventTime
           ) t2
     ) tt
group by tt.WorkFlowId;
The only question is how you get "10" for event 4. There is no following event, so that value doesn't make sense. You can use datediff(second, EventTime, coalesce(nextTime, getdate())) to handle the NULL value.
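The "time until the next event" idea behind lead() and the outer apply can be checked with a short sketch. This Python version is only an illustration; the function name minutes_per_workflow and the optional now parameter (standing in for getdate()) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Sample events from the question: (workflow_id, event_time)
events = [
    (1, datetime(2015, 7, 15, 0, 0)),
    (2, datetime(2015, 7, 15, 0, 10)),
    (3, datetime(2015, 7, 15, 0, 20)),
    (2, datetime(2015, 7, 15, 0, 30)),
    (3, datetime(2015, 7, 15, 0, 40)),
    (4, datetime(2015, 7, 15, 0, 50)),
]


def minutes_per_workflow(events, now=None):
    """Sum, per workflow, the minutes from each event to the next event overall.

    The last event has no successor: it runs until `now` if given, otherwise it is skipped.
    """
    ordered = sorted(events, key=lambda e: e[1])
    totals = defaultdict(float)
    for i, (workflow_id, started) in enumerate(ordered):
        if i + 1 < len(ordered):
            ended = ordered[i + 1][1]
        elif now is not None:
            ended = now
        else:
            continue
        totals[workflow_id] += (ended - started).total_seconds() / 60
    return dict(totals)


print(minutes_per_workflow(events))
print(minutes_per_workflow(events, now=datetime(2015, 7, 15, 1, 0)))
```

Without an end time for the final event, workflow 4 simply has no total, which is the gap in the expected output that the answer points out.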
As an alternative:
;WITH t AS (
    SELECT *,
        ROW_NUMBER() OVER (ORDER BY EventTime) As rn
    FROM
        yourTable)
SELECT
    t1.WorkFlowID,
    -- seconds until the next event (or until now for the last event), converted to minutes
    SUM(DATEDIFF(SECOND, t1.EventTime, ISNULL(t2.EventTime, GETDATE()))) / 60 As TotalTimeInMins
FROM t t1
LEFT JOIN t t2
    ON t1.rn = t2.rn - 1
GROUP BY t1.WorkFlowID;
The basis of a method that works in all (ok, I don't know about SQL 6.5) editions is to use the group by clause:
SELECT
WorkFlowID
,datediff(mi, min(EventTime), max(EventTime)) TotalTimeInMins
from MyTable
group by WorkFlowID
This does indeed leave the question of how you got 10 minutes with a start time and (presumably) no end time. As written, this query would list the WorkFlowID with TotalTimeInMins = 0, which seems accurate enough. The following variant would remove all "start-only" items:
SELECT
WorkFlowID
,datediff(mi, min(EventTime), max(EventTime)) TotalTimeInMins
from MyTable
group by WorkFlowID
having count(*) > 1
(The quick explanation: having is to group by as where is to from)

Group rows into sequences using a sliding window on a DateTime column

I have a table that stores timestamped events. I want to group the events into 'sequences' by using 5-min sliding window on the timestamp column, and write the 'sequence ID' (any ID that can distinguish sequences) and 'order in sequence' into another table.
Input - event table:
+----+-------+-----------+
| Id | Name | Timestamp |
+----+-------+-----------+
| 1 | test | 00:00:00 |
| 2 | test | 00:06:00 |
| 3 | test | 00:10:00 |
| 4 | test | 00:14:00 |
+----+-------+-----------+
Desired output - sequence table. Here SeqId is the ID of the starting event, but it doesn't have to be, just something to uniquely identify a sequence.
+---------+-------+----------+
| EventId | SeqId | SeqOrder |
+---------+-------+----------+
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 2 | 2 |
| 4 | 2 | 3 |
+---------+-------+----------+
What would be the best way to do it? This is MSSQL 2008, I can use SSAS and SSIS if they make things easier.
CREATE TABLE #Input (Id INT, Name VARCHAR(20), Time_stamp TIME)
INSERT INTO #Input
VALUES
( 1 ,'test','00:00:00' ),
( 2 ,'test','00:06:00' ),
( 3 ,'test','00:10:00' ),
( 4 ,'test','00:14:00' )
SELECT * FROM #Input;
WITH cte AS -- add a sequential number
(
SELECT *,
ROW_NUMBER() OVER(ORDER BY Id) AS sort
FROM #Input
), cte2 as -- find the Id's with a difference of more than 5min
(
SELECT cte.*,
CASE WHEN DATEDIFF(MI, cte_1.Time_stamp,cte.Time_stamp) < 5 THEN 0 ELSE 1 END as GrpType
FROM cte
LEFT OUTER JOIN
cte as cte_1 on cte.sort =cte_1.sort +1
), cte3 as -- assign a SeqId
(
SELECT GrpType, Time_Stamp,ROW_NUMBER() OVER(ORDER BY Time_stamp) SeqId
FROM cte2
WHERE GrpType = 1
), cte4 as -- find the Time_Stamp range per SeqId
(
SELECT cte3.*,cte_2.Time_stamp as TS_to
FROM cte3
LEFT OUTER JOIN
cte3 as cte_2 on cte3.SeqId =cte_2.SeqId -1
)
-- final query
SELECT
t.Id,
cte4.SeqId,
ROW_NUMBER() OVER(PARTITION BY cte4.SeqId ORDER BY t.Time_stamp) AS SeqOrder
FROM cte4 INNER JOIN #Input t ON t.Time_stamp>=cte4.Time_stamp AND (t.Time_stamp <cte4.TS_to OR cte4.TS_to IS NULL);
This code is slightly more complex, but it returns the expected output (which Gordon Linoff's solution, as originally posted, doesn't), and it's even slightly faster.
You seem to want things grouped together when they are less than five minutes apart. You can assign the groups by getting the previous time stamp and marking the beginning of a group. You then need to do a cumulative sum to get the group id:
with e as (
      select e.*,
             -- flag = 1 marks the start of a new group (no previous row, or a gap of 5+ minutes)
             (case when datediff(minute, prev_timestamp, timestamp) < 5 then 0 else 1 end) as flag
      from (select e.*,
                   (select top 1 e2.timestamp
                    from events e2
                    where e2.timestamp < e.timestamp
                    order by e2.timestamp desc
                   ) as prev_timestamp
            from events e
           ) e
     )
select e.eventId, e.seqId,
       row_number() over (partition by seqId order by timestamp) as seqOrder
from (select e.*, (select sum(flag) from e e2 where e2.timestamp <= e.timestamp) as seqId
      from e
     ) e;
By the way, this logic is easier to express in SQL Server 2012+ because the window functions are more powerful.
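For comparison, the grouping rule itself (start a new sequence whenever the gap to the previous event is five minutes or more) is short when written procedurally. This is a sketch in Python rather than SQL, with a hypothetical function name assign_sequences; it uses the ID of the first event in each sequence as the sequence ID, as in the desired output.

```python
from datetime import timedelta

# Sample events from the question: (event_id, time of day as an offset from midnight)
events = [
    (1, timedelta(minutes=0)),
    (2, timedelta(minutes=6)),
    (3, timedelta(minutes=10)),
    (4, timedelta(minutes=14)),
]


def assign_sequences(events, gap=timedelta(minutes=5)):
    """Return (event_id, seq_id, seq_order) for events ordered by timestamp."""
    out = []
    seq_id = None
    order = 0
    prev = None
    for event_id, ts in sorted(events, key=lambda e: e[1]):
        # First event, or a gap of at least `gap` since the previous one: new sequence
        if prev is None or ts - prev >= gap:
            seq_id = event_id
            order = 0
        order += 1
        out.append((event_id, seq_id, order))
        prev = ts
    return out


for row in assign_sequences(events):
    print(row)
```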