Identify value changes in history table - sql

I have the following table which, apart from other attributes, contains:
Customer ID - unique identifier
Value
CreatedDate - when the record has been created (based on ETL)
UpdatedDate - until when the record has been valid
Since other attributes besides [Value] are also tracked historically, there may be multiple rows with the same [Value] for the same customer but different timestamps in [CreatedDate] / [UpdatedDate]. Thus, the data may look like:
Customer ID  Value  CreatedDate       UpdatedDate
------------------------------------------------------
1            111    04/08/2021 15:00  04/08/2021 17:00
1            111    01/08/2021 09:00  04/08/2021 15:00
1            222    20/07/2021 01:30  01/08/2021 09:00
1            222    01/06/2021 08:00  20/07/2021 01:30
1            111    01/04/2021 07:15  01/06/2021 08:00
2            333    03/08/2021 04:30  04/08/2021 17:00
2            444    23/07/2021 01:20  03/08/2021 04:30
2            444    01/04/2021 13:50  23/07/2021 01:20
I would like to keep the unique [Value]s in the correct sequence, that is, keep each [Value] with its earliest [CreatedDate]. However, if a customer originally had Value1, then changed to Value2, and finally changed back to Value1, I would like to keep both of these changes as well. Hence the ideal output should look like:
Customer ID  Value  CreatedDate       UpdatedDate
------------------------------------------------------
1            111    01/08/2021 09:00  04/08/2021 17:00
1            222    01/06/2021 08:00  01/08/2021 09:00
1            111    01/04/2021 07:15  01/06/2021 08:00
2            333    03/08/2021 04:30  04/08/2021 17:00
2            444    01/04/2021 13:50  03/08/2021 04:30
Based on CreatedDate / UpdatedDate, I want to identify the chronological sequence of changes, along with the earliest CreatedDate and latest UpdatedDate for each one. However, if a particular value appeared multiple times but was interspersed with a different value, I would like to keep each occurrence too.
I've tried the approach below. It works in simple cases, but not for the scenario above, where the output looks like:
SELECT [Customer ID]
      ,Value
      ,MIN(CreatedDate) as CreatedDate
      ,MAX(UpdatedDate) as UpdatedDate
FROM #History
GROUP BY [Customer ID], Value
Customer ID  Value  CreatedDate       UpdatedDate
------------------------------------------------------
1            111    01/04/2021 07:15  04/08/2021 17:00
1            222    01/06/2021 08:00  01/08/2021 09:00
2            333    03/08/2021 04:30  04/08/2021 17:00
2            444    01/04/2021 13:50  03/08/2021 04:30
Any ideas, please? I've tried using LAG and LEAD as well, but was not able to make it work either.

This is a type of gaps-and-islands problem that is probably best solved by looking for overlaps using a cumulative maximum:
select customerid, value,
       min(createddate) as createddate,
       max(updateddate) as updateddate
from (select t.*,
             -- running count of "island starts" within each customer/value
             sum(case when prev_updatedate >= createddate then 0 else 1 end) over (partition by customerid, value order by createddate) as grp
      from (select h.*,
                   -- latest updateddate among the earlier rows for the same customer/value
                   max(updateddate) over (partition by customerid, value order by createddate rows between unbounded preceding and 1 preceding) as prev_updatedate
            from #history h
           ) t
     ) g
group by customerid, value, grp;
The logic is to look at the most recent updateddate before each row for each customer and value. If this is earlier than the row's createddate (or there is no earlier row), then the row starts a new group.
The final result is just aggregating the rows in each group.
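To make the grouping idea concrete, here is a minimal Python sketch of the same result on the question's sample rows. It is not the windowed query above: it assumes each customer's rows are contiguous in time, sorts them by CreatedDate, and merges adjacent rows that share a value. The names `HISTORY`, `parse` and `collapse_history` are made up for illustration.

```python
from datetime import datetime
from itertools import groupby

def parse(ts):
    # Timestamps in the question are written as day/month/year.
    return datetime.strptime(ts, "%d/%m/%Y %H:%M")

HISTORY = [  # (customer_id, value, created, updated), copied from the question
    (1, 111, "04/08/2021 15:00", "04/08/2021 17:00"),
    (1, 111, "01/08/2021 09:00", "04/08/2021 15:00"),
    (1, 222, "20/07/2021 01:30", "01/08/2021 09:00"),
    (1, 222, "01/06/2021 08:00", "20/07/2021 01:30"),
    (1, 111, "01/04/2021 07:15", "01/06/2021 08:00"),
    (2, 333, "03/08/2021 04:30", "04/08/2021 17:00"),
    (2, 444, "23/07/2021 01:20", "03/08/2021 04:30"),
    (2, 444, "01/04/2021 13:50", "23/07/2021 01:20"),
]

def collapse_history(rows):
    """Merge adjacent rows that share a customer and value into one island."""
    parsed = [(c, v, parse(created), parse(updated)) for c, v, created, updated in rows]
    parsed.sort(key=lambda r: (r[0], r[2]))  # customer, then CreatedDate
    islands = []
    # groupby only merges neighbouring items, so a value that comes back
    # after a different value forms a separate island.
    for (customer, value), run in groupby(parsed, key=lambda r: (r[0], r[1])):
        run = list(run)
        islands.append((customer, value,
                        min(r[2] for r in run),
                        max(r[3] for r in run)))
    return islands

islands = collapse_history(HISTORY)
for customer, value, created, updated in islands:
    print(customer, value, created, updated)
```

On the sample data this yields five rows, with value 111 appearing twice for customer 1, once before and once after value 222.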

Related

How do I calculate the amount of time between multiple datetimes in multiple rows in sql

I've done a search but I can't find any that are exactly what I need. I need to be able to calculate the amount of time that someone has been in the building over time in a sql query (T-SQL on SQL Server). The data looks like this:
UserId Clocking Status
------------------------------
1 01/12/2020 09:00 In
2 01/12/2020 09:12 In
1 01/12/2020 09:25 Out
3 01/12/2020 10:00 In
2 01/12/2020 10:45 Out
3 01/12/2020 13:11 Out
1 03/12/2020 11:14 In
2 03/12/2020 15:56 In
1 03/12/2020 16:04 Out
2 03/12/2020 17:00 Out
I want the output to look like this:
UserId TimeInBuilding
----------------------
1 03:35
2 05:25
3 03:11
Assuming that the ins/outs are perfectly interleaved, you can do this by assigning the next "out" time to the "in" time and aggregating:
select userid,
       sum(datediff(second, clocking, out_time)) / (60.0 * 60) as decimal_hours
from (select t.*,
             lead(clocking) over (partition by userid order by clocking) as out_time
      from t
     ) t
where status = 'In'
group by userid;
You can convert this to HH:MM format using:
select userid,
       -- note: converting to time wraps around for totals of 24 hours or more
       convert(varchar(5),
               convert(time,
                       dateadd(second,
                               sum(datediff(second, clocking, out_time)),
                               0)
                      )
              ) as hhmm
from (select t.*,
             lead(clocking) over (partition by userid order by clocking) as out_time
      from t
     ) t
where status = 'In'
group by userid;
Here is a db<>fiddle.
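The same pairing logic can be sketched outside the database. The Python below is only an illustration under the same assumption as the answer (ins and outs perfectly interleaved per user); `CLOCKINGS`, `seconds_in_building` and `as_hhmm` are hypothetical names, and the rows are the question's sample data.

```python
from collections import defaultdict
from datetime import datetime

CLOCKINGS = [  # (user_id, clocking, status), copied from the question
    (1, "01/12/2020 09:00", "In"),
    (2, "01/12/2020 09:12", "In"),
    (1, "01/12/2020 09:25", "Out"),
    (3, "01/12/2020 10:00", "In"),
    (2, "01/12/2020 10:45", "Out"),
    (3, "01/12/2020 13:11", "Out"),
    (1, "03/12/2020 11:14", "In"),
    (2, "03/12/2020 15:56", "In"),
    (1, "03/12/2020 16:04", "Out"),
    (2, "03/12/2020 17:00", "Out"),
]

def seconds_in_building(clockings):
    """Sum, per user, the seconds between each 'In' and the next clocking."""
    events = sorted(
        (user, datetime.strptime(ts, "%d/%m/%Y %H:%M"), status)
        for user, ts, status in clockings
    )
    totals = defaultdict(int)
    # Pairing each event with its successor plays the role of lead().
    for (user, ts, status), (next_user, next_ts, _) in zip(events, events[1:]):
        if status == "In" and next_user == user:
            totals[user] += int((next_ts - ts).total_seconds())
    return dict(totals)

def as_hhmm(seconds):
    """Format a number of seconds as HH:MM (hours are not capped at 24)."""
    hours, remainder = divmod(seconds, 3600)
    return f"{hours:02d}:{remainder // 60:02d}"

totals = seconds_in_building(CLOCKINGS)
for user in sorted(totals):
    print(user, as_hhmm(totals[user]))
```

For user 3, who has a single In/Out pair, this gives 03:11. Unlike the `convert(time, ...)` approach, the formatting helper does not wrap at 24 hours.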

Grouping sets of data in Oracle SQL

I have been trying to separate groups in data being stored on my oracle database for more accurate analysis.
Current Output
Time Location
10:00 A111
11:00 A112
12:00 S111
13:00 S234
17:00 A234
18:00 S747
19:00 A878
Desired Output
Time Location Group Number
10:00 A111 1
11:00 A112 1
12:00 S111 1
13:00 S234 1
17:00 A234 2
18:00 S747 2
19:00 A878 3
I have been trying to use over and partition by to assign the values, however I can only get it to increment on every row, not only on a change. I also tried using lag but struggled to make use of it.
I only need the value in the second column to start from 1 and increment when the first letter of field 1 changes (using substr).
This is my attempt using row_number but I am far off I think. There would be a time column in the output as well not shown above.
select event_time, st_location,
       Row_Number() over (partition by SUBSTR(location,1,1) order by event_time) as groupnumber
from pic
Any help would be really appreciated!
Edit:
Time Location Group Number
10:00 A-10112 1
11:00 A-10421 1
12:00 ST-10621 1
13:00 ST-23412 1
17:00 A-19112 2
18:00 ST-74712 2
19:00 A-87812 3
It is a gaps-and-islands problem. Use the following code:
select time,
       location,
       dense_rank() over (partition by SUBSTR(location,1,1) order by grp) as group_number
from
(
    -- grp is constant within each run of consecutive rows sharing a first letter
    select (row_number() over (order by time)) -
           (row_number() over (partition by SUBSTR(location,1,1) order by time)) grp,
           location,
           time
    from data
) t
order by time
dbfiddle demo
The main idea is in the subquery which isolates consecutive sequences of items (computation of grp column). The rest is simple once you have the grp column.
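As a cross-check of what the query is meant to produce, here is a small procedural sketch in Python. It does not use the row_number difference; it simply counts, per first letter, how many runs of that letter have started so far. `LOCATIONS` and `group_numbers` are illustrative names, and the input is assumed to be already ordered by time.

```python
LOCATIONS = ["A111", "A112", "S111", "S234", "A234", "S747", "A878"]

def group_numbers(locations):
    """Return one group number per location, in input (time) order."""
    runs_started = {}   # first letter -> number of runs of that letter so far
    previous = None     # first letter of the previous row
    numbers = []
    for location in locations:
        prefix = location[0]
        if prefix != previous:
            # The first letter changed, so a new run of this letter begins.
            runs_started[prefix] = runs_started.get(prefix, 0) + 1
            previous = prefix
        numbers.append(runs_started[prefix])
    return numbers

numbers = group_numbers(LOCATIONS)
for location, number in zip(LOCATIONS, numbers):
    print(location, number)
```

On the question's sample this assigns 1, 1, 1, 1, 2, 2, 3, which is the desired Group Number column.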
select DENSE_RANK() over (partition by SUBSTR("location",1,1) ORDER BY SUBSTR("location",1,2)) as Rownumber,
       "location"
from Table1;
Demo
http://sqlfiddle.com/#!4/21120/16

Teradata SQL: Determine how many accounts had status change in given month

Ok, so I have a table that looks something like this:
Acct_id Eff_dt Expr_dt Prod_cd Open_dt
-------------------------------------------------------
111 2012-05-01 2013-06-01 A 2012-05-01
111 2013-06-02 2014-03-08 A 2012-05-01
111 2014-03-09 9999-12-31 B 2012-05-01
222 2015-07-15 2015-11-11 A 2015-07-15
222 2015-11-12 2016-08-08 B 2015-07-15
222 2016-08-09 9999-12-31 A 2015-07-15
333 2016-01-01 2016-04-15 B 2016-01-01
333 2016-04-16 2016-08-08 B 2016-01-01
333 2016-08-09 9999-12-31 A 2016-01-01
444 2017-02-03 2017-05-15 A 2017-02-03
444 2017-05-16 2017-12-02 A 2017-02-03
444 2017-12-03 9999-12-31 B 2017-02-03
555 2017-12-12 9999-12-31 B 2017-12-12
There are many more columns that I'm not including as they're otherwise not relevant.
What I'm trying to determine is how many accounts had a change in Prod_cd in a given month, but then only in one direction (so from A > B in this example). Sometimes however an account was first opened as B, and then later changed to A. Or it was opened as A, changed to B, and moved back to A. I only want to know the current set of accounts where in a given month the Prod_cd changed from A to B.
Eff_dt is the date when a change was made to an account (could be any change, such as address change, name change, or what I'm looking for, product code change).
Expr_dt is the expiration date of that row, essentially the last day before a new change was made. When the date of that row is 9999-12-31, that's the most current row.
Open_dt is the date the account was created.
I created a query at first that was something like this:
select
count(distinct acct_id)
from table
where prod_cd = 'B'
and expr_dt = '9999-12-31'
and eff_dt between '2017-12-01' and '2017-12-31'
and open_dt < '2017-12-01'
But it's giving me results that don't look right. I want to specifically track the # of conversions that happened, but the count of accounts I'm getting seems way too high.
There is probably a way to create a more reliable query using window functions, but given that the Prod_cd changes can happen in multiple directions, I'm not sure how to write that query. Any help would be appreciated!
If you are specifically looking for the switch A --> B, then the simplest method is to use lag(). But, Teradata requires a slightly different formulation:
select count(distinct acct_id)
from (select t.*,
             -- previous row's prod_cd for the same account (stand-in for lag())
             max(prod_cd) over (partition by acct_id order by eff_dt rows between 1 preceding and 1 preceding) as prev_prod_cd
      from t
     ) t
where prod_cd = 'B' and prev_prod_cd = 'A' and
      expr_dt = '9999-12-31' and
      eff_dt between '2017-12-01' and '2017-12-31' and
      open_dt < '2017-12-01';
I am guessing that the date conditions go in the outer query, meaning that the lag() does not use them.
Similar to Gordon's answer, but using a supported window function (instead of LAG) and using Teradata's QUALIFY clause to do the lag-gy lookup:
SELECT DISTINCT acct_id
FROM mytable
QUALIFY
MAX(prod_cd) OVER (PARTITION BY acct_id ORDER BY eff_dt ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) = 'A'
AND prod_cd = 'B'
AND expr_dt = '9999-12-31'
AND eff_dt between DATE '2013-01-01' and DATE '2017-12-31'
AND open_dt < DATE '2017-12-01'
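For readers who want to check the previous-row logic by hand, here is a rough Python sketch over the question's sample rows. It applies the same conditions as the first query (current row is B, the row before it is A, the change took effect in the given month, and the account was opened before that month). `ROWS` and `converted_accounts` are hypothetical names, and the month bounds are treated as inclusive.

```python
from datetime import date

ROWS = [  # (acct_id, eff_dt, expr_dt, prod_cd, open_dt), copied from the question
    (111, "2012-05-01", "2013-06-01", "A", "2012-05-01"),
    (111, "2013-06-02", "2014-03-08", "A", "2012-05-01"),
    (111, "2014-03-09", "9999-12-31", "B", "2012-05-01"),
    (222, "2015-07-15", "2015-11-11", "A", "2015-07-15"),
    (222, "2015-11-12", "2016-08-08", "B", "2015-07-15"),
    (222, "2016-08-09", "9999-12-31", "A", "2015-07-15"),
    (333, "2016-01-01", "2016-04-15", "B", "2016-01-01"),
    (333, "2016-04-16", "2016-08-08", "B", "2016-01-01"),
    (333, "2016-08-09", "9999-12-31", "A", "2016-01-01"),
    (444, "2017-02-03", "2017-05-15", "A", "2017-02-03"),
    (444, "2017-05-16", "2017-12-02", "A", "2017-02-03"),
    (444, "2017-12-03", "9999-12-31", "B", "2017-02-03"),
    (555, "2017-12-12", "9999-12-31", "B", "2017-12-12"),
]

CURRENT_ROW = date(9999, 12, 31)  # expr_dt that marks the most current row

def converted_accounts(rows, month_start, month_end, old="A", new="B"):
    """Accounts whose current row is `new` and whose previous row was `old`."""
    histories = {}
    for acct_id, eff, expr, prod, opened in rows:
        histories.setdefault(acct_id, []).append(
            (date.fromisoformat(eff), date.fromisoformat(expr), prod,
             date.fromisoformat(opened)))
    converted = set()
    for acct_id, history in histories.items():
        history.sort()  # by eff_dt
        # Compare every row with the one immediately before it.
        for previous, current in zip(history, history[1:]):
            eff_dt, expr_dt, prod_cd, open_dt = current
            if (previous[2] == old and prod_cd == new
                    and expr_dt == CURRENT_ROW
                    and month_start <= eff_dt <= month_end
                    and open_dt < month_start):
                converted.add(acct_id)
    return converted

december = converted_accounts(ROWS, date(2017, 12, 1), date(2017, 12, 31))
print(sorted(december))
```

With the sample rows, only account 444 qualifies for December 2017; account 555 was opened as B in that month and has no previous row.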

How to retrieve trips from historical data?

I have the following table mytable in Hive:
id radar_id car_id datetime
1 A21 123 2017-03-08 17:31:19.0
2 A21 555 2017-03-08 17:32:00.0
3 A21 777 2017-03-08 17:33:00.0
4 B15 123 2017-03-08 17:35:22.0
5 B15 555 2017-03-08 17:34:05.0
5 B15 777 2017-03-08 20:50:12.0
6 A21 123 2017-03-09 11:00:00.0
7 C11 123 2017-03-09 11:10:00.0
8 A21 123 2017-03-09 11:12:00.0
9 A21 555 2017-03-09 11:12:10.0
10 B15 123 2017-03-09 11:14:00.0
11 C11 555 2017-03-09 11:20:00.0
I want to get the routes of cars passing through radars A21 and B15 within the same trip. For example, if the date is different for the same car_id, then it is not the same trip. Basically, I want to consider that the maximum time difference between radars A21 and B15 for the same vehicle should be 30 minutes. If it's bigger, then the trip is not the same, like for example for the car_id 777.
My final goal is to count the average number of trips per day (non-unique, so if the same car passed 2 times by the same route, then it should be calculated 2 times).
The expected result is the following one:
radar_start radar_end avg_tripscount_per_day
A21 B15 1.5
On the date 2017-03-08 there are 2 trips between radars A21 and B15 (car 777 is not considered due to the 30-minute limit), while on the date 2017-03-09 there is only 1 trip. The average is (2+1)/2 = 1.5 trips per day.
How can I get this result? Basically, I do not know how to introduce 30 minutes limit in the query and how to group rides by radar_start and radar_end.
Thanks.
Update:
The trip is registered at the date it started.
If the car was triggered by radar A21 at 2017-03-08 23:55 and by radar B15 at 2017-03-09 00:15, then it should be considered as the same trip registered for the date 2017-03-08.
In the case of ids 6 and 8, the same car 123 passed A21 twice and then turned to B15 (id 10). Only the last pass, id 8, should be considered, giving the trip 8-10: the A21 reading closest before B15. The interpretation is that the car passed A21 twice and turned to B15 the second time.
select count(*) / count(distinct to_date(datetime)) as trips_per_day
from (select radar_id
            ,datetime
            ,lead(radar_id) over w as next_radar_id
            ,lead(datetime) over w as next_datetime
      from mytable
      where radar_id in ('A21','B15')
      window w as
      (
          partition by car_id
          order by datetime
      )
     ) t
where radar_id = 'A21'
  and next_radar_id = 'B15'
  and datetime + interval '30' minutes >= next_datetime
;
+----------------+
| trips_per_day |
+----------------+
| 1.5 |
+----------------+
P.S.
If your version does not support intervals, the last condition could be replaced by:
and to_unix_timestamp(datetime) + 30*60 >= to_unix_timestamp(next_datetime)
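The lead-based pairing can also be traced by hand. The Python below is a minimal sketch of the same steps on the question's sample rows: keep only A21/B15 readings, order each car's readings by time, pair each reading with the next one, and count A21 followed by B15 within 30 minutes, registered on the start date. `READINGS` and `average_trips_per_day` are illustrative names only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

READINGS = [  # (radar_id, car_id, datetime), copied from the question
    ("A21", 123, "2017-03-08 17:31:19"),
    ("A21", 555, "2017-03-08 17:32:00"),
    ("A21", 777, "2017-03-08 17:33:00"),
    ("B15", 123, "2017-03-08 17:35:22"),
    ("B15", 555, "2017-03-08 17:34:05"),
    ("B15", 777, "2017-03-08 20:50:12"),
    ("A21", 123, "2017-03-09 11:00:00"),
    ("C11", 123, "2017-03-09 11:10:00"),
    ("A21", 123, "2017-03-09 11:12:00"),
    ("A21", 555, "2017-03-09 11:12:10"),
    ("B15", 123, "2017-03-09 11:14:00"),
    ("C11", 555, "2017-03-09 11:20:00"),
]

def average_trips_per_day(readings, start="A21", end="B15",
                          limit=timedelta(minutes=30)):
    """Average number of start->end trips per day that has at least one trip."""
    passes_by_car = defaultdict(list)
    for radar_id, car_id, ts in readings:
        if radar_id in (start, end):  # other radars are ignored, as in the query
            passes_by_car[car_id].append(
                (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), radar_id))
    trip_days = []
    for passes in passes_by_car.values():
        passes.sort()
        # Pair each reading with the next reading of the same car (lead()).
        for (ts, radar), (next_ts, next_radar) in zip(passes, passes[1:]):
            if radar == start and next_radar == end and next_ts - ts <= limit:
                trip_days.append(ts.date())  # trip is registered on its start date
    if not trip_days:
        return 0.0
    return len(trip_days) / len(set(trip_days))

trips_per_day = average_trips_per_day(READINGS)
print(trips_per_day)
```

On the sample rows this finds two trips on 2017-03-08 and one on 2017-03-09. Note that, like the query, it divides by the number of days that have at least one trip, not by the number of calendar days in the period.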
I missed that you're using Hive, so I started writing a query for SQL Server, but maybe it will still help you. Try something like this:
QUERY
select radar_start,
       radar_end,
       convert(decimal(6,3), count(*)) / convert(decimal(6,3), count(distinct dt)) as avg_tripscount_per_day
from (
    select t1.radar_id as radar_start,
           t2.radar_id as radar_end,
           convert(date, t1.[datetime]) dt,
           row_number() over (partition by t1.radar_id, t1.car_id, convert(date, t1.[datetime]) order by t1.[datetime] desc) rn1,
           row_number() over (partition by t2.radar_id, t2.car_id, convert(date, t2.[datetime]) order by t2.[datetime] desc) rn2
    from trips as t1
    join trips as t2 on t1.car_id = t2.car_id
        and datediff(minute, t1.[datetime], t2.[datetime]) between 0 and 30
        and t1.radar_id = 'A21'
        and t2.radar_id = 'B15'
) x
where rn1 = 1 and rn2 = 1
group by radar_start, radar_end
OUTPUT
radar_start radar_end avg_tripscount_per_day
A21 B15 1.5000000000
SAMPLE DATA
create table trips
(
id int,
radar_id char(3),
car_id int,
[datetime] datetime
)
insert into trips values
(1,'A21',123,'2017-03-08 17:31:19.0'),
(2,'A21',555,'2017-03-08 17:32:00.0'),
(3,'A21',777,'2017-03-08 17:33:00.0'),
(4,'B15',123,'2017-03-08 17:35:22.0'),
(5,'B15',555,'2017-03-08 17:34:05.0'),
(5,'B15',777,'2017-03-08 20:50:12.0'),
(6,'A21',123,'2017-03-09 11:00:00.0'),
(7,'C11',123,'2017-03-09 11:10:00.0'),
(8,'A21',123,'2017-03-09 11:12:00.0'),
(9,'A21',555,'2017-03-09 11:12:10.0'),
(8,'B15',123,'2017-03-09 11:14:00.0'),
(9,'C11',555,'2017-03-09 11:20:00.0')

How many Days each item was in each State, the full value of the period

This post is really similar to my question:
SQL Server : how many days each item was in each state
but I don't have the column Revision to see which is the previous state, and also I want to get the full time of a status, I b
....
I want to get how long an item has been in each status overall. My table looks like this:
ID DATE STATUS
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 2016-04-05 11:30:00.000 1
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 2016-04-08 11:30:00.000 13
274C5DA9-9C38-4A54-A697-009933BB7B7F 2016-04-29 08:00:00.000 5
274C5DA9-9C38-4A54-A697-009933BB7B7F 2016-05-04 08:00:00.000 4
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-14 07:50:00.000 1
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-21 14:00:00.000 2
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-23 12:15:00.000 3
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-23 16:15:00.000 1
BF122AE1-CB39-4967-8F37-012DC55E92A7 2016-04-05 10:30:00.000 1
BF122AE1-CB39-4967-8F37-012DC55E92A7 2016-04-20 17:00:00.000 5
I want to get this
Column 1: ID, Column 2: Status, Column 3: Time with the status
Column 3 (time with the status) = NextDate - PreviousDate + 1
If it is the last status, it counts as 1.
If there is more than one status on the same day, I take the last one (only the last status of the day matters).
Per ID, each status must be unique.
It should look like this:
ID STATUS TIME
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 1 3
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 13 1
274C5DA9-9C38-4A54-A697-009933BB7B7F 5 5
274C5DA9-9C38-4A54-A697-009933BB7B7F 4 1
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 1 8
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2 2
BF122AE1-CB39-4967-8F37-012DC55E92A7 1 15
BF122AE1-CB39-4967-8F37-012DC55E92A 5 1
Thanks to @ConradFrix's comments, this is how it works:
WITH CTE
AS
(
    SELECT
        ID,
        STATUS,
        DATE,
        LEAD(DATE, 1) over (partition by ID order by DATE) LEAD,
        ISNULL(DATEDIFF(DAYOFYEAR, DATE,
               LEAD(DATE, 1) over (partition by ID order by DATE)), 1) DIF_BY_LEAD
    FROM TABLE_NAME
)
SELECT ID, STATUS, SUM(DIF_BY_LEAD) AS TIME_STATUS
FROM CTE
GROUP BY ID, STATUS
ORDER BY ID, STATUS
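To see what the query computes row by row, here is a small Python sketch of the same steps for one item from the question: take the calendar-day difference to the next row, count 1 for the last row, and sum per status. `ROWS` and `days_per_status` are made-up names. As in the query, a status that is replaced on the same day contributes 0 days rather than being dropped.

```python
from collections import defaultdict
from datetime import datetime

ITEM = "A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2"
ROWS = [  # (id, date, status), copied from the question
    (ITEM, "2016-04-14 07:50:00.000", 1),
    (ITEM, "2016-04-21 14:00:00.000", 2),
    (ITEM, "2016-04-23 12:15:00.000", 3),
    (ITEM, "2016-04-23 16:15:00.000", 1),
]

def days_per_status(rows):
    """Sum, per (id, status), the calendar days until the item's next row."""
    events_by_item = defaultdict(list)
    for item_id, ts, status in rows:
        events_by_item[item_id].append(
            (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"), status))
    totals = defaultdict(int)
    for item_id, events in events_by_item.items():
        events.sort()
        for i, (ts, status) in enumerate(events):
            if i + 1 < len(events):
                # Difference in calendar days, ignoring the time of day.
                days = (events[i + 1][0].date() - ts.date()).days
            else:
                days = 1  # the last status counts as one day
            totals[(item_id, status)] += days
    return dict(totals)

totals = days_per_status(ROWS)
for (item_id, status), days in sorted(totals.items()):
    print(item_id, status, days)
```

For this item the sketch gives 8 days for status 1 (7 from the first row plus 1 for the last row) and 2 days for status 2, matching the expected output, plus a row with 0 days for status 3.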