I have the following table mytable in Hive:
id radar_id car_id datetime
1 A21 123 2017-03-08 17:31:19.0
2 A21 555 2017-03-08 17:32:00.0
3 A21 777 2017-03-08 17:33:00.0
4 B15 123 2017-03-08 17:35:22.0
5 B15 555 2017-03-08 17:34:05.0
5 B15 777 2017-03-08 20:50:12.0
6 A21 123 2017-03-09 11:00:00.0
7 C11 123 2017-03-09 11:10:00.0
8 A21 123 2017-03-09 11:12:00.0
9 A21 555 2017-03-09 11:12:10.0
10 B15 123 2017-03-09 11:14:00.0
11 C11 555 2017-03-09 11:20:00.0
I want to get the routes of cars passing through radars A21 and B15 within the same trip. For example, if the date is different for the same car_id, then it is not the same trip. Basically, I want to consider that the maximum time difference between radars A21 and B15 for the same vehicle should be 30 minutes. If it's bigger, then the trip is not the same, like for example for the car_id 777.
My final goal is to count the average number of trips per day (non-unique, so if the same car passed 2 times by the same route, then it should be calculated 2 times).
The expected result is the following one:
radar_start radar_end avg_tripscount_per_day
A21 B15 1.5
On 2017-03-08 there are 2 trips between radars A21 and B15 (car 777 is excluded by the 30-minute limit), while on 2017-03-09 there is only 1 trip. The average is (2+1)/2 = 1.5 trips per day.
How can I get this result? Basically, I do not know how to introduce 30 minutes limit in the query and how to group rides by radar_start and radar_end.
Thanks.
Update:
The trip is registered at the date it started.
If the car was triggered by radar A21 at 2017-03-08 23:55 and by radar B15 at 2017-03-09 00:15, then it should be considered as the same trip registered for the date 2017-03-08.
In case of ids 6 and 8 the same car 123 passed by A21 two times, and then it turned to B15 (id 10). The last ride with id 8 should be considered. So, 8-10. Thus, the closest previous to B15. The interpretation is that a car passed by A21 two times and the second time is turned to B15.
select count(*) / count(distinct to_date(datetime)) as trips_per_day
from (select radar_id
,datetime
,lead(radar_id) over w as next_radar_id
,lead(datetime) over w as next_datetime
from mytable
where radar_id in ('A21','B15')
window w as
(
partition by car_id
order by datetime
)
) t
where radar_id = 'A21'
and next_radar_id = 'B15'
and datetime + interval '30' minutes >= next_datetime
;
+----------------+
| trips_per_day |
+----------------+
| 1.5 |
+----------------+
P.s.
If your version does not support intervals, the last condition can be replaced by:
and to_unix_timestamp(datetime) + 30*60 >= to_unix_timestamp(next_datetime)
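To make the lead()-based pairing easier to follow, here is a rough Python sketch of the same idea, not a replacement for the Hive query. It keeps only A21/B15 events, sorts them per car, and counts an A21 event as a trip when the very next event for that car is B15 within 30 minutes. The function name avg_trips_per_day and the ROWS list are made up for illustration; the rows are copied from the question.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Rows copied from the question: (radar_id, car_id, timestamp)
ROWS = [
    ("A21", 123, "2017-03-08 17:31:19"),
    ("A21", 555, "2017-03-08 17:32:00"),
    ("A21", 777, "2017-03-08 17:33:00"),
    ("B15", 123, "2017-03-08 17:35:22"),
    ("B15", 555, "2017-03-08 17:34:05"),
    ("B15", 777, "2017-03-08 20:50:12"),
    ("A21", 123, "2017-03-09 11:00:00"),
    ("C11", 123, "2017-03-09 11:10:00"),
    ("A21", 123, "2017-03-09 11:12:00"),
    ("A21", 555, "2017-03-09 11:12:10"),
    ("B15", 123, "2017-03-09 11:14:00"),
    ("C11", 555, "2017-03-09 11:20:00"),
]


def avg_trips_per_day(rows, start="A21", end="B15", limit=timedelta(minutes=30)):
    """Hypothetical helper: average number of start->end trips per day."""
    by_car = defaultdict(list)
    for radar, car, ts in rows:
        if radar in (start, end):  # same filter as the WHERE clause in the subquery
            by_car[car].append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), radar))

    trip_days = []
    for events in by_car.values():
        events.sort()  # order by datetime within each car
        # Pair each event with the next one, like lead() over the window
        for (t1, r1), (t2, r2) in zip(events, events[1:]):
            if r1 == start and r2 == end and t2 - t1 <= limit:
                trip_days.append(t1.date())  # trip is registered on its start date

    if not trip_days:
        return 0.0
    # Trips divided by the number of distinct days that have at least one trip
    return len(trip_days) / len(set(trip_days))


print(avg_trips_per_day(ROWS))
```

Because only the immediately following event is checked, the A21 reading with id 6 is skipped (its next event is another A21) and the pair 8-10 is counted, which matches the rule in the update.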
I missed that you're using Hive and started writing a query for SQL Server, but it may still help. Try something like this:
QUERY
select radar_start,
radar_end,
convert(decimal(6,3), count(*)) / convert(decimal(6,3), count(distinct dt)) as avg_tripscount_per_day
from (
select
t1.radar_id as radar_start,
t2.radar_id as radar_end,
convert(date, t1.[datetime]) dt,
row_number() over (partition by t1.radar_id, t1.car_id, convert(date, t1.[datetime]) order by t1.[datetime] desc) rn1,
row_number() over (partition by t2.radar_id, t2.car_id, convert(date, t2.[datetime]) order by t2.[datetime] desc) rn2
from trips as t1
join trips as t2 on t1.car_id = t2.car_id
and datediff(minute,t1.[datetime], t2.[datetime]) between 0 and 30
and t1.radar_id = 'A21'
and t2.radar_id = 'B15'
)x
where rn1 = 1 and rn2 = 1
group by radar_start, radar_end
OUTPUT
radar_start radar_end avg_tripscount_per_day
A21 B15 1.5000000000
SAMPLE DATA
create table trips
(
id int,
radar_id char(3),
car_id int,
[datetime] datetime
)
insert into trips values
(1,'A21',123,'2017-03-08 17:31:19.0'),
(2,'A21',555,'2017-03-08 17:32:00.0'),
(3,'A21',777,'2017-03-08 17:33:00.0'),
(4,'B15',123,'2017-03-08 17:35:22.0'),
(5,'B15',555,'2017-03-08 17:34:05.0'),
(5,'B15',777,'2017-03-08 20:50:12.0'),
(6,'A21',123,'2017-03-09 11:00:00.0'),
(7,'C11',123,'2017-03-09 11:10:00.0'),
(8,'A21',123,'2017-03-09 11:12:00.0'),
(9,'A21',555,'2017-03-09 11:12:10.0'),
(8,'B15',123,'2017-03-09 11:14:00.0'),
(9,'C11',555,'2017-03-09 11:20:00.0')
Related
The question I am trying to answer is: how can I return the correct order and sequence of weeks for each ID? For example, while the first date for each ID will always be in week 1 (it's the first week in the series), the following date in the series may also fall within the first week (so it should return 1 again), or it may fall in the 3rd week (so it should return 3).
The code I've written so far is:
select distinct
row_number() over (partition by ID order by date) row_nums
,ID
,date
from table_a
Which simply returns the running tally of dates by ID, and doesn't take into account what week number that date falls in.
But what I'm looking for is this:
Here's some setup code to assist:
CREATE TABLE random_table
(
ID VarChar(50),
date DATETIME
);
-- Date literals must be quoted; an unquoted 5/14/2021 is evaluated as integer division.
INSERT INTO random_table
VALUES
('AAA','5/14/2021'),
('AAA','6/2/2021'),
('AAA','7/9/2021'),
('BBB','5/25/2021'),
('CCC','12/2/2020'),
('CCC','12/6/2020'),
('CCC','12/10/2020'),
('CCC','12/14/2020'),
('CCC','12/18/2020'),
('CCC','12/22/2020'),
('CCC','12/26/2020'),
('CCC','12/30/2020'),
('CCC','1/3/2021'),
('DDD','1/7/2021'),
('DDD','1/11/2021');
with adj as (
select *, dateadd(day, -1, "date") as adj_dt
from table_a
)
select
datediff(week,
min(adj_dt) over (partition by id),
adj_dt) + 1 as week_logic,
id, "date"
from adj
This assumes that your idea of weeks corresponds with @@datefirst set as Sunday. For a Sunday-to-Saturday definition you would find 12/06/2020 and 12/10/2020 in the same week, so presumably you want something like a Monday start instead (which also seems to line up with the numbering for 12/02/2020, 12/14/2020 and 12/18/2020). I'm compensating by sliding backward a day in the weeks calculation. That step could be handled inline without a CTE, but perhaps it illustrates the approach more clearly.
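As a rough cross-check of the Monday-start interpretation, the same numbering can be sketched in Python: snap every date to the Monday of its week and count whole weeks from the first date's Monday. The function name week_numbers and the CCC list are illustrative only, and the Monday start is the assumption discussed above, not something the question confirms.

```python
from datetime import date, timedelta


def week_numbers(dates):
    """Hypothetical helper: 1-based week index relative to the earliest date,
    with weeks assumed to start on Monday."""
    def week_start(d):
        # date.weekday() returns 0 for Monday, so this snaps back to Monday
        return d - timedelta(days=d.weekday())

    first = week_start(min(dates))
    return [(week_start(d) - first).days // 7 + 1 for d in dates]


# Dates for ID 'CCC' from the setup code
CCC = [
    date(2020, 12, 2), date(2020, 12, 6), date(2020, 12, 10),
    date(2020, 12, 14), date(2020, 12, 18), date(2020, 12, 22),
    date(2020, 12, 26), date(2020, 12, 30), date(2021, 1, 3),
]
print(week_numbers(CCC))
```

Counting from the first date's week also avoids the year-boundary problem that appears when week-of-year numbers are subtracted directly.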
Your objective isn't clear but I think you would benefit from a Tally-Table of the weeks and then LEFT JOIN to your source data.
This will give you a row for each week AND source data if it exists
SELECT
CASE WHEN ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [date])=1 THEN 1
ELSE DATEPART(WK, (DATE) ) - DATEPART(WK, FIRST_VALUE([DATE]) OVER (PARTITION BY ID ORDER BY [date])) END PD,
ID,
CONVERT(VARCHAR(10), [date],120)
FROM random_table rt
ORDER BY ID,[date]
DBFIDDLE
output:
PD   ID   (No column name)
1    AAA  2021-05-14
3    AAA  2021-06-02
8    AAA  2021-07-09
1    BBB  2021-05-25
1    CCC  2020-12-02
1    CCC  2020-12-06
1    CCC  2020-12-10
2    CCC  2020-12-14
2    CCC  2020-12-18
3    CCC  2020-12-22
3    CCC  2020-12-26
4    CCC  2020-12-30
-47  CCC  2021-01-03
1    DDD  2021-01-07
1    DDD  2021-01-11
Dates are in the format YYYY-MM-DD.
I will leave the -47 in here, so you can fix it yourself (as an exercise) 😁😉
I have following table, which apart from other attributes contains:
Customer ID - unique identifier
Value
CreatedDate - when the record has been created (based on ETL)
UpdatedDate - until when the record has been valid
Since there are other attributes apart from the [Value], which are being tracked for historical values, there might be cases, where there are multiple rows with the same [Value] for the same customer, but different timestamps in [CreatedDate] / [UpdatedDate]. Thus, the data may look like:
Customer ID  Value  CreatedDate       UpdatedDate
1            111    04/08/2021 15:00  04/08/2021 17:00
1            111    01/08/2021 09:00  04/08/2021 15:00
1            222    20/07/2021 01:30  01/08/2021 09:00
1            222    01/06/2021 08:00  20/07/2021 01:30
1            111    01/04/2021 07:15  01/06/2021 08:00
2            333    03/08/2021 04:30  04/08/2021 17:00
2            444    23/07/2021 01:20  03/08/2021 04:30
2            444    01/04/2021 13:50  23/07/2021 01:20
I would like to keep the unique [Value]s in the correct sequence, keeping each [Value] from its earliest [CreatedDate]. However, if a customer originally had Value1, then changed to Value2, and finally changed back to Value1, I would like to keep both periods of Value1 as separate rows. The ideal output should look like:
Customer ID  Value  CreatedDate       UpdatedDate
1            111    01/08/2021 09:00  04/08/2021 17:00
1            222    01/06/2021 08:00  01/08/2021 09:00
1            111    01/04/2021 07:15  01/06/2021 08:00
2            333    03/08/2021 04:30  04/08/2021 17:00
2            444    01/04/2021 13:50  03/08/2021 04:30
In other words: based on CreatedDate / UpdatedDate, identify the chronological sequence of changes, and for each run of the same value take the earliest CreatedDate and the latest UpdatedDate. If a particular value appears multiple times but is interrupted by a different value, I would like to keep each occurrence.
I've tried the approach below. It works in the simple case, but not for the scenario above. The query and its output:
SELECT [Customer ID]
,Value
,MIN(CreatedDate) as CreatedDate
,MAX(UpdatedDate) as UpdatedDate
FROM #History
GROUP BY [Customer ID], Value
Customer ID  Value  CreatedDate       UpdatedDate
1            111    01/04/2021 07:15  04/08/2021 17:00
1            222    01/06/2021 08:00  01/08/2021 09:00
2            333    03/08/2021 04:30  04/08/2021 17:00
2            444    01/04/2021 13:50  03/08/2021 04:30
Any ideas, please? I've tried using LAG and LEAD as well, but was not able to make it work either.
This is a type of gaps-and-island problem that is probably best solved by looking for overlaps using a cumulative maximum:
select customerid, value, min(createddate) as createddate, max(updateddate) as updateddate
from (select t.*,
sum(case when prev_updatedate >= createddate then 0 else 1 end) over (partition by customerid, value order by createddate) as grp
from (select h.*,
max(updateddate) over (partition by customerid, value order by createddate rows between unbounded preceding and 1 preceding) as prev_updatedate
from #history h
) h
) h
group by customerid, value, grp;
The logic is to look at the latest updateddate among the earlier rows for the same customer and value. If this is earlier than the row's createddate (or there is no earlier row), then the row starts a new group.
The final result is just aggregating the rows in each group.
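The cumulative-maximum idea can also be sketched outside SQL. The Python below is only an illustration under the assumption that dates are in dd/mm/yyyy format; the names collapse, parse and HISTORY are made up for this example. Within each (customer, value) pair it walks the rows in createddate order, tracks the running maximum of updateddate, and starts a new island whenever that maximum is earlier than the current createddate.

```python
from datetime import datetime
from itertools import groupby


def parse(text):
    # Assumption: the sample dates are day/month/year
    return datetime.strptime(text, "%d/%m/%Y %H:%M")


# Sample rows from the question: (customer_id, value, created, updated)
HISTORY = [
    (1, 111, "04/08/2021 15:00", "04/08/2021 17:00"),
    (1, 111, "01/08/2021 09:00", "04/08/2021 15:00"),
    (1, 222, "20/07/2021 01:30", "01/08/2021 09:00"),
    (1, 222, "01/06/2021 08:00", "20/07/2021 01:30"),
    (1, 111, "01/04/2021 07:15", "01/06/2021 08:00"),
    (2, 333, "03/08/2021 04:30", "04/08/2021 17:00"),
    (2, 444, "23/07/2021 01:20", "03/08/2021 04:30"),
    (2, 444, "01/04/2021 13:50", "23/07/2021 01:20"),
]


def collapse(history):
    """Hypothetical helper: merge touching or overlapping periods per (customer, value)."""
    # Sorting the tuples orders rows by customer, value, then created date
    rows = sorted((c, v, parse(cr), parse(up)) for c, v, cr, up in history)
    islands = []
    for _, group in groupby(rows, key=lambda r: (r[0], r[1])):
        running_max = None  # latest updated date seen so far in this partition
        for cust, value, created, updated in group:
            if running_max is None or running_max < created:
                islands.append([cust, value, created, updated])  # gap: new island
                running_max = updated
            else:
                running_max = max(running_max, updated)
                islands[-1][3] = running_max  # extend the current island
    return [tuple(island) for island in islands]


for row in collapse(HISTORY):
    print(row)
```

On the sample data this yields five rows, with value 111 for customer 1 appearing twice because the 222 period sits between its two runs.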
Hi, I would like to write a select expression using case or if/else. The logic seems simple, but I can't get it to work. I am joining two tables: the first is the customer record, with a date filter called min_del_date, and the second is the model scoring table, with BIN and update_date columns.
There are two logics I want to display
Picking the model score that was the month before min_del_date
If model score month before delivery is greater than 50 (Bin > 50) then pick the model score for same month as min_del_date
My 1st logic code is below
with cust as (
select
distinct cust_no, max(del_date) as del_date, min(del_date) as min_del_date, (EXTRACT(YEAR FROM min(del_date)) -1900)*12 + EXTRACT(MONTH FROM min(del_date)) AS upd_seq
from customer.cust_history
group by 1
)
,model as (
select party_id, model_id, update_date, upd_seq, bin, var_data8, var_data2
from
(
select
party_id, update_date, bin, var_data8, var_data2,
(EXTRACT(YEAR FROM UPDATE_DATE) -1900)*12 + EXTRACT(MONTH FROM UPDATE_DATE) AS upd_seq,
dense_Rank() over (partition by (EXTRACT(YEAR FROM UPDATE_DATE) -1900)*12 + EXTRACT(MONTH FROM UPDATE_DATE) order by update_date desc) as rank1
from
(
select party_id,update_date, bin, var_data8, var_data2
from model.rpm_model
group by party_id,update_date, bin, var_data8, var_data2
) model
)model_final
where rank1 = 1
)
-- Add model scores
-- 1st logic Picking the model score that was the month before delivery date
select *
from
(
select cust.cust_no, cust.del_date, cust.min_del_date, model.upd_seq, model.bin
from cust
left join model
on cust.cust_no = model.party_id
and cust.upd_seq = model.upd_seq + 1
)a
Now I am struggling to add the 2nd logic to the same query. Any assistance would be appreciated.
cust table
cust_no  min_del_date  upd_seq
123      2021-01-11    1453
234      2020-06-29    1446
456      2020-07-20    1447
model table
party_id  update_date  upd_seq  BIN
123       2020-11-30   1451     22
123       2020-12-25   1452     54
123       2020-01-11   1453     14
234       2020-05-23   1445     76
234       2020-06-18   1446     48
234       2020-07-23   1447     12
456       2020-06-18   1446     23
456       2020-07-23   1447     39
456       2020-08-21   1448     21
desired results
cust_no  min_del_date  model.upd_seq  update_date  BIN
123      2021-01-11    1453           2020-01-11   14
234      2020-06-29    1446           2020-06-18   48
456      2020-07-20    1446           2020-06-18   23
Update
I managed to find the solution myself; thanks to everyone who looked at this question. The solution is below:
select a.cust_no, a.del_date, a.min_del_date, b.update_date, b.upd_seq, b.bin
from
(
select cust.cust_no, cust.del_date, cust.min_del_date,
CASE WHEN model.BIN <=50 THEN model.upd_seq WHEN BIN > 50 THEN model.upd_seq +1 ELSE NULL END as upd_seq
from cust
inner join model
on cust.cust_no = model.party_id
and cust.upd_seq = model.upd_seq + 1
)a
inner join model b
on a.cust_no = b.party_id
and a.upd_seq = b.upd_seq
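The two-step lookup in that query can be restated as a small Python sketch, which may make the CASE logic easier to verify against the sample tables. The names CUST, MODEL and pick_score are invented for this illustration, and the dictionaries simply hold the sample rows keyed by upd_seq.

```python
# cust_no -> upd_seq of min_del_date (from the sample cust table)
CUST = {123: 1453, 234: 1446, 456: 1447}

# (party_id, upd_seq) -> BIN (from the sample model table)
MODEL = {
    (123, 1451): 22, (123, 1452): 54, (123, 1453): 14,
    (234, 1445): 76, (234, 1446): 48, (234, 1447): 12,
    (456, 1446): 23, (456, 1447): 39, (456, 1448): 21,
}


def pick_score(cust_no, cust_seq, model):
    """Hypothetical helper: return (upd_seq, bin) chosen by the two rules, or None."""
    # Rule 1: start from the score of the month before min_del_date
    prev_bin = model.get((cust_no, cust_seq - 1))
    if prev_bin is None:
        return None  # mirrors the inner join: no previous-month score, no row
    # Rule 2: if that score is above 50, use the same month as min_del_date
    chosen_seq = cust_seq - 1 if prev_bin <= 50 else cust_seq
    chosen_bin = model.get((cust_no, chosen_seq))
    return None if chosen_bin is None else (chosen_seq, chosen_bin)


for cust_no, seq in CUST.items():
    print(cust_no, pick_score(cust_no, seq, MODEL))
```

The first lookup plays the role of the inner subquery, and the second lookup corresponds to the final join back to model on the adjusted upd_seq.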
I'm trying to add new columns of first values of the day for location and weight.
For instance, the original data format is:
id dttm location weight
--------------------------------------------
1 1/1/20 11:10:00 A 40
1 1/1/20 19:07:00 B 41.1
2 1/1/20 08:01:00 B 73.2
2 1/1/20 21:00:00 B 73.2
2 1/2/20 10:03:00 C 74
I want each id to have only one day record, such as:
id dttm location weight
--------------------------------------------
1 1/1/20 11:10:00 A 40
2 1/1/20 08:01:00 B 73.2
2 1/2/20 10:03:00 C 74
I have other columns in my data set that I'm using location and weight to create, so I don't think I can just filter for 'first' records of the day. Is it possible to write a query that recognizes the first record of the day for those two columns and creates new columns with those values?
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by id, dttm::date order by dttm) as seqnum
from t
) t
where seqnum = 1;
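For comparison, here is a minimal Python sketch of the same "first row per id per day" selection. It assumes the dttm strings are month/day/two-digit-year, which the question does not state; first_per_day and ROWS are illustrative names.

```python
from datetime import datetime

# Sample rows from the question: (id, dttm, location, weight)
ROWS = [
    (1, "1/1/20 11:10:00", "A", 40.0),
    (1, "1/1/20 19:07:00", "B", 41.1),
    (2, "1/1/20 08:01:00", "B", 73.2),
    (2, "1/1/20 21:00:00", "B", 73.2),
    (2, "1/2/20 10:03:00", "C", 74.0),
]


def first_per_day(rows):
    """Hypothetical helper: keep the earliest row for each (id, calendar day)."""
    first = {}
    for row in rows:
        # Assumption: month/day/year ordering in the timestamp text
        ts = datetime.strptime(row[1], "%m/%d/%y %H:%M:%S")
        key = (row[0], ts.date())  # same partition as row_number()
        if key not in first or ts < first[key][0]:
            first[key] = (ts, row)
    # Return rows ordered by id, then day
    return [first[key][1] for key in sorted(first)]


for row in first_per_day(ROWS):
    print(row)
```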
How to use a date range using window function in SQL Server?
I have this table:
id date item
----------------------
123 07/01/2018 anf
123 31/12/2017 sh
123 01/01/2018 ab
123 12/03/2018 fhy
123 02/01/2018 fg
124 10/12/2017 ab
124 03/03/2017 sh
125 21/11/2017 ab
125 31/12/2017 sh
125 01/03/2017 ab
126 31/12/2017 ab
For each id, I want all the rows from its latest date back over the previous 30 days. My data has missing dates, so I cannot use a ROWS frame in the window (OVER ... PARTITION BY ... ROWS).
I need something like a date-range frame in a window function, but that is not supported in SQL Server.
SELECT * FROM YourTable
WHERE DATEDIFF(DAY,date,GETDATE())<=30
I want all the information of ids from the latest date to the previous 30 days.
Your question is unclear on what you actually want. If you mean the latest date in the data, then you can use:
select . . .
from (select t.*, max(date) over (partition by id) as max_date
from t
) t
where date > dateadd(day, -30, max_date);
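To show what the max-per-id filter selects on the sample data, here is a small Python sketch of the same logic. It assumes the dates are day/month/year (31/12/2017 only parses that way); last_30_days_per_id and ROWS are names invented for this example.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (id, date, item)
ROWS = [
    (123, "07/01/2018", "anf"), (123, "31/12/2017", "sh"),
    (123, "01/01/2018", "ab"), (123, "12/03/2018", "fhy"),
    (123, "02/01/2018", "fg"), (124, "10/12/2017", "ab"),
    (124, "03/03/2017", "sh"), (125, "21/11/2017", "ab"),
    (125, "31/12/2017", "sh"), (125, "01/03/2017", "ab"),
    (126, "31/12/2017", "ab"),
]


def last_30_days_per_id(rows):
    """Hypothetical helper: rows within 30 days before each id's latest date."""
    parsed = [(i, datetime.strptime(d, "%d/%m/%Y").date(), item) for i, d, item in rows]
    latest = {}
    for i, d, _ in parsed:
        latest[i] = max(latest.get(i, d), d)  # max(date) over (partition by id)
    # Same comparison as: date > dateadd(day, -30, max_date)
    return [(i, d, item) for i, d, item in parsed if d > latest[i] - timedelta(days=30)]


for row in last_30_days_per_id(ROWS):
    print(row)
```

On this sample only the latest row of each id survives, because no other row falls within 30 days of its id's latest date.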