HIVE - compute statistics over partitions with window based on date - sql

I've seen solutions for problems similar to mine, but none quite work for me. Also I'm confident that there should be a way to make it work.
Given a table with
ID  Date        target
--  ----------  ------
1   2020-01-01  1
1   2020-01-02  1
1   2020-01-03  0
1   2020-01-04  1
1   2020-01-04  0
1   2020-06-01  1
1   2020-06-02  1
1   2020-06-03  0
1   2020-06-04  1
1   2020-06-04  0
2   2020-01-01  1
ID is BIGINT, target is INT, and Date is DATE.
I want to compute, for each ID/Date, the sum of target and the number of rows for the same ID in the 3 months and in the 12 months before the Date (inclusive). Example of output:
ID  Date        Sum_3  Count_3  Sum_12  Count_12
--  ----------  -----  -------  ------  --------
1   2020-01-01  1      1        1       1
1   2020-01-02  2      2        2       2
1   2020-01-03  2      3        2       3
1   2020-01-04  3      5        3       5
1   2020-06-01  1      1        4       6
1   2020-06-02  2      2        5       7
1   2020-06-03  2      3        6       8
1   2020-06-04  3      5        7       10
2   2020-01-01  1      1        1       1
How can I get this type of result in HIVE?
I'm not sure whether I should use analytic functions (and how), GROUP BY, etc.
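For reproducibility, here is a minimal sketch of the sample data as a Hive table. The table name mytable is an assumption chosen to line up with the queries in the answers below; `date` is backticked because it is a keyword in recent Hive versions (the answers reference it unquoted, which may or may not need backticks depending on your Hive settings), and INSERT ... VALUES needs Hive 0.14+:

create table mytable (id bigint, `date` date, target int);

insert into mytable values
  (1, date '2020-01-01', 1), (1, date '2020-01-02', 1), (1, date '2020-01-03', 0),
  (1, date '2020-01-04', 1), (1, date '2020-01-04', 0), (1, date '2020-06-01', 1),
  (1, date '2020-06-02', 1), (1, date '2020-06-03', 0), (1, date '2020-06-04', 1),
  (1, date '2020-06-04', 0), (2, date '2020-01-01', 1);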

If you can live with an approximation of months as a number of days, then you can use window functions in Hive:
select id, date,
count(*) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding -- 90 days
) as count_3,
sum(target) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding
) as sum_3,
count(*) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding -- 360 days
) as count_12,
sum(target) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding
) as sum_12
from mytable
You can aggregate in the same query:
select id, date,
sum(count(*)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding -- 90 days
) as count_3,
sum(sum(target)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding
) as sum_3,
sum(count(*)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding -- 360 days
) as count_12,
sum(sum(target)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding
) as sum_12
from mytable
group by id, date, unix_timestamp(date)

If an approximation of the interval is acceptable (1 month = 30 days), you can build on GMB's answer:
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select ID, Date,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) sum_3,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) count_3,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) sum_12,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) count_12
from t
Or, if you want exact intervals, you can use self joins (but these are expensive):
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select
t_3month.ID,
t_3month.Date,
t_3month.sum_3,
t_3month.count_3,
sum(t3.target) sum_12,
sum(t3.c_target) count_12
from (
select
t1.ID,
t1.Date,
sum(t2.target) sum_3,
sum(t2.c_target) count_3
from t t1
left join t t2
on t2.Date > t1.Date - interval 3 month and
t2.Date <= t1.Date and
t1.ID = t2.ID
group by t1.ID, t1.Date
) t_3month
left join t t3
on t3.Date > t_3month.Date - interval 12 month and
t3.Date <= t_3month.Date and
t_3month.ID = t3.ID
group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
order by ID, Date;

Related

Query for the longest duration of consecutive TRUE [duplicate]

I have the following table in SQL Server. I would like to find the longest duration for which the machine was running.
Row  DateTime         Machine On
---  ---------------  ----------
1    9/22/2022 8:20   1
2    9/22/2022 9:10   0
3    9/22/2022 10:40  1
4    9/22/2022 10:52  0
5    9/22/2022 12:30  1
6    9/22/2022 14:30  0
7    9/22/2022 15:00  1
8    9/22/2022 15:40  0
9    9/22/2022 16:25  1
10   9/22/2022 16:55  0
In the example above, the longest duration for which the machine is ON is 2 hours, using rows 5 and 6. What would be the best SQL statement to return the longest such duration for a given time range?
Desired Result:
60 minutes
I have looked into the LAG Function and the LEAD Function in SQL.
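Since you mention LAG/LEAD, here is a minimal sketch of that approach. It assumes the table is called dbo.Machine_Status with the [DateTime] and [Machine On] columns from the question, and that every ON row is immediately followed by its matching OFF row, as in the sample data:

SELECT MAX(on_minutes) AS longest_on_minutes
FROM (
    SELECT [Machine On],
           -- minutes from this reading to the next one
           DATEDIFF(MINUTE, [DateTime],
                    LEAD([DateTime]) OVER (ORDER BY [DateTime])) AS on_minutes
    FROM dbo.Machine_Status
) AS s
WHERE [Machine On] = 1;   -- keep only ON rows, so each span is ON -> next OFF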
Here's another way that uses traditional gaps & islands methodology:
WITH src AS
(
SELECT Island, mint = MIN([Timestamp]), maxt = MAX([Timestamp])
FROM
(
SELECT [Timestamp], Island =
ROW_NUMBER() OVER (ORDER BY [Timestamp]) -
ROW_NUMBER() OVER (PARTITION BY Running ORDER BY [Timestamp])
FROM dbo.Machine_Status
) AS x GROUP BY Island
)
SELECT TOP (1) delta =
(DATEDIFF(second, mint, LEAD(mint,1) OVER (ORDER BY island)))
FROM src ORDER BY delta DESC;
Example db<>fiddle based on the sample data in your new duplicate.
If this is really your data, you can simply use INNER JOIN and DATEDIFF:
SELECT MAX(DATEDIFF(MINUTE, T1.[DateTime], T2.[DateTime]))
FROM [my_table] T1
INNER JOIN [my_table] T2
ON T1.[Row] + 1 = T2.[Row];
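A possible refinement of the above, in case the data can also contain long OFF periods: restricting T1 to ON rows keeps OFF-to-ON gaps out of the maximum. This is an assumption about the intent rather than part of the original answer, and it uses the [Machine On] column name from the question:

SELECT MAX(DATEDIFF(MINUTE, T1.[DateTime], T2.[DateTime]))
FROM [my_table] T1
INNER JOIN [my_table] T2
  ON T1.[Row] + 1 = T2.[Row]
WHERE T1.[Machine On] = 1;  -- measure only ON -> next reading spans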
This is a gaps and islands problem. One option is to use a running sum that increases by 1 whenever machine_on = 0; this defines a unique group for each run of consecutive 1s followed by a 0.
select top 1 datediff(minute, min([datetime]), max([datetime])) duration
from
(
select *,
sum(case when machine_on = 0 then 1 else 0 end) over (order by datetime desc) grp
from table_name
) T
group by grp
order by datediff(minute, min([datetime]), max([datetime])) desc
See demo
This is a classic gaps and islands problem, with a little twist: the Adj column.
Example
Select Top 1
Row1 = min(row)
,Row2 = max(row)+1
,TS1 = min(TimeStamp)
,TS2 = dateadd(SECOND,max(Adj),max(TimeStamp))
,Dur = datediff(Second,min(TimeStamp),max(TimeStamp)) + max(Adj)
From (
Select *
,Grp = row_number() over( partition by Running order by TimeStamp) - row_number() over (order by timeStamp)
,Adj = case when Running=1 and lead(Running,1) over (order by timestamp) = 0 then datediff(second,TimeStamp,lead(TimeStamp,1) over (order by TimeStamp) ) else 0 end
From Machine_Status
) A
Where Running=1
Group By Grp
Order By Dur Desc
Results
Row1 Row2 TS1 TS2 Dur
8 12 2023-01-10 08:25:30.000 2023-01-10 08:28:55.000 205

oracle SQL counter restarts when time difference is > x

I want to create a new column in my query that takes into account the difference between the current row's datetime and the previous row's datetime. This column could be a counter: if the difference is < -100 it stays at 1, but once the difference is > -100 the column becomes 0.
Ideally I would then want to pull in only the rows that come after the last 0 record.
My query:
with products as (
select * from (
select distinct
ID,
UnixDateTime,
OrderNumber,
to_date('1970-01-01','YYYY-MM-DD') + numtodsinterval(UnixDateTime,'SECOND')+ 1/24 as "Date_Time"
from DB
where
(date '1970-01-01' + UnixDateTime * interval '1' second) + interval '1' hour
> sysdate - interval '2' day
)
),
prod_prev AS (
SELECT p.*,
lag("Date_Time")over(order by "Date_Time" ASC) as Previous_Time,
lag(UnixDateTime) over(order by "Date_Time" ASC) as UnixDateTime_Previous_Time
FROM products p
),
run_sum AS (
SELECT p.*, "Date_Time"-Previous_Time as "Diff", UnixDateTime_Previous_Time-UnixDateTime AS "UnixDateTime_Diff"
FROM prod_prev p
)
SELECT * FROM run_sum
ORDER By UnixDateTime, "Date_Time" DESC
The result of the query above:
ID  UnixDateTime  OrderNumber  Date_Time             Previous_Time         diff      UnixDateTime_Diff
--  ------------  -----------  --------------------  --------------------  --------  -----------------
1   1662615688    100          08-SEP-2022 06:41:28  (null)                (null)    (null)
2   1662615752    100          08-SEP-2022 06:42:32  08-SEP-2022 06:41:28  0.00074   -64
3   1662615765    100          08-SEP-2022 06:42:45  08-SEP-2022 06:42:32  0.000150  -13
4   1662615859    100          08-SEP-2022 06:44:19  08-SEP-2022 06:42:45  0.001088  -128
5   1662615987    100          08-SEP-2022 06:46:27  08-SEP-2022 06:44:19  0.00148   -44
6   1662616031    100          08-SEP-2022 06:47:11  08-SEP-2022 06:46:27  0.00051   -36
The counter in the example below should be 1 if UnixDateTime_Diff is < -100 and 0 if it is > -100.
Then I would like to pull in only the records AFTER the most recent 0 record.
You use:
lag("Date_Time")over(order by "Date_Time" DESC)
This gets the previous value when the values are ordered in DESCending order, i.e. the previous higher value. If you want the previous lower value then use either:
lag("Date_Time") over (order by "Date_Time" ASC)
or
lead("Date_Time") over (order by "Date_Time" DESC)
If you want to perform row-by-row processing then, from Oracle 12, you can use MATCH_RECOGNIZE:
SELECT id,
unixdatetime,
ordernumber,
date_time,
next_unixdatetime,
next_unixdatetime - unixdatetime AS diff,
CASE cls
WHEN 'WITHIN_100' THEN 1
ELSE 0
END AS within_100
from (
select distinct
ID,
UnixDateTime,
OrderNumber,
TIMESTAMP '1970-01-01 00:00:00 UTC' + UnixDateTime * INTERVAL '1' SECOND
AS Date_Time
from DB
where TIMESTAMP '1970-01-01 00:00:00 UTC' + UnixDateTime * INTERVAL '1' SECOND
> SYSTIMESTAMP - INTERVAL '2' DAY
)
MATCH_RECOGNIZE(
ORDER BY unixdatetime
MEASURES
NEXT(unixdatetime) AS next_unixdatetime,
classifier() AS cls
ALL ROWS PER MATCH
PATTERN (within_100* any_row)
DEFINE
within_100 AS NEXT(unixdatetime) < unixdatetime + 100
) m
Which, for the sample data:
CREATE TABLE db (ID, UnixDateTime, OrderNumber) AS
SELECT 1, 1662615688, 100 FROM DUAL UNION ALL
SELECT 2, 1662615752, 100 FROM DUAL UNION ALL
SELECT 3, 1662615765, 100 FROM DUAL UNION ALL
SELECT 4, 1662615859, 100 FROM DUAL UNION ALL
SELECT 5, 1662615987, 100 FROM DUAL UNION ALL
SELECT 6, 1662616031, 100 FROM DUAL;
Outputs:
ID
UNIXDATETIME
ORDERNUMBER
DATE_TIME
NEXT_UNIXDATETIME
DIFF
WITHIN_100
1
1662615688
100
2022-09-08 05:41:28.000000000 UTC
1662615752
64
1
2
1662615752
100
2022-09-08 05:42:32.000000000 UTC
1662615765
13
1
3
1662615765
100
2022-09-08 05:42:45.000000000 UTC
1662615859
94
1
4
1662615859
100
2022-09-08 05:44:19.000000000 UTC
1662615987
128
0
5
1662615987
100
2022-09-08 05:46:27.000000000 UTC
1662616031
44
1
6
1662616031
100
2022-09-08 05:47:11.000000000 UTC
null
null
0
fiddle
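The question also asks to keep only the rows after the most recent 0. A hedged sketch of one way to do that, using the same within_100 definition as above but with plain analytic functions (the CTE name flagged is an assumption, and note that the result is empty whenever the very last reading is itself a 0; if no 0 exists at all, the scalar subquery returns NULL and nothing is returned, so wrap it in NVL if needed):

WITH flagged AS (
  SELECT id, unixdatetime, ordernumber,
         CASE
           WHEN LEAD(unixdatetime) OVER (ORDER BY unixdatetime)
                < unixdatetime + 100
           THEN 1 ELSE 0
         END AS within_100
  FROM db
)
SELECT *
FROM flagged
WHERE unixdatetime > (SELECT MAX(unixdatetime) FROM flagged WHERE within_100 = 0);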

How to create a start and end date with no gaps from one date column and to sum a value within the dates

I am new to SQL coding, using SQL Developer.
I have a table with 4 columns: patient ID (ptid), service date (dt), insurance payment amount (insr_amt), and out-of-pocket payment amount (op_amt); see Table 1 below.
What I would like to do is (1) create two columns, "start_dt" and "end_dt", from the "dt" column: if there are no gaps in the dates for a patient ID, populate them with that patient's first and last date, but if there is a gap in the service dates within a patient ID, create separate start/end date rows for that patient ID; and (2) sum the two payment amounts by patient ID within each set of start and end date visits (see Table 2 below).
What would be the way to do this with SQL in SQL Developer?
Thank you!
Table 1:
Ptid  dt        insr_amt  op_amt
----  --------  --------  ------
A     1/1/2021  30        20
A     1/2/2021  30        10
A     1/3/2021  30        10
A     1/4/2021  30        30
B     1/6/2021  10        10
B     1/7/2021  20        10
C     2/1/2021  15        30
C     2/2/2021  15        30
C     2/6/2021  60        30
Table 2:
Ptid  start_dt  end_dt    total_insr_amt  total_op_amt
----  --------  --------  --------------  ------------
A     1/1/2021  1/4/2021  120             70
B     1/6/2021  1/7/2021  30              20
C     2/1/2021  2/2/2021  30              60
C     2/6/2021  2/6/2021  60              30
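For testing, here is a minimal sketch of Table 1 as a PostgreSQL table; the table name t is an assumption chosen to match the queries below:

create table t (ptid text, dt date, insr_amt int, op_amt int);

insert into t (ptid, dt, insr_amt, op_amt) values
  ('A', date '2021-01-01', 30, 20),
  ('A', date '2021-01-02', 30, 10),
  ('A', date '2021-01-03', 30, 10),
  ('A', date '2021-01-04', 30, 30),
  ('B', date '2021-01-06', 10, 10),
  ('B', date '2021-01-07', 20, 10),
  ('C', date '2021-02-01', 15, 30),
  ('C', date '2021-02-02', 15, 30),
  ('C', date '2021-02-06', 60, 30);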
You didn't mention the specific database, so this solution is for PostgreSQL. You can do:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select *,
sum(inc) over(partition by ptid order by dt) as grp
from (
select *,
case when dt - interval '1 day' = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
Result:
ptid start_dt end_dt total_insr_amt total_op_amt
----- ---------- ---------- -------------- -----------
A 2021-01-01 2021-01-04 120 70
B 2021-01-06 2021-01-07 30 20
C 2021-02-01 2021-02-02 30 60
C 2021-02-06 2021-02-06 60 30
See running example at DB Fiddle 1.
EDIT for Oracle
As requested, the modified query that works in Oracle is:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select x.*,
sum(inc) over(partition by ptid order by dt) as grp
from (
select t.*,
case when dt - 1 = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
See running example at db<>fiddle 2.

Project data and cumulative sum forward

I am trying to push the last value of a cumulative dataset forward to present time.
Initialise test data:
drop table if exists test_table;
create table test_table
as select data_date::date, floor(random() * 10) as data_value
from
generate_series('2021-08-25'::date, '2021-08-31'::date, '1 day') data_date;
The above test data produces something like this:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
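(The script above only creates data_date and data_value; the cumulative_value column shown here comes from a running sum, e.g. a query along these lines:)

select data_date,
       data_value,
       sum(data_value) over (order by data_date) as cumulative_value
from test_table
order by data_date;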
What I wish to do is push the last data value (2021-08-31, 7) forward to the present time. For example, if today's date were 2021-09-03, I would want the result to be something like:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
2021-09-01 7 41
2021-09-02 7 48
2021-09-03 7 55
You need to get the value of the last date in the table. A common table expression is a good way to do that:
with cte as (
select data_value as last_val
from test_table
order by data_date desc
limit 1)
select
gen_date::date as data_date,
coalesce(data_value, last_val) as data_value,
sum(coalesce(data_value, last_val)) over (order by gen_date) as cumulative_sum
from generate_series('2021-08-25'::date, '2021-09-03', '1 day') as gen_date
left join test_table on gen_date = data_date
cross join cte
Test it in db<>fiddle.
You may use a union and a scalar subquery to find the latest value of data_value for the new rows. cumulative_value is re-evaluated.
select *, sum(data_value) over (rows between unbounded preceding and current row) as cumulative_value
from
(
select data_date, data_value from test_table
UNION all
select rd, (select data_value from test_table where data_date = '2021-08-31')
from generate_series('2021-09-01'::date, '2021-09-03', '1 day') rd
) t
order by data_date;
And here it is a bit smarter, without fixed date literals.
with cte(latest_date) as (select max(data_date) from test_table)
select *, sum(data_value) over (rows between unbounded preceding and current row) as cumulative_value
from
(
select data_date, data_value from test_table
UNION ALL
select rd::date, (select data_value from test_table, cte where data_date = latest_date)
from generate_series((select latest_date from cte) + 1, CURRENT_DATE, '1 day') rd
) t
order by data_date;
SQL Fiddle here.

How to get values from the previous row?

I have a table like this:
ID  NUMBER  TIMESTAMP
--  ------  -------------------
1   1       05/28/2020 09:00:00
2   2       05/29/2020 10:00:00
3   1       05/31/2020 21:00:00
4   1       06/01/2020 21:00:00
And I want to show data like this:
ID  NUMBER  TIMESTAMP            RANGE
--  ------  -------------------  --------
1   1       05/28/2020 09:00:00  0 Days
2   2       05/29/2020 10:00:00  0 Days
3   1       05/31/2020 21:00:00  3,5 Days
4   1       06/01/2020 21:00:00  1 Days
So it takes 3,5 days to process number 1.
I tried:
select a.id, a.number, a.timestamp, ((a.timestamp-b.timestamp)/24) as days
from my_table a
left join (select number,timestamp from my_table) b
on a.number=b.number
It didn't work as expected. How do I do this properly?
Use the window function lag().
With standard interval output:
SELECT *, timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)
FROM tbl
ORDER BY id;
If you need decimal number like in your example:
SELECT *, round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2) || ' days'
FROM tbl
ORDER BY id;
If you also need to display '0 days' instead of NULL like in your example:
SELECT *, COALESCE(round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2), 0) || ' days'
FROM tbl
ORDER BY id;
db<>fiddle here