Subtract / loop through rows in HIVE query - sql

I have data in a table like below:
ID status timestamp
ABC login 1/1/2020 12:00
ABC lock 1/1/2020 13:19
ABC unlock 1/1/2020 13:52
ABC Disconnect 1/1/2020 15:52
ABC Reconnect 1/1/2020 15:55
ABC lock 1/1/2020 16:25
ABC unlock 1/1/2020 16:30
ABC logoff 1/1/2020 17:00
ABC login 2/1/2020 12:00
ABC lock 2/1/2020 13:19
ABC unlock 2/1/2020 13:52
ABC lock 2/1/2020 16:22
ABC logoff 2/1/2020 17:00
I need to find the effective working hours of an employee on a particular date, i.e. the time he has really worked: the total time minus the periods when the status was lock or disconnect.
Example: for employee ABC on 01-JAN-2020, his system was idle between 13:19 - 13:52 (33 minutes) and again from 15:52 - 15:55 (3 minutes).
Hence, out of the total working time of 5 hrs (the time between login and logoff), his effective time would be 5 hrs - 36 minutes = 4 hrs 24 minutes. (Strictly, the 16:25 - 16:30 lock adds another 5 idle minutes, making it 4 hrs 19 minutes.)
Similarly for 01-FEB-2020.

You can use window functions, then aggregation:
select
    id,
    to_date(timestamp) as timestamp_day,
    sum(case when lower(status) in ('lock', 'disconnect') then 0
             else status_duration end) / 60 / 60 as hours_worked
from (
    select t.*,
           unix_timestamp(lead(timestamp) over (partition by id order by timestamp))
               - unix_timestamp(timestamp) as status_duration
    from mytable t
) t
group by id, to_date(timestamp)
order by id, to_date(timestamp)
In the subquery, we use lead() to retrieve the timestamp of the "next" action, so we can compute the duration of the current step (lead() is wrapped in unix_timestamp() so the subtraction yields seconds). The outer query aggregates by employee and day and does the final computation of working hours according to your business rule: steps whose status is lock or disconnect are idle time, so only the other steps contribute to the sum.
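As a quick sanity check of the duration arithmetic (a minimal sketch; Hive's unix_timestamp() accepts a 'yyyy-MM-dd HH:mm:ss' string):

-- lead() fetches the next event's timestamp within the same id; subtracting
-- the two unix_timestamp() values yields seconds. E.g. login 12:00 -> lock 13:19:
select unix_timestamp('2020-01-01 13:19:00') - unix_timestamp('2020-01-01 12:00:00');
-- 4740 seconds = 1 h 19 min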


Count events with a cool-down period after each instance

In a Postgres DB I have entries for "events", associated with an id, and when they happened. I need to count them with a special rule.
When an event happens the counter is incremented and for the next 14 days all events of this type are not counted.
Example:
event  created_at        blockdate  action
16     2021-11-11 11:15  25.11.21   count
16     2021-11-11 11:15  25.11.21   block
16     2021-11-13 10:45  25.11.21   block
16     2021-11-16 10:40  25.11.21   block
16     2021-11-23 11:15  25.11.21   block
16     2021-11-23 11:15  25.11.21   block
16     2021-12-10 13:00  24.12.21   count
16     2021-12-15 13:25  24.12.21   block
16     2021-12-15 13:25  24.12.21   block
16     2021-12-15 13:25  24.12.21   block
16     2021-12-20 13:15  24.12.21   block
16     2021-12-23 13:15  24.12.21   block
16     2021-12-31 13:25  14.01.22   count
16     2022-02-05 15:00  19.02.22   count
16     2022-02-05 15:00  19.02.22   block
16     2022-02-13 17:15  19.02.22   block
16     2022-02-21 10:09  07.03.22   count
43     2021-11-26 11:00  10.12.21   count
43     2022-01-01 15:00  15.01.22   count
43     2022-04-13 10:07  27.04.22   count
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:10  27.04.22   block
43     2022-04-13 10:10  27.04.22   block
43     2022-04-13 10:47  27.04.22   block
43     2022-05-11 20:25  25.05.22   count
75     2021-10-21 12:50  04.11.21   count
75     2021-11-02 12:50  04.11.21   block
75     2021-11-18 11:15  02.12.21   count
75     2021-11-18 12:55  02.12.21   block
75     2021-11-18 16:35  02.12.21   block
75     2021-11-24 11:00  02.12.21   block
75     2021-12-01 11:00  02.12.21   block
75     2021-12-14 13:25  28.12.21   count
75     2021-12-15 13:35  28.12.21   block
75     2021-12-26 13:25  28.12.21   block
75     2022-01-31 15:00  14.02.22   count
75     2022-02-02 15:30  14.02.22   block
75     2022-02-03 15:00  14.02.22   block
75     2022-02-17 15:00  03.03.22   count
75     2022-02-17 15:00  03.03.22   block
75     2022-02-18 15:00  03.03.22   block
75     2022-02-23 15:00  03.03.22   block
75     2022-02-25 15:00  03.03.22   block
75     2022-03-04 10:46  18.03.22   count
75     2022-03-08 21:05  18.03.22   block
In Excel I simply add two columns. In one column I carry over a "blockdate", the date until which events have to be blocked. In the other column I compare the ID with the previous ID and the previous blockdate.
When the IDs are different or the blockdate is less than the current date, I have to count. When I have to count, I set the row's blockdate to the current date + 14 days; otherwise I carry over the previous blockdate.
I have now tried to solve this in Postgres with ...
window functions
recursive CTEs
lateral joins
... and all seemed a bit promising, but in the end I failed to implement this tricky count.
For example, my recursive CTE failed with:
aggregate functions are not allowed in WHERE
with recursive event_count AS (
    select event
         , min(created_at) as created
    from test
    group by event

    union all

    ( select event
           , created_at as created
      from test
      join event_count using (event)
      where created_at >= max(created) + INTERVAL '14 days'
      order by created_at
      limit 1
    )
)
select * from event_count
Window functions using lag() to access the previous row don't seem to work either, because a window function cannot read a value that was itself produced by the window function in a previous row.
Adding "block-or-count" information when a new event entry is inserted, by simply comparing with the last entry, wouldn't solve the issue either, as event entries "go away" after about half a year. When the first entry goes away, the next one becomes the first, and the logic has to be applied to the new situation.
Above test data can be created with:
CREATE TABLE test (
event INTEGER,
created_at TIMESTAMP
);
INSERT INTO test (event, created_at) VALUES
(16, '2021-11-11 11:15'),(16, '2021-11-11 11:15'),(16, '2021-11-13 10:45'),(16, '2021-11-16 10:40'),
(16, '2021-11-23 11:15'),(16, '2021-11-23 11:15'),(16, '2021-12-10 13:00'),(16, '2021-12-15 13:25'),
(16, '2021-12-15 13:25'),(16, '2021-12-15 13:25'),(16, '2021-12-20 13:15'),(16, '2021-12-23 13:15'),
(16, '2021-12-31 13:25'),(16, '2022-02-05 15:00'),(16, '2022-02-05 15:00'),(16, '2022-02-13 17:15'),
(16, '2022-02-21 10:09'),
(43, '2021-11-26 11:00'),(43, '2022-01-01 15:00'),(43, '2022-04-13 10:07'),(43, '2022-04-13 10:09'),
(43, '2022-04-13 10:09'),(43, '2022-04-13 10:09'),(43, '2022-04-13 10:10'),(43, '2022-04-13 10:10'),
(43, '2022-04-13 10:47'),(43, '2022-05-11 20:25'),
(75, '2021-10-21 12:50'),(75, '2021-11-02 12:50'),(75, '2021-11-18 11:15'),(75, '2021-11-18 12:55'),
(75, '2021-11-18 16:35'),(75, '2021-11-24 11:00'),(75, '2021-12-01 11:00'),(75, '2021-12-14 13:25'),
(75, '2021-12-15 13:35'),(75, '2021-12-26 13:25'),(75, '2022-01-31 15:00'),(75, '2022-02-02 15:30'),
(75, '2022-02-03 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-18 15:00'),
(75, '2022-02-23 15:00'),(75, '2022-02-25 15:00'),(75, '2022-03-04 10:46'),(75, '2022-03-08 21:05');
This lends itself to a procedural solution, since it has to walk the whole history of existing rows for each event. But SQL can do it, too.
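For reference, the procedural route could look like this (a PL/pgSQL sketch of mine, not from the original answer; it reads the rows once in index order and applies the 14-day rule per event):

CREATE OR REPLACE FUNCTION count_hits()   -- illustrative name
RETURNS TABLE (event int, hits bigint)
LANGUAGE plpgsql AS
$$
DECLARE
   r           record;
   block_until timestamp;
BEGIN
   FOR r IN SELECT t.event, t.created_at FROM test t ORDER BY 1, 2 LOOP
      IF event IS DISTINCT FROM r.event THEN      -- a new event id starts
         IF event IS NOT NULL THEN
            RETURN NEXT;                          -- emit the finished group
         END IF;
         event := r.event;                        -- OUT parameters double as state
         hits  := 0;
         block_until := NULL;
      END IF;
      IF block_until IS NULL OR r.created_at > block_until THEN
         hits := hits + 1;                        -- count, and start a 14-day block
         block_until := r.created_at + interval '14 days';
      END IF;
   END LOOP;
   IF event IS NOT NULL THEN
      RETURN NEXT;                                -- emit the last group
   END IF;
END
$$;

SELECT * FROM count_hits();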
The best solution heavily depends on cardinalities, data distribution, and other circumstances.
Assuming unfavorable conditions:
Big table.
Unknown number and identity of relevant events (event IDs).
Many rows per event.
Some overlap the 14-day time frame, some don't.
Any number of duplicates possible.
You need an index like this one:
CREATE INDEX test_event_created_at_idx ON test (event, created_at);
Then the following query emulates an index-skip scan. If the table is vacuumed enough, it operates with index-only scans exclusively, in a single pass:
WITH RECURSIVE hit AS (
(
SELECT event, created_at
FROM test
ORDER BY event, created_at
LIMIT 1
)
UNION ALL
SELECT t.*
FROM hit h
CROSS JOIN LATERAL (
SELECT t.event, t.created_at
FROM test t
WHERE (t.event, t.created_at)
> (h.event, h.created_at + interval '14 days')
ORDER BY t.event, t.created_at
LIMIT 1
) t
)
SELECT count(*) AS hits FROM hit;
I cannot stress enough how fast it's going to be. :)
It's a recursive CTE using a LATERAL subquery, all based on the magic of ROW value comparison (which not all major RDBMSs support properly).
Effectively, we make Postgres skip over the above index once and only take qualifying rows.
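If ROW value comparison is unfamiliar, here is a minimal illustration (runnable as-is in Postgres):

-- (a, b) > (x, y) is true when a > x, or a = x and b > y: lexicographic order,
-- exactly the order of the two-column index above.
SELECT (16, TIMESTAMP '2021-11-25 11:15') > (16, TIMESTAMP '2021-11-11 11:15');  -- true
SELECT (16, TIMESTAMP '2021-12-31 13:25') > (43, TIMESTAMP '2021-11-26 11:00');  -- false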
For detailed explanation, see:
SELECT DISTINCT is slower than expected on my table in PostgreSQL
Efficiently selecting distinct (a, b) from big table
Optimize GROUP BY query to retrieve latest row per user (chapter 1a)
Different approach?
Like you mention yourself, the unfortunate task definition forces you to re-compute all newer rows for events where old data changes.
Consider working with a constant raster instead. Like a 14-day grid starting from Jan 1 every year. Then the state of each event could be derived from the local frame. Much cheaper and more reliable.
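A sketch of that raster idea (mine, not part of the answer; assumes Postgres 14+ for date_bin() and a grid anchored at 2021-01-01 -- note it implements the fixed-grid rule, which gives different counts than the rolling 14-day rule):

-- Assign every row to its fixed 14-day bucket; each bucket counts once per event:
SELECT event,
       count(DISTINCT date_bin('14 days', created_at, TIMESTAMP '2021-01-01')) AS hits
FROM   test
GROUP  BY event
ORDER  BY event;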
I cannot think of how to do this without recursion.
with recursive ordered as ( -- Order and number the event instances
select event, created_at,
row_number() over (partition by event
order by created_at) as n
from test
), walk as (
-- Get and keep first instances
select event, created_at, n, created_at as current_base, true as keep
from ordered
where n = 1
union all
-- Carry base dates forward and mark records to keep
select c.event, c.created_at, c.n,
case
when c.created_at >= p.current_base + interval '14 days'
then c.created_at
else p.current_base
end as current_base,
(c.created_at >= p.current_base + interval '14 days') as keep
from walk p
join ordered c
on (c.event, c.n) = (p.event, p.n + 1)
)
select *
from walk
order by event, n;
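To reduce that output to one count per event, the final SELECT of the query above can be swapped for an aggregate (a small follow-up sketch):

-- Count only the kept (counted) instances per event:
select event, count(*) filter (where keep) as hits
from walk
group by event
order by event;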

snowflake: counting no. of rows present in an hour as single row

I have a user record for every login a user performs. I need to count how many times each user has logged in, but no matter how many times a user logs in within half an hour, it should count as one time.
USER_ID TIMESTAMP
A1 2021-03-10 10:00:00
A1 2021-03-10 10:01:00
A1 2021-03-10 10:05:00
A1 2021-03-10 10:15:00
A1 2021-03-10 10:32:00
A1 2021-03-10 11:02:00
A1 2021-03-11 12:00:00
A2 2021-03-10 10:01:00
Expected output:
USER_ID COUNT
A1 4
A2 1
I am not able to figure out how to use lag and lead in this situation. Any help would be appreciated.
SELECT user_id,
       count(distinct (date_trunc('hour', timestamp)::text || iff(minute(timestamp) >= 30, '_1', '_0'))) as count
FROM table
GROUP BY 1 ORDER BY 1;
This works by truncating to the hour, turning that into a string, and then appending a suffix per half hour. Not the cleanest, but it should work.
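To see what the string key looks like for a single timestamp (a quick sketch; the exact text depends on the session's timestamp output format):

SELECT date_trunc('hour', '2021-03-10 10:32:00'::timestamp)::text
       || iff(minute('2021-03-10 10:32:00'::timestamp) >= 30, '_1', '_0');
-- something like '2021-03-10 10:00:00.000_1' (minute 32 lands in the '_1' half)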
Ah, this question asked how to get times in 30-minute truncations, for which time_slice is a nice answer:
SELECT user_id, count(distinct time_slice(timestamp, 30, 'MINUTE')) as count
FROM table
GROUP BY user_id ORDER BY user_id;
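For reference, time_slice() truncates a timestamp to the start of its containing slice:

SELECT time_slice('2021-03-10 10:32:00'::timestamp, 30, 'MINUTE');
-- 2021-03-10 10:30:00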

Multiple day-on-day changes based on dates in data that are not continuous

See table A. Each date has a number of sales. The dates are not continuous.
I want table B, which gives the sales move versus the previous date in the dataset.
I am trying to do this in SQL but get stuck. I can compute an individual day-on-day difference by entering the date, but I want a query where I don't need to enter the dates manually.
A
Date Sales
01/01/2019 100
05/01/2019 200
12/01/2019 50
25/01/2019 25
31/01/2019 200
B
Date DOD Move
01/01/2019 -
05/01/2019 +100
12/01/2019 -150
25/01/2019 -25
31/01/2019 +175
Use lag():
select t.*,
(sales - lag(sales) over (order by date)) as dod_move
from t;
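If you also want the signed text exactly as in table B, you can format the difference on top of lag() (a sketch in Postgres-style syntax, still assuming the table is named t):

select date,
       case
           when lag(sales) over (order by date) is null then '-'
           when sales - lag(sales) over (order by date) >= 0
               then '+' || (sales - lag(sales) over (order by date))::text
           else (sales - lag(sales) over (order by date))::text
       end as dod_move
from t;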

SQL - Creating a timeline for each ID (Vertica)

I am dealing with the following problem in SQL (using Vertica):
In short -- Create a timeline for each ID (in a table where I have multiple lines, orders in my example, per ID)
What I would like to achieve -- At my disposal I have a table of historical order dates, and I would like to compute the rates of new customers (first order ever in the past month), active customers (>1 order in the last 1-3 months), passive customers (no order for the last 3-6 months) and inactive customers (no order for >6 months).
Which steps I have taken so far -- I was able to construct a table similar to the example presented below:
CustomerID Current order date Time between current/previous order First order date (all-time)
001 2015-04-30 12:06:58 (null) 2015-04-30 12:06:58
001 2015-09-24 17:30:59 147 05:24:01 2015-04-30 12:06:58
001 2016-02-11 13:21:10 139 19:50:11 2015-04-30 12:06:58
002 2015-10-21 10:38:29 (null) 2015-10-21 10:38:29
003 2015-05-22 12:13:01 (null) 2015-05-22 12:13:01
003 2015-07-09 01:04:51 47 12:51:50 2015-05-22 12:13:01
003 2015-10-23 00:23:48 105 23:18:57 2015-05-22 12:13:01
A little bit of intuition: customer 001 placed three orders, of which the second came 147 days after the first. Customer 002 has only placed one order in total.
What I think that the next steps should be -- I would like to know, for each date (including dates on which a certain user did not place an order) and for each CustomerID, how long it has been since his/her last order. This implies building some sort of timeline for each CustomerID. In the example presented above I would get 287 lines (the days between the 1st of May 2015 and the 11th of February 2016, the timespan of this table) for each CustomerID. I have difficulties with this step.

Once that is done, I want to create a field which shows, at each date, the last order date, the period between the last order date and the current date, and what state someone is in at the current date. For the example presented earlier, this would look something like this:
CustomerID Last order date Current date Time between current date /last order State
001 2015-04-30 12:06:58 2015-05-01 00:00:00 0 00:00:00 New
...
001 2015-04-30 12:06:58 2015-06-30 00:00:00 60 11:53:02 Active
...
001 2015-09-24 17:30:59 2016-02-01 00:00:00 129 11:53:02 Passive
...
...
002 2015-10-21 17:30:59 2015-10-22 00:00:00 0 06:29:01 New
...
002 2015-10-21 17:30:59 2015-11-30 00:00:00 39 06:29:01 Active
...
...
003 2015-05-22 12:13:01 2015-06-23 00:00:00 31 11:46:59 Active
...
003 2015-07-09 01:04:51 2015-10-22 00:00:00 105 11:46:59 Inactive
...
At the dots there should be all the in-between dates, but for the sake of space I have left these out of the table.
When I know for each date what the state of each customer is (new/active/passive/inactive), my plan is to sum the states and group by date, which should give me the number of new, active, passive and inactive customers per date. From there I can easily compute the rates at each date.
Does anybody know how I can achieve this?
Note -- If anyone has other ideas for achieving the goal presented above (using a different approach from the one I had in mind), please let me know!
EDIT
Suppose you start from a table like this:
SQL> select * from ord order by custid, ord_date ;
custid | ord_date
--------+---------------------
1 | 2015-04-30 12:06:58
1 | 2015-09-24 17:30:59
1 | 2016-02-11 13:21:10
2 | 2015-10-21 10:38:29
3 | 2015-05-22 12:13:01
3 | 2015-07-09 01:04:51
3 | 2015-10-23 00:23:48
(7 rows)
You can use Vertica's timeseries analytic functions TS_FIRST_VALUE() and TS_LAST_VALUE() to fill the gaps and interpolate the last_order date up to the current date. You just have to apply them to a timeseries generated from the same table, with an interval of one day, running from the day each customer placed his/her first order up to now (current_date):
select
custid,
status_dt,
last_order_dt,
case
when status_dt::date - last_order_dt::date < 30 then case
when nord = 1 then 'New' else 'Active' end
when status_dt::date - last_order_dt::date < 90 then 'Active'
when status_dt::date - last_order_dt::date < 180 then 'Passive'
else 'Inactive'
end as status
from (
select
custid,
last_order_dt,
status_dt,
conditional_true_event (first_order_dt is null or
last_order_dt > lag(last_order_dt))
over(partition by custid order by status_dt) as nord
from (
select
custid,
ts_first_value(ord_date) as first_order_dt ,
ts_last_value(ord_date) as last_order_dt ,
dt::date as status_dt
from
( select custid, ord_date from ord
union all
select distinct(custid) as custid, current_date + 1 as ord_date from ord
) z timeseries dt as '1 day' over (partition by custid order by ord_date)
) x
) y
where status_dt <= current_date
order by 1, 2
;
And you will get something like this:
custid | status_dt | last_order_dt | status
--------+------------+---------------------+---------
1 | 2015-04-30 | 2015-04-30 12:06:58 | New
1 | 2015-05-01 | 2015-04-30 12:06:58 | New
1 | 2015-05-02 | 2015-04-30 12:06:58 | New
...
1 | 2015-05-29 | 2015-04-30 12:06:58 | New
1 | 2015-05-30 | 2015-04-30 12:06:58 | Active
1 | 2015-05-31 | 2015-04-30 12:06:58 | Active
...
etc.
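From there, the rates the asker is after can be derived by wrapping the query above and pivoting the states per day (a sketch; /* query above */ stands for the full statement):

SELECT status_dt,
       SUM(CASE WHEN status = 'New'      THEN 1 ELSE 0 END) AS new_cust,
       SUM(CASE WHEN status = 'Active'   THEN 1 ELSE 0 END) AS active_cust,
       SUM(CASE WHEN status = 'Passive'  THEN 1 ELSE 0 END) AS passive_cust,
       SUM(CASE WHEN status = 'Inactive' THEN 1 ELSE 0 END) AS inactive_cust
FROM ( /* query above */ ) s
GROUP BY status_dt
ORDER BY status_dt;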

Adding reference data to a table column from a different table row

I have an event table with following columns:
sequence (int)
DeviceID (varchar(8))
time_start (datetime)
DeviceState (smallint)
time_end (datetime)
All columns except time_end are populated with data (currently the time_end column is NULL throughout the table). What I need to do is populate the time_end column with the event closure time, which is the time when the next event from the same device occurred.
Here is an example data model how it should work at the end:
sequence DeviceID time_start DeviceState time_end
--------------------------------------------------------------------------------------
1 000012A7 2010-10-31 12:00 14 2010-10-31 12:10
2 000012A7 2010-10-31 12:10 18 2010-10-31 12:33
3 000012A8 2010-10-31 12:20 16 2010-10-31 13:01
4 000012A7 2010-10-31 12:33 13 2010-10-31 12:47
5 000012A7 2010-10-31 12:47 18 2010-10-31 13:20
6 000012A8 2010-10-31 13:01 20 2010-10-31 13:23
7 000012A7 2010-10-31 13:20 05 2010-10-31 14:12
8 000012A8 2010-10-31 13:23 32 2010-10-31 14:15
9 000012A7 2010-10-31 14:12 12
10 000012A8 2010-10-31 14:15 35
The idea is that for each record within the table I need to find the record with the next higher sequence for that specific device and update time_end with the time_start of that record.
With this I'll be able to track the time period of each event.
I was thinking of doing this with a function call, but I have two main difficulties:
1. getting the data from e.g. sequence=2 and updating the time_end of sequence=1
2. creating a function which will do this continuously as new records are added to the table
I'm quite new to SQL and I'm quite lost on what is possible. Based on my knowledge I should use a function which ties the data together, but my current knowledge limits me there.
I hope someone can give me some guidance on which direction to go and some feedback on whether I'm on the right track. Any supporting articles would be very much appreciated.
View:
CREATE VIEW tableview AS
with timerank AS
(
SELECT mytable.*, ROW_NUMBER() OVER (PARTITION BY DeviceID ORDER BY time_start) as row
FROM THE_TABLE mytable
)
SELECT tstart.*, tend.time_start AS time_end
FROM timerank tstart
LEFT JOIN timerank tend ON tstart.row = tend.row - 1
AND tstart.DeviceID = tend.DeviceID
Edit: I see your deviceID requirement now.
@OMG Ponies: I think this is slightly better formatting:
UPDATE YOUR_TABLE
SET time_end = (SELECT TOP 1
t.time_start
FROM YOUR_TABLE t
WHERE t.DeviceID = YOUR_TABLE.DeviceID
AND t.time_start > YOUR_TABLE.time_start
ORDER BY t.time_start ASC)
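On SQL Server 2012 or later, the same lookup can also be expressed with LEAD(), avoiding the self-join entirely (a sketch against the same YOUR_TABLE):

SELECT sequence, DeviceID, time_start, DeviceState,
       LEAD(time_start) OVER (PARTITION BY DeviceID ORDER BY time_start) AS time_end
FROM YOUR_TABLE;

Used as a view, this computes the closing time on read, so the second difficulty (keeping time_end up to date as new records arrive) disappears.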