Group records for hourly count - sql

My goal is to build an hourly count for records that have a start date/time and an end date/time. The actual records are never more than 24 hours from start to finish but many times are less. It works if I bounce every record against my "clock" which has 24 slots for every date up to "today". But it can take forever to run as there can be 2000 records in a day.
This is the detail I get:
The date/times in green are what I want as the start date/time for a group. The blue date/times are what I want as the end date time for the group.
Like this:
I have tried partitioning but because, in the second pic, the 4th row has the same values as the 2nd row, it groups them together even though there is a time span between them - the third row.

This is a gaps-and-islands problem. The start and end dates match on adjacent rows, so a difference of row numbers seems sufficient:
select id, min(startdatetime), max(enddatetime),
       d_id, class, location
from (select t.*,
             row_number() over (partition by id order by startdatetime) as seqnum,
             row_number() over (partition by id, d_id, class, location order by startdatetime) as seqnum_2
      from t
     ) t
group by id, d_id, class, location, (seqnum - seqnum_2)
order by id, min(startdatetime);
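As a rough check of the row-number-difference idea, here is a minimal sketch that runs the query in an in-memory SQLite database through Python's sqlite3 module. The sample rows are invented for illustration, the table layout is assumed from the column names in the query, and both row numbers are assumed to be ordered by startdatetime. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical sample rows: (id, startdatetime, enddatetime, d_id, class, location).
# Row 4 repeats the attributes of rows 1-2 but is separated from them by row 3.
rows = [
    (1, "2020-01-01 08:00", "2020-01-01 09:00", 10, "A", "X"),
    (1, "2020-01-01 09:00", "2020-01-01 10:00", 10, "A", "X"),
    (1, "2020-01-01 10:00", "2020-01-01 11:00", 20, "B", "Y"),
    (1, "2020-01-01 11:00", "2020-01-01 12:00", 10, "A", "X"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (id integer, startdatetime text, enddatetime text,"
    " d_id integer, class text, location text)"
)
conn.executemany("insert into t values (?, ?, ?, ?, ?, ?)", rows)

# seqnum counts rows per id; seqnum_2 counts rows per attribute combination.
# Their difference stays constant inside one unbroken run of equal attributes.
query = """
select id, min(startdatetime), max(enddatetime), d_id, class, location
from (select t.*,
             row_number() over (partition by id order by startdatetime) as seqnum,
             row_number() over (partition by id, d_id, class, location
                                order by startdatetime) as seqnum_2
      from t
     ) t
group by id, d_id, class, location, (seqnum - seqnum_2)
order by id, min(startdatetime)
"""
islands = conn.execute(query).fetchall()
for island in islands:
    print(island)
```

With this data the first two rows collapse into one group, while the fourth row stays separate from them even though its attributes match.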

Related

SQL - Update column based on date comparison with previous occurrence

I have a huge table;
I want to create a third column based on the time difference between two dates for the same id. If the difference is less than a month, it's active; if it is between 1 and 2 months, it's inactive; and anything more than 2 months is dormant. The expected outcome is below (note that the last entries don't have activity definitions, as I don't have previous occurrences).
My question is how to do such an operation.
case
    when date_ >= date_add((select max(date_) from schema.table), -30) then 'Active'
    when date_ < date_add((select max(date_) from schema.table), -30)
     and date_ >= date_add((select max(date_) from schema.table), -60) then 'Inactive'
    when date_ < date_add((select max(date_) from schema.table), -60) then 'Dormant3'
end as Activity
The code I came up with is not what I need, as it only compares against the final entry date in the table. What I need is more akin to a for loop that checks each row and compares it to the previous occurrence.
edit:
By partitioning over id and dense ranking them, I reached something that almost works. I just need to compare to the previous element in the dense rank groups.
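Create the base data first using LAG(), which returns the previous date for each id.
Then compare it with the date on the original row.
SELECT ID, DATE,
       CASE
           WHEN PREVIOUS_DATE IS NULL THEN NULL  -- first entry per ID: nothing to compare against
           WHEN DATEDIFF(DATE, PREVIOUS_DATE) <= 30 THEN 'Active'
           WHEN DATEDIFF(DATE, PREVIOUS_DATE) BETWEEN 31 AND 60 THEN 'Inactive'
           ELSE 'Dormant'
       END AS Activity
FROM (SELECT ID, DATE,
             LAG(DATE) OVER (PARTITION BY ID ORDER BY DATE) AS PREVIOUS_DATE
      FROM MYTABLE) RS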
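To see the previous-occurrence comparison on concrete data, here is a small sketch run in SQLite via Python's sqlite3 module. It uses LAG() to fetch the previous date per id. The sample dates and the table name mytable are made up, the date column is called date_ as in the question, and SQLite's julianday() stands in for DATEDIFF, which SQLite does not provide. Rows with no previous occurrence are assumed to get no label, as the question describes.

```python
import sqlite3

# Hypothetical activity dates for a single id.
rows = [
    (1, "2021-01-01"),
    (1, "2021-01-20"),  # 19 days after the previous entry
    (1, "2021-03-05"),  # 44 days after the previous entry
    (1, "2021-06-01"),  # 88 days after the previous entry
]

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (id integer, date_ text)")
conn.executemany("insert into mytable values (?, ?)", rows)

# LAG() returns the previous date within the same id; the outer CASE
# buckets the gap in days. julianday() replaces DATEDIFF in SQLite.
query = """
select id, date_,
       case
           when previous_date is null then null
           when julianday(date_) - julianday(previous_date) <= 30 then 'Active'
           when julianday(date_) - julianday(previous_date) <= 60 then 'Inactive'
           else 'Dormant'
       end as activity
from (select id, date_,
             lag(date_) over (partition by id order by date_) as previous_date
      from mytable) rs
order by id, date_
"""
labelled = conn.execute(query).fetchall()
for row in labelled:
    print(row)
```

The first row of each id comes back with a NULL activity because there is nothing earlier to compare it with.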

How to compare the value of one row with the upper row in one column of an ordered table?

I have a table in PostgreSQL that contains the GPS points from cell phones. It has an integer column that stores the epoch (the number of seconds since 1970). I want to order the table based on time (epoch column), then, break the trips to sub trips when there is no GPS record for more than 2 minutes.
I did it with GeoPandas. However, it is too slow. I want to do it inside the PostgreSQL. How can I compare each row of the ordered table with the previous row (to see if the epoch has a difference of 2 minutes or more)?
In fact, I do not know how to compare each row with the upper row.
You can use lag():
select t.*
from (select t.*,
             lag(timestamp_epoch) over (partition by trip order by timestamp_epoch) as last_timestamp_epoch
      from t
     ) t
where last_timestamp_epoch < timestamp_epoch - 120
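A quick way to try the lag() filter is the sketch below, which runs the same query in SQLite through Python's sqlite3 module rather than in PostgreSQL. The epoch values are invented, and an ORDER BY is added only so the output order is predictable.

```python
import sqlite3

# Hypothetical GPS points for one trip: (trip, timestamp_epoch).
# Gaps between consecutive points: 60, 40, 200, 50, 250 seconds.
points = [(1, 1000), (1, 1060), (1, 1100), (1, 1300), (1, 1350), (1, 1600)]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (trip integer, timestamp_epoch integer)")
conn.executemany("insert into t values (?, ?)", points)

# Keep only the rows whose previous point is more than 120 seconds older,
# i.e. the first point after each long gap.
query = """
select t.*
from (select t.*,
             lag(timestamp_epoch) over (partition by trip
                                        order by timestamp_epoch) as last_timestamp_epoch
      from t
     ) t
where last_timestamp_epoch < timestamp_epoch - 120
order by trip, timestamp_epoch
"""
gap_rows = conn.execute(query).fetchall()
for row in gap_rows:
    print(row)
```

Each returned row marks where a new sub trip would begin. The very first point of a trip is not returned, because its lag() value is NULL.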
I want to order the table based on time (epoch column), then, break the trips to sub trips when there is no GPS record for more than 2 minutes.
After comparing to the previous (or next) row, with the window function lag() (or lead()), form groups based on the gaps to get sub trip numbers:
SELECT *, count(*) FILTER (WHERE step) OVER (PARTITION BY trip ORDER BY timestamp_epoch) AS sub_trip
FROM (
SELECT *
, (timestamp_epoch - lag(timestamp_epoch) OVER (PARTITION BY trip ORDER BY timestamp_epoch)) > 120 AS step
FROM tbl
) sub;
Further reading:
Select longest continuous sequence
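The sub trip numbering can be sketched the same way. This example runs in SQLite via Python's sqlite3 module, and it swaps the FILTER clause for an equivalent running SUM(CASE ...) so that it also works on engines or versions without FILTER on window aggregates. The data and the table name tbl are illustrative only.

```python
import sqlite3

# Hypothetical GPS points for one trip: (trip, timestamp_epoch).
# Gaps between consecutive points: 60, 40, 200, 50, 250 seconds.
points = [(1, 1000), (1, 1060), (1, 1100), (1, 1300), (1, 1350), (1, 1600)]

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (trip integer, timestamp_epoch integer)")
conn.executemany("insert into tbl values (?, ?)", points)

# Inner query: step is true when the gap to the previous point exceeds 120 s.
# Outer query: a running count of steps yields the sub trip number.
query = """
select trip, timestamp_epoch,
       sum(case when step then 1 else 0 end)
           over (partition by trip order by timestamp_epoch) as sub_trip
from (
    select *,
           (timestamp_epoch - lag(timestamp_epoch)
                over (partition by trip order by timestamp_epoch)) > 120 as step
    from tbl
) sub
order by trip, timestamp_epoch
"""
numbered = conn.execute(query).fetchall()
for row in numbered:
    print(row)
```

Sub trips are numbered from 0 within each trip, and the number increases by one at every gap longer than two minutes.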

Comparing time difference for every other row

I'm trying to determine the number of days between AR_Event_Creation_Date_Time values for every other pair of rows. For example, the number of days between the 1st and 2nd rows, the 3rd and 4th, the 5th and 6th, etc. In other words, there will be a number-of-days value for every even row and NULL for every odd row. My code below works if there are only two rows per borrower number but falls down when there are more than two. In the results, notice the change in 1002092539
SELECT Borrower_Number,
       Workgroup_Name,
       FORMAT(AR_Event_Creation_Date_Time, 'd', 'en-us') AS Tag_Date,
       Usr_Usrnm,
       DATEDIFF(day,
                LAG(AR_Event_Creation_Date_Time, 1) OVER (PARTITION BY Borrower_Number ORDER BY Borrower_Number),
                AR_Event_Creation_Date_Time) AS Diff
FROM Control_Mail
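You need to add in a row number. Also, your ORDER BY is non-deterministic: it sorts by Borrower_Number, the same column you partition by, so the row order within each partition is arbitrary: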
SELECT Borrower_Number,
       Workgroup_Name,
       FORMAT(AR_Event_Creation_Date_Time, 'd', 'en-us') AS Tag_Date,
       Usr_Usrnm,
       DATEDIFF(day,
                LAG(AR_Event_Creation_Date_Time, 1) OVER (PARTITION BY Borrower_Number, (rn - 1) / 2 ORDER BY AR_Event_Creation_Date_Time),
                AR_Event_Creation_Date_Time) AS Diff
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Borrower_Number ORDER BY AR_Event_Creation_Date_Time) AS rn
    FROM Control_Mail
) C
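The pairing trick, partitioning by (rn - 1) / 2 so that rows 1-2, 3-4, 5-6 each form their own partition, can be tried with the sketch below. It runs in SQLite via Python's sqlite3 module, so julianday() replaces SQL Server's DATEDIFF. The table is cut down to two hypothetical columns and the dates are invented.

```python
import sqlite3

# Hypothetical events for one borrower: (borrower_number, event_date).
events = [
    (1, "2020-01-01"),
    (1, "2020-01-05"),
    (1, "2020-02-01"),
    (1, "2020-02-11"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table control_mail (borrower_number integer, event_date text)")
conn.executemany("insert into control_mail values (?, ?)", events)

# rn numbers the rows per borrower; integer division (rn - 1) / 2 maps
# rows 1-2 to pair 0, rows 3-4 to pair 1, and so on. LAG() then only
# looks back within the same pair.
query = """
select borrower_number, event_date,
       cast(julianday(event_date)
            - julianday(lag(event_date) over (partition by borrower_number, (rn - 1) / 2
                                              order by event_date))
            as integer) as diff
from (
    select *,
           row_number() over (partition by borrower_number
                              order by event_date) as rn
    from control_mail
) c
order by borrower_number, event_date
"""
pairs = conn.execute(query).fetchall()
for row in pairs:
    print(row)
```

Odd rows come back with a NULL diff and even rows with the number of days since the row they are paired with.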

aggregate multiple rows based on time ranges

I have a customer who uses different devices over a specific period of time, tracked with a valid_from and valid_to date. But every time something changes for a device, a new row is written without any visible change in the row data besides a new valid from/to.
What I'm trying to do is aggregate the first two rows into one, and the same for rows 3 and 4, while leaving rows 5 and 6 as they are. All the solutions I have come up with so far work only for a usage history where the user does not switch back to device a; otherwise everything keeps failing.
I'd really appreciate some help, thanks in advance!
If you know that the previous valid_to is the same as the current valid_from, then you can use lag() to identify where a new grouping starts. Then use a cumulative sum to calculate the grouping and finally aggregation:
select cust, act_dev, min(valid_from), max(valid_to)
from (select t.*,
             sum(case when prev_valid_to = valid_from then 0 else 1 end) over (partition by cust order by valid_from) as grouping
      from (select t.*,
                   lag(valid_to) over (partition by cust, act_dev order by valid_from) as prev_valid_to
            from t
           ) t
     ) t
group by cust, act_dev, grouping;
Here is a db<>fiddle.
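Here is a minimal sketch of the lag() plus cumulative-sum approach, run in SQLite via Python's sqlite3 module. The six rows are invented to mimic the situation in the question, including a switch back to device A, and the alias grp is used in place of grouping purely as a precaution against keyword clashes.

```python
import sqlite3

# Hypothetical device history: (cust, act_dev, valid_from, valid_to).
history = [
    (1, "A", "2020-01-01", "2020-02-01"),
    (1, "A", "2020-02-01", "2020-03-01"),
    (1, "B", "2020-03-01", "2020-04-01"),
    (1, "B", "2020-04-01", "2020-05-01"),
    (1, "A", "2020-05-01", "2020-06-01"),
    (1, "B", "2020-06-01", "2020-07-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (cust integer, act_dev text, valid_from text, valid_to text)")
conn.executemany("insert into t values (?, ?, ?, ?)", history)

# Innermost query: previous valid_to for the same customer and device.
# Middle query: a row starts a new group unless it continues exactly where
# the previous row of that device ended; the running sum numbers the groups.
query = """
select cust, act_dev, min(valid_from), max(valid_to)
from (select t.*,
             sum(case when prev_valid_to = valid_from then 0 else 1 end)
                 over (partition by cust order by valid_from) as grp
      from (select t.*,
                   lag(valid_to) over (partition by cust, act_dev
                                       order by valid_from) as prev_valid_to
            from t
           ) t
     ) t
group by cust, act_dev, grp
order by cust, min(valid_from)
"""
periods = conn.execute(query).fetchall()
for period in periods:
    print(period)
```

Rows 1-2 and rows 3-4 merge into one period each, while the later return to device A and the final device B row stay as separate periods.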

SQL Aggregation with only one table

So this problem has been bugging me a little for the last week or so. I'm working with a database which hasn't exactly been designed in a way that I like and I'm having to do a lot of work-arounds to get the queries to function in a way I would like.
Essentially, I'm trying to remove duplicate entries that are created as a side effect of an earlier entry. For the sake of argument, say that a customer places an order or issues a job (this only occurs once), but as a result of the interactions a series of other rows is created to represent sub-orders or jobs. All duplicate records should have the same finish time, so what I'm trying to create is a query that returns the record with the earliest start time and ignores all other records with the same finish time. All of this occurs within the same table.
Something like:
select starttime
, endtime
, description
, entrynumber
from table
where starttime = min
and endtime = endtime
Probably what you want is something like this:
;WITH OrderedTable AS
(
    Select ROW_NUMBER() OVER (PARTITION BY endtime ORDER BY starttime) as rn,
           starttime, endtime, description, entrynumber
    From Table
)
Select starttime, endtime, description, entrynumber
FROM OrderedTable
WHERE rn = 1
What this does is group all the rows with the same end time, order them by start time, and give each an additional "row number" column starting at 1 and increasing. If you filter by rn = 1, you get only the row with the earliest start time in each group, ignoring the rest.
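The rn = 1 filter can be checked with a short sketch in SQLite via Python's sqlite3 module. The table name jobs and its rows are hypothetical stand-ins, since the real table is not shown and table itself is a reserved word.

```python
import sqlite3

# Hypothetical rows: (starttime, endtime, description, entrynumber).
# Entries 1-3 share one finish time, entries 4-5 share another.
jobs = [
    ("08:00", "12:00", "order", 1),
    ("09:00", "12:00", "sub-order", 2),
    ("10:00", "12:00", "sub-order", 3),
    ("13:00", "15:00", "order", 4),
    ("13:30", "15:00", "sub-order", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table jobs (starttime text, endtime text, description text, entrynumber integer)"
)
conn.executemany("insert into jobs values (?, ?, ?, ?)", jobs)

# Number the rows within each endtime by starttime, then keep number 1,
# which is the earliest start for that finish time.
query = """
with ordered_table as (
    select row_number() over (partition by endtime order by starttime) as rn,
           starttime, endtime, description, entrynumber
    from jobs
)
select starttime, endtime, description, entrynumber
from ordered_table
where rn = 1
order by endtime
"""
earliest = conn.execute(query).fetchall()
for row in earliest:
    print(row)
```

If two rows tie on both end time and start time, ROW_NUMBER() picks one of them arbitrarily, so a tie-breaking column in the ORDER BY may be worth adding.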