Delete all rows except first N rows within each time interval group - sql

I have a table:
ID | DateTime
1 | 2022-01-30 01:02:03
1 | 2022-01-30 01:34:03
1 | 2022-01-30 02:59:03
2 | 2022-01-30 01:02:03
2 | 2022-01-30 01:34:03
2 | 2022-01-30 02:59:03
I would like to delete all rows except one per hour for each unique ID, so the resulting table would look like:
ID | DateTime
1 | 2022-01-30 01:02:03
1 | 2022-01-30 02:59:03
2 | 2022-01-30 01:02:03
2 | 2022-01-30 02:59:03

You can use a CTE (in SQL Server a CTE can be the target of a DELETE) together with a window function:
with cte as (
    select *, row_number() over (
        partition by id, cast(datetime as date), datepart(hour, datetime)
        order by datetime
    ) as rn
    from t
)
select * -- delete
from cte
where rn > 1
Change select * to delete once you're sure that the query contains the correct rows.
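As a runnable illustration of the same idea, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption made only to keep the example self-contained: strftime stands in for DATEPART, the delete goes through rowid rather than through the CTE, and the column name dt is hypothetical.

```python
import sqlite3

# In-memory SQLite database (assumption: SQLite 3.25+ for window functions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, dt TEXT)")  # dt is a hypothetical column name
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (1, "2022-01-30 01:02:03"),
        (1, "2022-01-30 01:34:03"),
        (1, "2022-01-30 02:59:03"),
        (2, "2022-01-30 01:02:03"),
        (2, "2022-01-30 01:34:03"),
        (2, "2022-01-30 02:59:03"),
    ],
)

# Number the rows inside each (id, date + hour) bucket, earliest first,
# then delete every row that is not the first one in its bucket.
conn.execute(
    """
    WITH cte AS (
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY id, strftime('%Y-%m-%d %H', dt)
                   ORDER BY dt
               ) AS rn
        FROM t
    )
    DELETE FROM t
    WHERE rowid IN (SELECT rid FROM cte WHERE rn > 1)
    """
)

remaining = conn.execute("SELECT id, dt FROM t ORDER BY id, dt").fetchall()
for row in remaining:
    print(row)
```

Changing the filter to rn > N would keep the first N rows per bucket instead of just one.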

Related

SQL: How to create a daily view based on different time intervals using SQL logic?

Here is an example:
Id|price|Date
1|2|2022-05-21
1|3|2022-06-15
1|2.5|2022-06-19
Needs to look like this:
Id|Date|price
1|2022-05-21|2
1|2022-05-22|2
1|2022-05-23|2
...
1|2022-06-15|3
1|2022-06-16|3
1|2022-06-17|3
1|2022-06-18|3
1|2022-06-19|2.5
1|2022-06-20|2.5
...
Until today
1|2022-08-30|2.5
I tried using lag(price) over (partition by id order by date), but I can't get it right.
I'm not familiar with Azure, but it looks like you need to use a calendar table, or generate missing dates using a recursive CTE.
To get started with a recursive CTE, generate row numbers for each id (assuming multiple id values) in the source data, ordered by date. The rows with row number 1 (the minimum date for each id) are the starting point for the recursion. Each recursive step then uses the DATEADD function to generate the row for the next day. To pick up the price values from the original data, a subquery looks up the price for the new date; if there is no row for that date, COALESCE falls back to the previous price carried in the CTE.
For SQL Server, the query can look like this:
WITH cte AS (
    SELECT
        id,
        date,
        price
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
        FROM tbl
    ) t
    WHERE rn = 1
    UNION ALL
    SELECT
        cte.id,
        DATEADD(d, 1, cte.date),
        COALESCE(
            (SELECT tbl.price
             FROM tbl
             WHERE tbl.id = cte.id AND tbl.date = DATEADD(d, 1, cte.date)),
            cte.price
        )
    FROM cte
    WHERE DATEADD(d, 1, cte.date) <= GETDATE()
)
SELECT * FROM cte
ORDER BY id, date
OPTION (MAXRECURSION 0)
Note that I added OPTION (MAXRECURSION 0) to remove the recursion limit: the default limit of 100 levels is not enough to generate every date in the range.
The same approach works for MySQL (CTE support requires MySQL 8.0 or later):
WITH RECURSIVE cte AS (
    SELECT
        id,
        date,
        price
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
        FROM tbl
    ) t
    WHERE rn = 1
    UNION ALL
    SELECT
        cte.id,
        DATE_ADD(cte.date, interval 1 day),
        COALESCE(
            (SELECT tbl.price
             FROM tbl
             WHERE tbl.id = cte.id AND tbl.date = DATE_ADD(cte.date, interval 1 day)),
            cte.price
        )
    FROM cte
    WHERE DATE_ADD(cte.date, interval 1 day) <= NOW()
)
SELECT * FROM cte
ORDER BY id, date
Both queries produce the same results; the only difference is the engine-specific date functions.
For MySQL versions below 8.0, there is no CTE support, so you cannot generate the required date range in the query itself; use a calendar table instead.
Assuming the calendar table has a column that stores date values (let's call it date for simplicity), you can use CROSS JOIN to pair every id in your table with every calendar date from that id's first date onward. A subquery then fetches the latest price stored for that date or any earlier date.
So the query would look like this:
SELECT
    d.id,
    d.date,
    (SELECT
         price
     FROM tbl
     WHERE tbl.id = d.id AND tbl.date <= d.date
     ORDER BY tbl.date DESC
     LIMIT 1
    ) price
FROM (
    SELECT
        t.id,
        c.date
    FROM calendar c
    CROSS JOIN (SELECT DISTINCT id FROM tbl) t
    WHERE c.date BETWEEN (
        SELECT
            MIN(date) min_date
        FROM tbl
        WHERE tbl.id = t.id
    )
    AND NOW()
) d
ORDER BY id, date
Using my pseudo-calendar table with date values ranging from 2022-05-20 to 2022-05-30 and source data in that range:
id | price | date
1 | 2 | 2022-05-21
1 | 3 | 2022-05-25
1 | 2.5 | 2022-05-28
2 | 10 | 2022-05-25
2 | 100 | 2022-05-30
the query produces the following results:
id | date | price
1 | 2022-05-21 | 2
1 | 2022-05-22 | 2
1 | 2022-05-23 | 2
1 | 2022-05-24 | 2
1 | 2022-05-25 | 3
1 | 2022-05-26 | 3
1 | 2022-05-27 | 3
1 | 2022-05-28 | 2.5
1 | 2022-05-29 | 2.5
1 | 2022-05-30 | 2.5
2 | 2022-05-25 | 10
2 | 2022-05-26 | 10
2 | 2022-05-27 | 10
2 | 2022-05-28 | 10
2 | 2022-05-29 | 10
2 | 2022-05-30 | 100

PostgreSQL ROW_NUMBER with timestamp conditions

I'm trying to extend PARTITION BY so that a row stays in the same partition if its ts_created is within 1 hour of the previous row.
SELECT t1.id,
       t1.user_email,
       t1.ts_created,
       t1.prev_ts,
       row_number() OVER (PARTITION BY t1.user_email ORDER BY t1.ts_created DESC) AS time_order
FROM (SELECT id,
             user_email,
             ts_created,
             lag(ts_created) OVER (PARTITION BY user_email ORDER BY ts_created DESC) AS prev_ts
      FROM table1) AS t1
ORDER BY t1.ts_created DESC;
So far I'm partitioning by user_email and have prepared the timestamp of the previous row. Now I'm a bit lost on how to handle the time difference between the current and previous row.
Expected result:
id | user_email | ts_created | time_order
6 | mailA | 2022-01-01 07:30:00.000 | 1
5 | mailA | 2022-01-01 06:40:00.000 | 2
4 | mailA | 2022-01-01 05:50:00.000 | 3
3 | mailA | 2022-01-01 05:00:00.000 | 4
2 | mailA | 2022-01-01 03:50:00.000 | 1
1 | mailB | 2021-01-01 03:30:00.000 | 1
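One common way to get this kind of numbering is a "gaps and islands" pass: flag each row that starts a new group, turn the flags into a running group number, then number the rows inside each group. The sketch below is not from the original thread. It uses Python's sqlite3 module instead of PostgreSQL purely so it can run standalone, and the names with_prev, flagged, grouped, new_group and grp are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite 3.25+ for window functions
conn.execute("CREATE TABLE table1 (id INTEGER, user_email TEXT, ts_created TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        (6, "mailA", "2022-01-01 07:30:00.000"),
        (5, "mailA", "2022-01-01 06:40:00.000"),
        (4, "mailA", "2022-01-01 05:50:00.000"),
        (3, "mailA", "2022-01-01 05:00:00.000"),
        (2, "mailA", "2022-01-01 03:50:00.000"),
        (1, "mailB", "2021-01-01 03:30:00.000"),
    ],
)

rows = conn.execute(
    """
    WITH with_prev AS (
        -- previous row in descending time order, per user
        SELECT id, user_email, ts_created,
               LAG(ts_created) OVER (
                   PARTITION BY user_email ORDER BY ts_created DESC
               ) AS prev_ts
        FROM table1
    ),
    flagged AS (
        -- 1 when the row starts a new group: no previous row,
        -- or more than 3600 seconds away from the previous row
        SELECT *,
               CASE
                   WHEN prev_ts IS NULL THEN 1
                   WHEN CAST(strftime('%s', prev_ts) AS INTEGER)
                        - CAST(strftime('%s', ts_created) AS INTEGER) > 3600 THEN 1
                   ELSE 0
               END AS new_group
        FROM with_prev
    ),
    grouped AS (
        -- running sum of the flags gives a group number per user
        SELECT *,
               SUM(new_group) OVER (
                   PARTITION BY user_email ORDER BY ts_created DESC
               ) AS grp
        FROM flagged
    )
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY user_email, grp ORDER BY ts_created DESC
           ) AS time_order
    FROM grouped
    ORDER BY ts_created DESC
    """
).fetchall()

for row in rows:
    print(row)
```

In PostgreSQL the seconds arithmetic would likely be replaced by an interval comparison on the two timestamps; the three-step structure stays the same.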

Using the earliest date of a partition to determine what other dates belong to that partition

Assume this is my table:
ID DATE
--------------
1 2018-11-12
2 2018-11-13
3 2018-11-14
4 2018-11-15
5 2018-11-16
6 2019-03-05
7 2019-05-07
8 2019-05-08
9 2019-05-08
I need the partitions to be determined by the first date in each partition: any date within 2 days of that first date belongs to the same partition.
The table would end up looking like this if each partition was ranked
PARTITION ID DATE
------------------------
1 1 2018-11-12
1 2 2018-11-13
1 3 2018-11-14
2 4 2018-11-15
2 5 2018-11-16
3 6 2019-03-05
4 7 2019-05-07
4 8 2019-05-08
4 9 2019-05-08
I've tried using datediff with lag to compare to the previous date but that would allow a partition to be inappropriately sized based on spacing, for example all of these dates would be included in the same partition:
ID DATE
--------------
1 2018-11-12
2 2018-11-14
3 2018-11-16
4 2018-11-18
5 2018-11-20
6 2018-11-22
Previous flawed attempt:
Mark when a date is more than 2 days past the previous date:
(case when datediff(day, lag(event_time, 1) over (partition by user_id, stage order by event_time), event_time) > 2 then 1 else 0 end)
You need to use a recursive CTE for this, so the operation is expensive.
with numbered as (
    -- add an incrementing column with no gaps
    select t.*, row_number() over (order by date) as seqnum
    from t
),
cte as (
    -- anchor: the earliest row opens the first partition
    select id, date, date as mindate, seqnum
    from numbered
    where seqnum = 1
    union all
    -- step: keep the current partition's first date while the next row
    -- is within 2 days of it; otherwise the next row opens a new partition
    select n.id, n.date,
           (case when n.date <= dateadd(day, 2, cte.mindate)
                 then cte.mindate else n.date
            end) as mindate,
           n.seqnum
    from cte join
         numbered n
         on n.seqnum = cte.seqnum + 1
)
select cte.*, dense_rank() over (order by mindate) as partition_num
from cte;
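To see the anchor-date logic run end to end on the question's data, here is a minimal sketch using Python's sqlite3 module. SQLite is an assumption for the sake of a self-contained example: date(x, '+2 days') stands in for dateadd, and the table name events and column name event_date are hypothetical. The id is added as a tie-breaker when numbering rows because two rows share the same date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite 3.25+ for window functions
conn.execute("CREATE TABLE events (id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2018-11-12"), (2, "2018-11-13"), (3, "2018-11-14"),
        (4, "2018-11-15"), (5, "2018-11-16"), (6, "2019-03-05"),
        (7, "2019-05-07"), (8, "2019-05-08"), (9, "2019-05-08"),
    ],
)

result = conn.execute(
    """
    WITH RECURSIVE numbered AS (
        -- gap-free sequence so the recursion can step row by row
        SELECT id, event_date,
               ROW_NUMBER() OVER (ORDER BY event_date, id) AS seqnum
        FROM events
    ),
    cte AS (
        -- anchor: the earliest row opens the first partition
        SELECT id, event_date, event_date AS mindate, seqnum
        FROM numbered
        WHERE seqnum = 1
        UNION ALL
        -- step: carry the partition's first date forward while the next
        -- row is within 2 days of it, otherwise start a new partition
        SELECT n.id, n.event_date,
               CASE WHEN n.event_date <= date(cte.mindate, '+2 days')
                    THEN cte.mindate ELSE n.event_date
               END,
               n.seqnum
        FROM cte
        JOIN numbered n ON n.seqnum = cte.seqnum + 1
    )
    SELECT DENSE_RANK() OVER (ORDER BY mindate) AS partition_num,
           id, event_date
    FROM cte
    ORDER BY id
    """
).fetchall()

for row in result:
    print(row)
```

Because each step looks only at the partition's first date, evenly spaced dates such as every second day no longer chain into one oversized partition.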

Calculate Date difference between two consecutive rows

I have a table which contains datetime rows like below.
ID | DateTime
1 | 12:00
2 | 12:02
3 | 12:03
4 | 12:04
5 | 12:05
6 | 12:10
I want to identify those rows where there is a 'gap' of 5 minutes between rows (for example, row 5 and 6).
I know that we need to use DATEDIFF, but how can I only get those rows which are consecutive with each other?
You can use LAG, LEAD window functions for this:
SELECT ID
FROM (
    SELECT ID, [DateTime],
           DATEDIFF(mi, LAG([DateTime]) OVER (ORDER BY ID), [DateTime]) AS prev_diff,
           DATEDIFF(mi, [DateTime], LEAD([DateTime]) OVER (ORDER BY ID)) AS next_diff
    FROM mytable) AS t
WHERE prev_diff >= 5 OR next_diff >= 5
Output:
ID
==
5
6
Note: The above query assumes that order is defined by ID field. You can easily substitute this field with any other field that specifies order in your table.
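The LAG/LEAD comparison can be tried out with Python's sqlite3 module. This is only a sketch under assumptions: SQLite has no DATEDIFF, so the minute difference is computed from strftime('%s', ...) seconds, and the column name dt is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite 3.25+ for window functions
conn.execute("CREATE TABLE mytable (id INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "12:00"), (2, "12:02"), (3, "12:03"),
     (4, "12:04"), (5, "12:05"), (6, "12:10")],
)

gap_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM (
            SELECT id,
                   -- minutes since the previous row (NULL for the first row)
                   (CAST(strftime('%s', dt) AS INTEGER)
                    - CAST(strftime('%s', LAG(dt) OVER (ORDER BY id)) AS INTEGER)) / 60
                       AS prev_diff,
                   -- minutes until the next row (NULL for the last row)
                   (CAST(strftime('%s', LEAD(dt) OVER (ORDER BY id)) AS INTEGER)
                    - CAST(strftime('%s', dt) AS INTEGER)) / 60
                       AS next_diff
            FROM mytable
        ) AS t
        WHERE prev_diff >= 5 OR next_diff >= 5
        ORDER BY id
        """
    )
]

print(gap_ids)
```

The NULL differences on the first and last rows fail both comparisons, so those rows are only returned when their other neighbour is at least 5 minutes away.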
You might try this (I'm not sure if it's really fast)
SELECT cur.datetime AS current_datetime,
       prev.datetime AS previous_datetime,
       DATEDIFF(minute, prev.datetime, cur.datetime) AS gap
FROM my_table cur
JOIN my_table prev
  ON prev.datetime < cur.datetime
 AND NOT EXISTS (SELECT *
                 FROM my_table others
                 WHERE others.datetime < cur.datetime
                   AND others.datetime > prev.datetime);
Update: on SQL Server 2012 or later, use LAG:
DECLARE @tbl TABLE(ID INT, T TIME)
INSERT INTO @tbl VALUES
 (1,'12:00')
,(2,'12:02')
,(3,'12:03')
,(4,'12:04')
,(5,'12:05')
,(6,'12:10');
WITH TimesWithDifferenceToPrevious AS
(
    SELECT ID
          ,T
          ,LAG(T) OVER(ORDER BY T) AS prev
          ,DATEDIFF(MI,LAG(T) OVER(ORDER BY T),T) AS MinuteDiff
    FROM @tbl
)
SELECT *
FROM TimesWithDifferenceToPrevious
WHERE ABS(MinuteDiff) >=5
The result
6 12:10:00.0000000 12:05:00.0000000 5

Select rows based on criteria from within group

I have the following table:
pk_positions  ass_pos_id  underlying  entry_date
1             1           abc         2016-03-14
2             1           xyz         2016-03-17
3                         tlt         2016-03-18
4             4           ujf         2016-03-21
5             4           dks         2016-03-23
6             4           dqp         2016-03-26
I need to select one row per ass_pos_id which has the earliest entry_date. Rows which do not have a value for ass_pos_id are not included.
In other words, for each non null ass_pos_id group, select the row which has the earliest entry_date
The following is the desired result:
pk_positions  ass_pos_id  underlying  entry_date
1             1           abc         2016-03-14
4             4           ujf         2016-03-21
You could use the row_number window function:
SELECT pk_positions, ass_pos_id, underlying, entry_date
FROM (SELECT pk_positions, ass_pos_id, underlying, entry_date,
             ROW_NUMBER() OVER (PARTITION BY ass_pos_id
                                ORDER BY entry_date ASC) rn
      FROM mytable
      WHERE ass_pos_id IS NOT NULL) t
WHERE rn = 1
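The query can be checked against the sample data with Python's sqlite3 module. This is a minimal sketch; SQLite is assumed here only because it ships with Python, and the row with no ass_pos_id is stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite 3.25+ for window functions
conn.execute(
    "CREATE TABLE mytable ("
    "pk_positions INTEGER, ass_pos_id INTEGER, underlying TEXT, entry_date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, 1, "abc", "2016-03-14"),
        (2, 1, "xyz", "2016-03-17"),
        (3, None, "tlt", "2016-03-18"),  # no ass_pos_id, must be excluded
        (4, 4, "ujf", "2016-03-21"),
        (5, 4, "dks", "2016-03-23"),
        (6, 4, "dqp", "2016-03-26"),
    ],
)

earliest = conn.execute(
    """
    SELECT pk_positions, ass_pos_id, underlying, entry_date
    FROM (SELECT pk_positions, ass_pos_id, underlying, entry_date,
                 ROW_NUMBER() OVER (PARTITION BY ass_pos_id
                                    ORDER BY entry_date ASC) AS rn
          FROM mytable
          WHERE ass_pos_id IS NOT NULL) t
    WHERE rn = 1
    ORDER BY pk_positions
    """
).fetchall()

for row in earliest:
    print(row)
```

If two rows in a group could share the earliest entry_date, adding a second ORDER BY key such as pk_positions would make the choice deterministic.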