SQL query group by nearby timestamp


I have a table with a timestamp column. I would like to be able to group by an identifier column (e.g. cusip), sum over another column (e.g. quantity), but only for rows that are within 30 seconds of each other, i.e. not in fixed 30 second bucket intervals. Given the data:
cusip| quantity| timestamp
============|=========|=============
BE0000310194| 100| 16:20:49.000
BE0000314238| 50| 16:38:38.110
BE0000314238| 50| 16:46:21.323
BE0000314238| 50| 16:46:35.323
I would like to write a query that returns:
cusip| quantity
============|=========
BE0000310194| 100
BE0000314238| 50
BE0000314238| 100
Edit:
In addition, it would greatly simplify things if I could also get the MIN(timestamp) out of the query.

Starting from Sean G's solution, I removed the GROUP BY over the complete table and readjusted a few parts for Oracle SQL.
First, after finding the previous timestamp, each row that starts a new group (no previous timestamp, or a previous timestamp more than 30 seconds earlier) gets its own id as self_parent_id; every other row gets NULL.
Then each row without a self_parent_id takes the nearest earlier non-null self_parent_id, so that all rows of a cusip that are chained within 30 seconds of each other fall under one group.
Since there is a CUSIP column, I assumed the dataset would be large market transaction data. Instead of using GROUP BY on the complete table, the query partitions by cusip and the final parent_id for better performance.
SELECT
id,
sub.parent_id,
sub.cusip,
timestamp,
quantity,
sum(sub.quantity) OVER(
PARTITION BY cusip, parent_id
) sum_quantity,
MIN(sub.timestamp) OVER(
PARTITION BY cusip, parent_id
) min_timestamp
FROM
(
SELECT
base_sub.*,
CASE
WHEN base_sub.self_parent_id IS NOT NULL THEN
base_sub.self_parent_id
ELSE
LAG(base_sub.self_parent_id) IGNORE NULLS OVER(
PARTITION BY cusip
ORDER BY
timestamp, id
)
END parent_id
FROM
(
SELECT
c.*,
CASE
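-- Caveat: EXTRACT(SECOND ...) returns only the seconds field of the interval,
-- so this test is reliable only for gaps shorter than one minute.
WHEN nvl(abs(EXTRACT(SECOND FROM to_timestamp(previous_timestamp, 'yyyy/mm/dd hh24:mi:ss') - to_timestamp(timestamp, 'yyyy/mm/dd hh24:mi:ss'))), 31) > 30 THEN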
id
ELSE
NULL
END self_parent_id
FROM
(
SELECT
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
LAG(my_table.timestamp) OVER(
PARTITION BY my_table.cusip
ORDER BY
my_table.timestamp, my_table.id
) previous_timestamp
FROM
my_table
) c
) base_sub
) sub

The following may be helpful to you.
This groups rows into fixed 30-second periods starting from a given time, here '2012-01-01 00:00:00'. DATEDIFF counts the number of seconds between the starting time and the timestamp value, which is then divided by 30 (integer division) to get the grouping column.
SELECT MIN(TimeColumn) AS TimeGroup, SUM(Quantity) AS TotalQuantity FROM YourTable
GROUP BY (DATEDIFF(ss, '2012-01-01', TimeColumn) / 30)
Here the minimum timestamp of each group is output as TimeGroup, but you could use the maximum instead, or convert the grouping column value back to a time for display.
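To see what fixed buckets do to the sample rows from the question, here is a minimal Python sketch of the same arithmetic (seconds since a start time, integer-divided by 30). The start time and the calendar date are assumptions, since the question only lists times of day, and the function name bucket is made up for illustration.

```python
from datetime import datetime

# Assumed reference time and calendar date; the question only lists times of day.
START = datetime(2012, 1, 1)

def bucket(ts: datetime, width_seconds: int = 30) -> int:
    """Index of the fixed-width bucket that ts falls into, counted from START."""
    return int((ts - START).total_seconds()) // width_seconds

# The last two rows of the question's sample data.
a = datetime(2012, 1, 1, 16, 46, 21, 323000)
b = datetime(2012, 1, 1, 16, 46, 35, 323000)

# The rows are 14 seconds apart, yet they land in different fixed buckets.
print((b - a).total_seconds(), bucket(a), bucket(b))
```

This is why fixed intervals do not match what the question asks for: two rows within 30 seconds of each other can still straddle a bucket boundary.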

Looking at the above comments, I'm assuming Chris's first scenario is the one you want (all 3 get grouped even though values 1 and 3 are not within 30 seconds of each other, but are each within 30 seconds of value 2). I'm also going to assume that each row in your table has some unique ID called 'id'. You can do the following:
1. Create a new grouping by determining whether the preceding row in your partition is more than 30 seconds behind the current row (i.e. determine whether you need to start a new group or continue the previous one). We'll call that parent_id.
2. Sum quantity over parent_id (plus any other aggregations).
The code could look like this:
select
sub.parent_id,
sub.cusip,
min(sub.timestamp) min_timestamp,
sum(sub.quantity) quantity
from
(
select
base_sub.*,
case
when base_sub.self_parent_id is not null
then base_sub.self_parent_id
else lag(base_sub.self_parent_id) ignore nulls over (
partition by
base_sub.cusip
order by
base_sub.timestamp,
base_sub.id
)
end parent_id
from
(
select
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
lag(my_table.timestamp) over (
partition by
my_table.cusip
order by
my_table.timestamp,
my_table.id
) previous_timestamp,
case
when datediff(
second,
nvl(previous_timestamp, to_date('1900/01/01', 'yyyy/mm/dd')),
my_table.timestamp) > 30
then my_table.id
else null
end self_parent_id
from
my_table
) base_sub
) sub
group by
sub.parent_id,
sub.cusip
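The two steps above can also be sketched end to end on the sample data from the question. This is a minimal sketch, not the answer's exact method: it uses Python's built-in sqlite3 module (SQLite 3.25 or later for window functions), and because SQLite has no LAG ... IGNORE NULLS, it swaps in a running sum of "new group" flags to number the groups. The table name trades, the id column and the calendar date 2012-01-01 are assumptions added for illustration.

```python
import sqlite3

# Sample rows from the question; id and the calendar date are assumed.
ROWS = [
    (1, "BE0000310194", 100, "2012-01-01 16:20:49.000"),
    (2, "BE0000314238", 50, "2012-01-01 16:38:38.110"),
    (3, "BE0000314238", 50, "2012-01-01 16:46:21.323"),
    (4, "BE0000314238", 50, "2012-01-01 16:46:35.323"),
]

QUERY = """
WITH with_prev AS (
    SELECT id, cusip, quantity, ts,
           LAG(ts) OVER (PARTITION BY cusip ORDER BY ts, id) AS prev_ts
    FROM trades
),
flagged AS (
    -- 1 when the row starts a new group: no previous row, or a gap > 30 s
    SELECT *,
           CASE WHEN prev_ts IS NULL
                  OR (julianday(ts) - julianday(prev_ts)) * 86400.0 > 30
                THEN 1 ELSE 0 END AS is_start
    FROM with_prev
),
grouped AS (
    -- running count of group starts = group number within the cusip
    SELECT *,
           SUM(is_start) OVER (PARTITION BY cusip ORDER BY ts, id) AS grp
    FROM flagged
)
SELECT cusip, MIN(ts) AS min_timestamp, SUM(quantity) AS quantity
FROM grouped
GROUP BY cusip, grp
ORDER BY cusip, min_timestamp
"""

def group_nearby(rows):
    """Load rows into an in-memory table and return one row per 30-second chain."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE trades (id INTEGER PRIMARY KEY, cusip TEXT, quantity INTEGER, ts TEXT)"
    )
    con.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result

for row in group_nearby(ROWS):
    print(row)
```

The output has one row per group and also carries the minimum timestamp of each group, as requested in the question's edit.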

Related

Sum over a given time period

The following code gives the total duration that a light has been switched on.
CREATE TABLE switch_times (
id SERIAL PRIMARY KEY,
is1 BOOLEAN,
id_dec INTEGER,
label TEXT,
ts TIMESTAMP WITH TIME ZONE default current_timestamp
);
CREATE VIEW makecount AS
SELECT *, row_number() OVER (PARTITION BY id_dec ORDER BY id) AS count
FROM switch_times;
select c1.label, SUM(c2.ts-c1.ts) AS sum
from
(makecount AS c1
inner join
makecount AS c2 ON c2.count = c1.count + 1)
where c2.is1=FALSE AND c1.id_dec = c2.id_dec AND c2.is1 != c1.is1
GROUP BY c1.label;
Link to working demo https://dbfiddle.uk/ZR8pLEBk
Any suggestions on how to alter the code so that it gives the sum over a specific time period, say the 25th, during which all three lights were switched on for 12 hours? Problem 1: the current code gives the total sum over all time, as follows. Problem 2: all durations that have not ended are disregarded, because there is no switch-off time.
label sum
0x29 MH3 1 day 03:00:00
0x2B MH1 1 day 01:00:00
0x2C MH2 1 day 02:00:00
The expected result covers just the given date, i.e.
label sum
0x29 MH3 12:00:00
0x2B MH1 12:00:00
0x2C MH2 12:00:00

Assuming the following (which should be defined in the question):
Postgres 15.
The table is big, many rows per label, performance matters, we can add indexes.
All columns are actually NOT NULL, you just forgot to declare columns as such.
Every "light" has a distinct id_dec and a distinct label. Having both in switch_times is redundant. (Normalization!)
A light is "switched on" if the most recent earlier entry has is1 IS TRUE. Else it's considered "off".
The order of rows is established by ts, not by id as used in your query (typically incorrect).
Consecutive entries do not have to change the state.
No duplicate entries for (id_dec, ts). (There is a unique index enforcing that.)
There is no minimum or maximum time interval between entries.
"The 25th" is supposed to mean tstzrange '[2022-11-25 0:0+02, 2022-11-26 0:0+02)' (Note the time zone offsets.)
You want results for all labels that were switched on at all during the given time interval.
There is a table "labels" with one distinct entry per relevant light. If you don't have one, create it.
Indexes
Have at least these indexes to make everything fast:
CREATE INDEX ON switch_times (id_dec, ts DESC);
CREATE INDEX ON switch_times (ts);
Optional step to create table labels
CREATE TABLE labels AS
WITH RECURSIVE cte AS (
(
SELECT id_dec, label
FROM switch_times
ORDER BY 1
LIMIT 1
)
UNION ALL
(
SELECT s.id_dec, s.label
FROM cte c
JOIN switch_times s ON s.id_dec > c.id_dec
ORDER BY 1
LIMIT 1
)
)
TABLE cte;
ALTER TABLE labels
ADD PRIMARY KEY (id_dec)
, ALTER COLUMN label SET NOT NULL
, ADD CONSTRAINT label_uni UNIQUE (label)
;
Why this way? See:
Optimize GROUP BY query to retrieve latest row per user
Main query
WITH bounds(lo, hi) AS (
SELECT timestamptz '2022-11-25 0:0+02' -- enter time interval here *once*
, timestamptz '2022-11-26 0:0+02'
)
, snapshot AS (
SELECT id_dec, label, is1, ts
FROM switch_times s, bounds b
WHERE s.ts >= b.lo
AND s.ts < b.hi
UNION ALL -- must be separate
SELECT s.*
FROM labels l
JOIN LATERAL ( -- latest earlier entry
SELECT s.id_dec, s.label, s.is1, b.lo AS ts -- cut off at lower bound
FROM switch_times s, bounds b
WHERE s.id_dec = l.id_dec
AND s.ts < b.lo
ORDER BY s.ts DESC
LIMIT 1
) s ON s.is1 -- ... if it's "on"
)
SELECT label, sum(z - a) AS duration
FROM (
SELECT label
, lag(is1, 1, false) OVER w AS last_is1
, lag(ts) OVER w AS a
, ts AS z
FROM snapshot
WINDOW w AS (PARTITION BY label ORDER BY ts ROWS UNBOUNDED PRECEDING)
) sub
WHERE last_is1
GROUP BY 1;
CTE bounds is an optional convenience feature to enter lower and upper bound of your time interval once.
CTE snapshot collects all rows of interest, which consists of
all rows inside the time interval (1st leg of UNION ALL query)
the latest earlier row if it was "on" (2nd leg of UNION ALL query)
We need to gather 2. separately to cover corner cases where the light was switched on earlier and there is no entry for the given time interval! But we can replace the timestamp to the lower bound immediately.
The final query gets the previous (is1, ts) for every row in a subquery, defaulting to "off" if there was no previous row.
Finally, sum up the intervals in the outer SELECT. Only sum intervals that start in the "on" state (no matter the final state).
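The same logic can be sketched outside the database for a single light. The following is a minimal Python illustration of the steps just described, not a replacement for the query: the function name on_duration and the sample events are made up, and timestamps are naive datetimes for brevity. Like the query, it only sums intervals that end at a later row inside the window.

```python
from datetime import datetime, timedelta

def on_duration(events, lo, hi):
    """Sum the "on" time of one light inside [lo, hi).

    events: list of (ts, is_on) tuples for a single light, sorted by ts.
    """
    snapshot = []
    # Latest earlier entry: if it is "on", carry it in, cut off at the lower bound.
    earlier = [e for e in events if e[0] < lo]
    if earlier and earlier[-1][1]:
        snapshot.append((lo, True))
    # All rows inside the time interval.
    snapshot += [e for e in events if lo <= e[0] < hi]

    total = timedelta(0)
    # Pair each row with its predecessor; count the gap only if the predecessor is "on".
    for (prev_ts, prev_on), (ts, _) in zip(snapshot, snapshot[1:]):
        if prev_on:
            total += ts - prev_ts
    return total

lo, hi = datetime(2022, 11, 25), datetime(2022, 11, 26)

# Hypothetical light: switched on the evening before, off at noon on the 25th.
events = [
    (datetime(2022, 11, 24, 18, 0), True),
    (datetime(2022, 11, 25, 12, 0), False),
]
print(on_duration(events, lo, hi))
```

Consecutive entries with the same state are handled naturally, because every gap is judged only by the state of the row that starts it.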
Related:
Jump SQL gap over specific condition & proper lead() usage

My assumption:
The actual on-time is the time difference between a row where is1 is true and the next row where is1 is false, ordered by ts.
The query below calculates the total on-time between two dates:
select
id_dec ,
label,
sum(to_timestamp(nexttime)-ts) as time_def
from
(
select
id_dec,
"label",
ts,
is1,
case
when is1 = true then lead(extract(epoch from ts))over(partition by id_dec
order by
id_dec ,
ts asc)
else 0
end nexttime
from
switch_times
where
ts between '2022-11-24' and '2022-11-28'
) as a
where
nexttime <> 0
group by
id_dec,
label

How to select k-th record per field in a single SQL query

Please help me with the following problem. I have already spent one week trying to put all the logic into one SQL query but still have no elegant result. I hope the SQL experts can give me a hint.
I have a table with 4 fields: date, expire_month, expire_year and value. The primary key is defined on the first 3 fields. Thus, for a given date several values are present with different expire_month and expire_year. I need to choose one of these values for every date present in the table.
For example, when I execute a query:
SELECT date, expire_month, expire_year, value FROM futures
WHERE date = '1989-12-01' ORDER BY expire_year, expire_month;
I get a list of values for the same date sorted by expiry (months are coded with letters):
1989-12-01 Z 1989 408.25
1989-12-01 H 1990 408.25
1989-12-01 K 1990 389
1989-12-01 N 1990 359.75
1989-12-01 U 1990 364.5
1989-12-01 Z 1990 375
The correct single value for that date is the k-th record from the top. For example, if k is 2 then the «correct single» record would be:
1989-12-01 H 1990 408.25
How can I select these «correct single» values for every date in my table?

You can do it with rank():
select t.date, t.expire_month, t.expire_year, t.value from (
select *,
rank() over(partition by date order by expire_year, expire_month) rn
from futures
) t
where t.rn = 2
The column rn in the subquery is the rank of the row within its date partition, ordered by expiry. Change 2 to the rank you want.
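Here is a small runnable sketch of this approach using Python's built-in sqlite3 module (SQLite 3.25 or later for window functions) and the rows from the question. It uses row_number() rather than rank(); with a primary key on (date, expire_month, expire_year) there are no ties, so both give the same numbering. The function name kth_per_date is made up for illustration.

```python
import sqlite3

# Rows from the question.
ROWS = [
    ("1989-12-01", "Z", 1989, 408.25),
    ("1989-12-01", "H", 1990, 408.25),
    ("1989-12-01", "K", 1990, 389.0),
    ("1989-12-01", "N", 1990, 359.75),
    ("1989-12-01", "U", 1990, 364.5),
    ("1989-12-01", "Z", 1990, 375.0),
]

QUERY = """
SELECT date, expire_month, expire_year, value
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY date
               ORDER BY expire_year, expire_month
           ) AS rn
    FROM futures
)
WHERE rn = ?
ORDER BY date
"""

def kth_per_date(rows, k):
    """Return the k-th record (1-based, by expiry) for every date."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE futures ("
        "date TEXT, expire_month TEXT, expire_year INTEGER, value REAL, "
        "PRIMARY KEY (date, expire_month, expire_year))"
    )
    con.executemany("INSERT INTO futures VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY, (k,)).fetchall()
    con.close()
    return result

print(kth_per_date(ROWS, 2))
```

Dates with fewer than k records simply produce no row.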

While forpas's answer is the better one (though I think I'd use row_number() instead of rank() here), window functions are a fairly recent addition to SQLite (in 3.25). If you're stuck on an old version and can't upgrade, here's an alternative:
SELECT date, expire_month, expire_year, value
FROM futures AS f
WHERE (date, expire_month, expire_year) =
(SELECT f2.date, f2.expire_month, f2.expire_year
FROM futures AS f2
WHERE f.date = f2.date
ORDER BY f2.expire_year, f2.expire_month
LIMIT 1 OFFSET 1)
ORDER BY date;
The OFFSET value is 1 less than the Kth row - so 1 for the second row, 2 for the third row, etc.
It executes a correlated subquery for every row in the table, though, which isn't ideal. Hopefully your composite primary key columns are in the order date, expire_year, expire_month, which will help a lot by eliminating the need for additional sorting in it.

You can try the following query.
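select * from
(
-- row_number() numbers the rows after sorting; ROWNUM would be assigned before the ORDER BY is applied
SELECT row_number() OVER (ORDER BY expire_year, expire_month) seq, date1, expire_month, expire_year, value FROM testtable
WHERE date1 = to_date('1989-12-01','yyyy-mm-dd')
)
where seq=2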

SQL Server iterating through time series data

I am using SQL Server and wondering if it is possible to iterate through time series data until specific condition is met and based on that label my data in other table?
For example, let's say I have a table like this:
Id Date Some_kind_of_event
+--+----------+------------------
1 |2018-01-01|dsdf...
1 |2018-01-06|sdfs...
1 |2018-01-29|fsdfs...
2 |2018-05-10|sdfs...
2 |2018-05-11|fgdf...
2 |2018-05-12|asda...
3 |2018-02-15|sgsd...
3 |2018-02-16|rgw...
3 |2018-02-17|sgs...
3 |2018-02-28|sgs...
What I want is to calculate, for each key, the difference between adjacent events and find out whether any two adjacent events are more than 10 days apart. If yes, I want to stop iterating for that specific key and put the label 'inactive' in my other table, otherwise 'active'. After we finish with one key, we start with another.
So for example id = 1 would get the label 'inactive' because there are two adjacent dates that differ by more than 10 days. The final result would look like this:
Id Label
+--+----------+
1 |inactive
2 |active
3 |inactive
Any ideas how to do that? Is it possible to do it with SQL?

When working with a DBMS you need to get away from the idea of thinking iteratively. Instead you need to try and think in sets. "Instead of thinking about what you want to do to a row, think about what you want to do to a column."
If I understand correctly, is this what you're after?
CREATE TABLE SomeEvent (ID int, EventDate date, EventName varchar(10));
INSERT INTO SomeEvent
VALUES (1,'20180101','dsdf...'),
(1,'20180106','sdfs...'),
(1,'20180129','fsdfs..'),
(2,'20180510','sdfs...'),
(2,'20180511','fgdf...'),
(2,'20180512','asda...'),
(3,'20180215','sgsd...'),
(3,'20180216','rgw....'),
(3,'20180217','sgs....'),
(3,'20180228','sgs....');
GO
WITH Gaps AS(
SELECT *,
DATEDIFF(DAY,LAG(EventDate) OVER (PARTITION BY ID ORDER BY EventDate),EventDate) AS EventGap
FROM SomeEvent)
SELECT ID,
CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;
GO
DROP TABLE SomeEvent;
GO
This assumes you are using SQL Server 2012+, as it uses the LAG function, and SQL Server 2008 has less than 12 months of any kind of support.
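For readers without a SQL Server instance at hand, the same set-based idea (LAG per ID, then the largest gap per ID) can be tried with Python's built-in sqlite3 module. This is a sketch under assumptions: it needs SQLite 3.25 or later, uses julianday() differences in place of DATEDIFF, and the table and function names are made up for illustration.

```python
import sqlite3

# (id, event_date) pairs from the question.
EVENTS = [
    (1, "2018-01-01"), (1, "2018-01-06"), (1, "2018-01-29"),
    (2, "2018-05-10"), (2, "2018-05-11"), (2, "2018-05-12"),
    (3, "2018-02-15"), (3, "2018-02-16"), (3, "2018-02-17"), (3, "2018-02-28"),
]

QUERY = """
WITH gaps AS (
    -- days between each event and the previous event of the same id
    SELECT id,
           julianday(event_date)
             - julianday(LAG(event_date) OVER (PARTITION BY id ORDER BY event_date))
             AS gap_days
    FROM some_event
)
SELECT id,
       CASE WHEN MAX(gap_days) > 10 THEN 'inactive' ELSE 'active' END AS label
FROM gaps
GROUP BY id
ORDER BY id
"""

def label_ids(events):
    """Label each id 'inactive' if any two adjacent events are > 10 days apart."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE some_event (id INTEGER, event_date TEXT)")
    con.executemany("INSERT INTO some_event VALUES (?, ?)", events)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result

print(label_ids(EVENTS))
```

No loop over rows or early exit is needed: the aggregate over all gaps of an ID answers the question in one pass.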

Try this. Note, replace #MyTable with your actual table.
WITH Diffs AS (
SELECT
Id
,DATEDIFF(DAY,[Date],LEAD([Date]) OVER (PARTITION BY [Id] ORDER BY [Date])) Diff
FROM #MyTable)
SELECT
Id
,CASE WHEN MAX(Diff) > 10 THEN 'Inactive' ELSE 'Active' END AS Label
FROM Diffs
GROUP BY Id

Just to share another approach (without a CTE).
SELECT
ID
, CASE WHEN SUM(TotalDays) = (MAX(CNT) - 1) THEN 'Active' ELSE 'Inactive' END Label
FROM (
SELECT
ID
, EventDate
, CASE WHEN DATEDIFF(DAY, EventDate, LEAD(EventDate) OVER(PARTITION BY ID ORDER BY EventDate)) <= 10 THEN 1 ELSE 0 END TotalDays
, COUNT(ID) OVER(PARTITION BY ID) CNT
FROM EventsTable
) D
GROUP BY ID
The method counts how many records each ID has, and computes TotalDays from the date difference (in days) between the current date and the next one: if the difference is at most 10 days the row scores 1, otherwise 0.
It then compares: if the sum of TotalDays equals the number of records for that ID minus one, the ID is labeled Active, otherwise Inactive.

SQL - Different sum levels in one select with where clause

I have 2 tables. One has the original amount, which remains static. The second table has a list of partial amounts applied over time against the original amount in the first table.
DB Tables:
***memotable***
ID [primary, unique]
Amount (Original Amount)
***transtable***
ID [many IDs in transtable to single ID in memotable]
AmountUsed (amount applied)
ApplyDate (date applied)
I would like to find, in a single select, the ID, amount used since last week (ApplyDate > 2011-04-21), amount used to date.
The only rows that should appear in the result is when an amount has been used since last week (ApplyDate > 2011-04-21).
I'm stuck on trying to get the sum for the amount used to date, since that needs to include AmountUsed values that are outside of when ApplyDate > 2011-04-21.

It is possible to avoid subselects in this case:
SELECT
ID,
AmountUsedSinceLastWeek = SUM(CASE WHEN ApplyDate > '4/21/2011' THEN AmountUsed END),
AmountUsedToDate = SUM(AmountUsed)
FROM TransTable
GROUP BY ID
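A runnable sketch of this conditional-aggregation idea, using Python's built-in sqlite3 module. Two things here are assumptions that go beyond the answer: the sample rows are invented for illustration, and a HAVING clause is added so that only IDs with usage since last week appear, as the question requires. Dates are stored as ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

# Invented sample rows: (ID, AmountUsed, ApplyDate).
TRANS = [
    (1, 10.0, "2011-04-10"),
    (1, 5.0, "2011-04-25"),
    (2, 7.0, "2011-04-01"),   # no usage since last week, so ID 2 is excluded
    (3, 2.0, "2011-04-22"),
    (3, 3.0, "2011-04-23"),
]

QUERY = """
SELECT ID,
       SUM(CASE WHEN ApplyDate > '2011-04-21' THEN AmountUsed END)
           AS AmountUsedSinceLastWeek,
       SUM(AmountUsed) AS AmountUsedToDate
FROM transtable
GROUP BY ID
HAVING MAX(ApplyDate) > '2011-04-21'   -- keep only IDs used since last week
ORDER BY ID
"""

def usage_summary(rows):
    """Return (ID, amount used since last week, amount used to date) per ID."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE transtable (ID INTEGER, AmountUsed REAL, ApplyDate TEXT)")
    con.executemany("INSERT INTO transtable VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result

print(usage_summary(TRANS))
```

Both sums come from a single scan of the table; the CASE expression restricts only the first one.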

Since you want to limit it to rows that happened since last week, but also want to include the total to date, I think the most efficient method would be to use sub-selects...
SELECT
lastWeek.ID,
lastWeek.AmountUsedSinceLastWeek,
toDate.AmountUsedToDate
FROM
(
SELECT
ID,
SUM(AmountUsed) AS AmountUsedSinceLastWeek
FROM TransTable
WHERE ApplyDate > '4/21/2011'
GROUP BY ID
) lastWeek JOIN
(
SELECT
ID,
SUM(AmountUsed) AS AmountUsedToDate
FROM TransTable
GROUP BY ID
) toDate ON lastWeek.ID = toDate.ID

MySQL "ORDER BY" the amount of rows with the same value for a certain column?

I have a table called trends_points, this table has the following columns:
id (the unique id of the row)
userId (the id of the user that has entered this in the table)
term (a word)
time (a unix timestamp)
Now, I'm trying to run a query on this table which will get the rows in a specific time frame ordered by how many times the column term appears in the table during the specific timeframe...So for example if the table has the following rows:
id | userId | term | time
------------------------------------
1 28 new year 1262231638
2 37 new year 1262231658
3 1 christmas 1262231666
4 34 new year 1262231665
5 12 christmas 1262231667
6 52 twitter 1262231669
I'd like the rows to come out ordered like this:
new year
christmas
twitter
This is because "new year" exists three times in the timeframe, "christmas" exists twice and "twitter" is only in one row.
So far I've assumed it's a simple WHERE for the specific timeframe part of the query and a GROUP BY to stop the same term from coming up twice in the list.
This makes the following query:
SELECT *
FROM `trends_points`
WHERE ( time >= <time-period_start>
AND time <= <time-period_end> )
GROUP BY `term`
Does anyone know how I'd do the final part of the query? (Ordering the query's results by how many rows contain the same "term" column value..).

Use:
SELECT tp.term,
COUNT(*) AS term_count
FROM trends_points tp
WHERE tp.time BETWEEN <time-period_start> AND <time-period_end>
GROUP BY tp.term
ORDER BY term_count DESC, tp.term
See this question about why to use BETWEEN vs using the >=/<= operators.
Keep in mind there can be ties. The ORDER BY here falls back to sorting alphabetically by term value when this happens, but there could be other criteria.
Also, if you want to additionally limit the number of rows/terms coming back you can add the LIMIT clause to the end of the query. For example, this query will return the top five terms:
SELECT tp.term,
COUNT(*) AS term_count
FROM trends_points tp
WHERE tp.time BETWEEN <time-period_start> AND <time-period_end>
GROUP BY tp.term
ORDER BY term_count DESC, tp.term
LIMIT 5
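To check the query against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module instead of MySQL; the GROUP BY / ORDER BY part behaves the same way in both. The function name top_terms and the chosen time bounds are assumptions for illustration.

```python
import sqlite3

# Rows from the question: (id, userId, term, time).
ROWS = [
    (1, 28, "new year", 1262231638),
    (2, 37, "new year", 1262231658),
    (3, 1, "christmas", 1262231666),
    (4, 34, "new year", 1262231665),
    (5, 12, "christmas", 1262231667),
    (6, 52, "twitter", 1262231669),
]

QUERY = """
SELECT tp.term, COUNT(*) AS term_count
FROM trends_points tp
WHERE tp.time BETWEEN ? AND ?
GROUP BY tp.term
ORDER BY term_count DESC, tp.term
LIMIT ?
"""

def top_terms(rows, start, end, limit=5):
    """Return (term, count) pairs in the time frame, most frequent first."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE trends_points (id INTEGER, userId INTEGER, term TEXT, time INTEGER)")
    con.executemany("INSERT INTO trends_points VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY, (start, end, limit)).fetchall()
    con.close()
    return result

print(top_terms(ROWS, 1262231638, 1262231669))
```

Narrowing the time frame changes the counts, and therefore the order, since only rows inside the frame are grouped.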

Quick answer:
SELECT
term, count(*) as thecount
FROM
mytable
WHERE
(...)
GROUP BY
term
ORDER BY
thecount DESC

SELECT t.term
FROM trends_points t
WHERE t.time >= <time-period_start> AND t.time <= <time-period_end>
GROUP BY t.term
ORDER BY COUNT(t.term) DESC

COUNT() will give you the number of rows in the group, so just order by that.
SELECT * FROM `trends_points`
WHERE ( `time` >= <time-period_start> AND `time` <= <time-period_end> )
GROUP BY `term`
ORDER BY COUNT(`term`) DESC
