Sum over a given time period - sql

The following code gives the total duration that a light has been switched on.
CREATE TABLE switch_times (
id SERIAL PRIMARY KEY,
is1 BOOLEAN,
id_dec INTEGER,
label TEXT,
ts TIMESTAMP WITH TIME ZONE default current_timestamp
);
CREATE VIEW makecount AS
SELECT *, row_number() OVER (PARTITION BY id_dec ORDER BY id) AS count
FROM switch_times;
select c1.label, SUM(c2.ts-c1.ts) AS sum
from
(makecount AS c1
inner join
makecount AS c2 ON c2.count = c1.count + 1)
where c2.is1=FALSE AND c1.id_dec = c2.id_dec AND c2.is1 != c1.is1
GROUP BY c1.label;
Link to working demo https://dbfiddle.uk/ZR8pLEBk
Any suggestions on how to alter the code so that it gives the sum over a given specific time period, say the 25th, during which all three lights were switched on for 12 hours? Problem 1: the current code gives the total sum, as follows. Problem 2: all durations that have not ended are disregarded, because there is no switch-off time.
label sum
0x29 MH3 1 day 03:00:00
0x2B MH1 1 day 01:00:00
0x2C MH2 1 day 02:00:00
The expected result is for just a given date, i.e.
label sum
0x29 MH3 12:00:00
0x2B MH1 12:00:00
0x2C MH2 12:00:00

Assuming the following (which should be defined in the question):
Postgres 15.
The table is big, many rows per label, performance matters, we can add indexes.
All columns are actually NOT NULL, you just forgot to declare them as such (see the DDL sketch after this list).
Evey "light" has a distinct id_dec and a distinct label. Having both in switch_times is redundant. (Normalization!)
A light is "switched on" if the most recent earlier entry has is1 IS TRUE. Else it's considered "off".
The order of rows is established by ts, not by id as used in your query (typically incorrect).
Consecutive entries do not have to change the state.
No duplicate entries for (id_dec, ts). (There is a unique index enforcing that.)
There is no minimum or maximum time interval between entries.
"The 25th" is supposed to mean tstzrange '[2022-11-25 0:0+02, 2022-11-26 0:0+02)' (Note the time zone offsets.)
You want results for all labels that were switched on at all during the given time interval.
There is a table "labels" with one distinct entry per relevant light. If you don't have one, create it.
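Under these assumptions the table definition could look like the following; a DDL sketch only (not from the original question), with the NOT NULL constraints and the unique index on (id_dec, ts) made explicit:
CREATE TABLE switch_times (
  id     serial PRIMARY KEY
, is1    boolean NOT NULL
, id_dec integer NOT NULL
, label  text NOT NULL
, ts     timestamptz NOT NULL DEFAULT current_timestamp
, UNIQUE (id_dec, ts)  -- enforces: no duplicate entries per (id_dec, ts)
);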
Indexes
Have at least these indexes to make everything fast:
CREATE INDEX ON switch_times (id_dec, ts DESC);
CREATE INDEX ON switch_times (ts);
Optional step to create table labels
CREATE TABLE labels AS
WITH RECURSIVE cte AS (
(
SELECT id_dec, label
FROM switch_times
ORDER BY 1
LIMIT 1
)
UNION ALL
(
SELECT s.id_dec, s.label
FROM cte c
JOIN switch_times s ON s.id_dec > c.id_dec
ORDER BY 1
LIMIT 1
)
)
TABLE cte;
ALTER TABLE labels
ADD PRIMARY KEY (id_dec)
, ALTER COLUMN label SET NOT NULL
, ADD CONSTRAINT label_uni UNIQUE (label)
;
Why this way? See:
Optimize GROUP BY query to retrieve latest row per user
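If the table is not huge, or this is a one-off, a plain DISTINCT builds the same labels table, just without the speedup of the recursive CTE; a sketch:
CREATE TABLE labels AS
SELECT DISTINCT id_dec, label
FROM switch_times;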
Main query
WITH bounds(lo, hi) AS (
SELECT timestamptz '2022-11-25 0:0+02' -- enter time interval here *once*
, timestamptz '2022-11-26 0:0+02'
)
, snapshot AS (
SELECT id_dec, label, is1, ts
FROM switch_times s, bounds b
WHERE s.ts >= b.lo
AND s.ts < b.hi
UNION ALL -- must be separate
SELECT s.*
FROM labels l
JOIN LATERAL ( -- latest earlier entry
SELECT s.id_dec, s.label, s.is1, b.lo AS ts -- cut off at lower bound
FROM switch_times s, bounds b
WHERE s.id_dec = l.id_dec
AND s.ts < b.lo
ORDER BY s.ts DESC
LIMIT 1
) s ON s.is1 -- ... if it's "on"
)
SELECT label, sum(z - a) AS duration
FROM (
SELECT label
, lag(is1, 1, false) OVER w AS last_is1
, lag(ts) OVER w AS a
, ts AS z
FROM snapshot
WINDOW w AS (PARTITION BY label ORDER BY ts ROWS UNBOUNDED PRECEDING)
) sub
WHERE last_is1
GROUP BY 1;
fiddle
CTE bounds is an optional convenience feature to enter the lower and upper bounds of your time interval once.
CTE snapshot collects all rows of interest, consisting of
all rows inside the time interval (1st leg of UNION ALL query)
the latest earlier row if it was "on" (2nd leg of UNION ALL query)
We need to gather 2. separately to cover corner cases where the light was switched on earlier and there is no entry for the given time interval! But we can replace the timestamp with the lower bound immediately.
The final query gets the previous (is1, ts) for every row in a subquery, defaulting to "off" if there was no previous row.
Finally, sum up intervals in the outer SELECT. Only sum what's switched on at the beginning (no matter the final state).
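To illustrate the defaulting: lag(is1, 1, false) is the three-argument form of lag(), which returns the third argument instead of NULL when there is no previous row in the partition. A minimal sketch with made-up values:
SELECT ts, is1
, lag(is1, 1, false) OVER (ORDER BY ts) AS last_is1  -- false for the first row
FROM (
VALUES (timestamptz '2022-11-25 08:00+02', true)
     , (timestamptz '2022-11-25 20:00+02', false)
) v(ts, is1);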
Related:
Jump SQL gap over specific condition & proper lead() usage

My assumption
Actual on time is the time difference between an is1 = true row and the next is1 = false row, ordered by ts.
The query below calculates the total sum of on time between two dates:
select
id_dec ,
label,
sum(to_timestamp(nexttime)-ts) as time_def
from
(
select
id_dec,
"label",
ts,
is1,
case
when is1 = true then lead(extract(epoch from ts)) over (partition by id_dec order by ts asc)
else 0
end nexttime
from
switch_times
where
ts between '2022-11-24' and '2022-11-28'
) as a
where
nexttime <> 0
group by
id_dec,
label


Exclude nulls from postgres window function

For each row, I want to take the average of the last 20 non-NULL values using a window function.
The window function takes the preceding 20 rows, including the NULL ones, and calculates the average, while I want the average of the last 20 non-NULL rows.
What I have tried:
WITH adv_calculated AS (
SELECT security_id AS sec_id,
date AS pd,
AVG(volume)
OVER (PARTITION BY (security_id) ORDER BY date ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) AS adv
FROM tbl_financial_index_data
WHERE volume IS NOT NULL
GROUP BY security_id, date
)
UPDATE tbl_financial_index_data
SET adv = amc.adv
FROM adv_calculated amc
WHERE amc.sec_id = security_id
AND amc.pd = date
This solution works well for all rows where volume is NOT NULL, but it does not calculate the adv/average for the rows where volume is NULL.
Then, for the rows where adv and volume are NULL, I have to run this query, which is really slow:
UPDATE tbl_financial_index_data
SET average_daily_volume =
(SELECT avg(t.volume)
FROM (
SELECT a.volume
FROM tbl_financial_index_data a
WHERE a.security_id = tbl_financial_index_data.security_id
AND a.date::date <= tbl_financial_index_data.date::date
AND a.volume IS NOT NULL
ORDER BY a.date DESC
LIMIT 21
) t)
WHERE volume IS NULL;
I want to avoid using the second query and calculate ADV for all rows using first query (because it is much faster).
Simply omit the WHERE condition WHERE volume IS NOT NULL, then you should get what you want.
You can nest the query in an outer query to remove the undesired values later:
WITH adv_calculated AS (
SELECT ...
FROM (SELECT AVG(volume) OVER (... ROWS BETWEEN ... AND ...),
...
FROM tbl_financial_index_data
GROUP BY security_id, date
) AS subq
WHERE volume IS NOT NULL
)
SELECT ...
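That skeleton, filled in against the table from the question, could read as follows; a sketch only, assuming (security_id, date) identifies a row, so the GROUP BY from the original can be dropped. Note that AVG() ignores NULL inputs within the frame; the outer WHERE then removes the NULL-volume rows before the update:
WITH adv_calculated AS (
SELECT sec_id, pd, adv
FROM (SELECT security_id AS sec_id,
             date AS pd,
             volume,
             AVG(volume) OVER (PARTITION BY security_id
                               ORDER BY date
                               ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) AS adv
      FROM tbl_financial_index_data
     ) AS subq
WHERE volume IS NOT NULL
)
UPDATE tbl_financial_index_data
SET adv = amc.adv
FROM adv_calculated amc
WHERE amc.sec_id = security_id
AND amc.pd = date;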
This is just a workaround.
Okay, so I found the solution.
If the volume is NULL for some row, then the adv for that row is going to be equal to the adv of the previous row with a non-NULL volume, so I had to find some way to carry forward the previous non-NULL adv for rows where it is NULL.
I was able to find a way to do that from this answer.
Here's the code to carry forward the non-null value:
WITH temp_adv_filled_values AS (
SELECT security_id,
date,
FIRST_VALUE(adv) OVER W AS adv
FROM (
SELECT security_id,
date,
adv,
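-- running count of non-NULL volumes: the sum stays flat across NULL rows,
-- so each NULL row falls into the same partition as the latest non-NULL row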
SUM(CASE WHEN volume IS NULL THEN 0 ELSE 1 END)
OVER (PARTITION BY security_id ORDER BY date ) AS value_partition
FROM tbl_financial_index_data
) AS q
WINDOW W AS (PARTITION BY security_id, value_partition ORDER BY date)
)
UPDATE tbl_financial_index_data tfid
SET adv = tmcfv.adv
FROM temp_adv_filled_values tmcfv
WHERE tmcfv.security_id = tfid.security_id
AND tmcfv.date = tfid.date;

Date convolution thru timestamped transition table -- how to

We need to convolute every day from a point in the past up to now against a list of timestamped, boolean device transitions. The final output should be a table that has a date:device_id entry for every day it is online (otherwise no entry for that date).
Here is an example transition table for a single device (it was posted as a screenshot):
To generate the convolution calendar:
calendar AS (
SELECT day
FROM UNNEST (GENERATE_DATE_ARRAY('2011-05-15', CURRENT_DATE())) AS day
),
Then, to generate at least a table that only has calendar dates on or after each transition event, so they can subsequently be ranked and the most recent transition chosen (CROSS JOIN here -- yuck!):
joined_with_cal AS (
SELECT
cal.day as online_date,
otr.when_changed,
otr.device_id,
otr.is_online,
otr.rank_by_date
FROM
calendar AS cal
CROSS JOIN
ordered_transitions otr
WHERE
cal.day >= DATE(otr.when_changed)
),
Then, the code that attempts to rank and choose the most recent record in the partition by timestamp (when_changed or rank_by_date -- neither seems to work):
SELECT
online_date,
when_changed,
device_id,
is_online,
rank_by_date,
FROM (
SELECT
online_date,
when_changed,
device_id,
is_online,
rank_by_date,
RANK() OVER (PARTITION BY device_id ORDER BY rank_by_date ASC) as final_rank
FROM
joined_with_cal
)
WHERE
final_rank = 1 AND
-- online_date < '2017-08-01' AND
device_id = 419609
ORDER BY
online_date,
when_changed,
device_id
However, this doesn't work and is obviously ugly.
Can someone suggest a correct, elegant solution?
Thanks in advance!
@Mikhail: thanks for looking at it, and sorry my explanation was not clearer.
After a discussion with a colleague, I ended up using a self-join which seems to work:
trans_as_range_not_first AS (
SELECT
t1.device_id,
t1.rank_by_when,
t2.when_changed as online_start,
t1.when_changed as online_stop,
t1.account_id,
t1.account_name,
t1.server_type
FROM
ordered_trans AS t1 -- lower in rank index, later in time
LEFT JOIN
ordered_trans AS t2 -- greater in rank index, earlier in time
ON
t1.device_id = t2.device_id AND
t1.rank_by_when+1 = t2.rank_by_when -- current and next row
WHERE
t1.is_online = 0 AND t2.is_online = 1
GROUP BY
device_id,
rank_by_when,
online_start,
online_stop,
account_id,
account_name,
server_type
),
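To then get the requested date:device_id rows, the ranges above can be joined against the calendar. A sketch continuing the WITH chain, in the same CROSS JOIN style used above; DISTINCT collapses overlapping ranges (online_days is an assumed name, not from the original post):
online_days AS (
SELECT DISTINCT
cal.day AS online_date,
r.device_id
FROM
calendar AS cal
CROSS JOIN
trans_as_range_not_first AS r
WHERE
cal.day BETWEEN DATE(r.online_start) AND DATE(r.online_stop)
)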

Recognize if not enough resources during timeframe

I need an idea for how to solve the following problem.
Let's say I have one group with a given timeframe (8:00-12:00) and I can assign resources (people) to it. Each resource can have a custom timeframe (like 9-10, 9-12, 8-12 etc.) and could be assigned multiple times.
Tables
Groups
ID,
TITLE,
START_TIME,
END_TIME,
REQUIRED_PEOPLE:INTEGER
PeopleAssignments
ID,
USER_ID,
GROUP_ID,
START_TIME,
END_TIME
Now I have the rule that at any given time during the group timeframe there have to be, say, 4 people assigned. Otherwise I want to get a warning.
I am working with Ruby & SQL (Postgres) here.
Is there an elegant way to do this without iterating through the whole timeframe and checking if count(assignments) > REQUIRED_PEOPLE?
You can solve this with only SQL too (if you are interested in such answers).
Range types offer great functions and operators to calculate with.
These solutions will give you rows when there are sub-ranges where some people are missing from a given group (and they will tell you exactly which sub-range it is & how many people are missing from the required number).
The easy way:
You wanted to try something similar to this. You'll need to pick some interval on which the count() is based (I picked 5 minutes):
select g.id group_id, i start_time, i + interval '5 minutes' end_time, g.required_people - count(a.id)
from groups g
cross join generate_series(g.start_time, g.end_time, interval '5 minutes') i
left join people_assignments a on a.group_id = g.id
and tsrange(a.start_time, a.end_time) && tsrange(i, i + interval '5 minutes')
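-- the && overlap test sits in the join condition so that 5-minute slots with
-- no overlapping assignment survive the left join (count(a.id) = 0 for them)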
group by g.id, i
having g.required_people - count(a.id) > 0
order by g.id, i
But note that you won't be able to detect missing sub-ranges when they are shorter than 5 minutes. For example, if user1 has an assignment for 11:00-11:56 and user2 has one for 11:59-13:00, they will appear to be "in" the group for 11:00-13:00 (so the missing sub-range of 11:56-11:59 will go unnoticed).
Note: the shorter the interval you pick, the more precise (and slower!) the results will be.
http://rextester.com/GRC64969
The hard way:
You can accumulate the result on-the-fly with custom aggregates or recursive CTEs
with recursive r as (
-- start with "required_people" as "missing_required_people" in the whole range
select 0 iteration,
id group_id,
array[]::int[] used_assignment_ids,
-- build a json map, where keys are the time ranges
-- and values are the number of missing people for that range
jsonb_build_object(tsrange(start_time, end_time), required_people) required_people_per_time_range
from groups
where required_people > 0
and id = 1 -- query parameter
union all
select r.iteration + 1,
r.group_id,
r.used_assignment_ids || a.assignment_id,
d.required_people_per_time_range
from r
-- join a single assignment to the previous iteration, where
-- the assigment's time range overlaps with (at least one) time range,
-- where there is still missing people. when there are no such time range is
-- found in assignments, the "recursion" (which is really just a loop) stops
cross join lateral (
select a.id assignment_id, tsrange(start_time, end_time) time_range
from people_assignments a
cross join (select key::tsrange time_range from jsonb_each(r.required_people_per_time_range)) j
where a.group_id = r.group_id
and a.id <> ALL (r.used_assignment_ids)
and tsrange(start_time, end_time) && j.time_range
limit 1
) a
-- "partition" && accumulate all remaining time ranges with
-- the one found in the previous step
cross join lateral (
-- accumulate "partition" results
select jsonb_object_agg(u.time_range, u.required_people) required_people_per_time_range
from (select key::tsrange time_range, value::int required_people
from jsonb_each_text(r.required_people_per_time_range)) j
cross join lateral (
select u time_range, j.required_people - case when u && a.time_range then 1 else 0 end required_people
-- "partition" the found time range with all existing ones, one-by-one
from unnest(case
when j.time_range @> a.time_range
then array[tsrange(lower(j.time_range), lower(a.time_range)), a.time_range, tsrange(upper(a.time_range), upper(j.time_range))]
when j.time_range && a.time_range
then array[j.time_range * a.time_range, j.time_range - a.time_range]
else array[j.time_range]
end) u
where not isempty(u)
) u
) d
),
-- select only the last iteration
l as (
select group_id, required_people_per_time_range
from r
order by iteration desc
limit 1
)
-- unwind the accumulated json map
select l.group_id, lower(time_range) start_time, upper(time_range) end_time, missing_required_people
from l
cross join lateral (
select key::tsrange time_range, value::int missing_required_people
from jsonb_each_text(l.required_people_per_time_range)
) j
-- select only where there is still some missing people
-- this is optional, if you omit it you'll also see row(s) for sub-ranges where
-- there is enough people in the group (these rows will have zero,
-- or negative amount of "missing_required_people")
where j.missing_required_people > 0
http://rextester.com/GHPD52861
In any case you need to query the number of assignments in the DB. There is no other way to find how many people are assigned to a group.
There might be ways to find the number of assignments, but in the end you have to fire a query to the DB.
@group = Group.find(id)
if @group.people_assignments.count < REQUIRED_PEOPLE
puts 'warning'
end
You can add extra column in group that hold information how many times that group assign to people. In this way one query to server reduced.
@group = Group.find(id)
if @group.count_people_assigned < REQUIRED_PEOPLE
puts 'warning'
end
In the second case count_people_assigned is a column, so no extra query will execute, while in the first case people_assignments is an association, so one extra query will fire.
But in the second case you have to update the group each time you assign people to it. Ultimately an extra query either way; your choice where you want to reduce queries.
My opinion is the second case, as it will happen more rarely than the first.

SQL query group by nearby timestamp

I have a table with a timestamp column. I would like to be able to group by an identifier column (e.g. cusip), sum over another column (e.g. quantity), but only for rows that are within 30 seconds of each other, i.e. not in fixed 30 second bucket intervals. Given the data:
cusip| quantity| timestamp
============|=========|=============
BE0000310194| 100| 16:20:49.000
BE0000314238| 50| 16:38:38.110
BE0000314238| 50| 16:46:21.323
BE0000314238| 50| 16:46:35.323
I would like to write a query that returns:
cusip| quantity
============|=========
BE0000310194| 100
BE0000314238| 50
BE0000314238| 100
Edit:
In addition, it would greatly simplify things if I could also get the MIN(timestamp) out of the query.
Starting from Sean G's solution, I have removed the GROUP BY on the complete table and readjusted a few parts for Oracle SQL.
First, after finding the previous time, assign a self parent ID: a row gets its own id when it starts a new group (including when the previous time is NULL).
Then take the nearest preceding self parent ID, skipping NULLs, so that all entries of a CUSIP within 30 seconds of each other fall under one group.
As there is a CUSIP column, I assumed the dataset is large market transaction data. Instead of using GROUP BY on the complete table, use PARTITION BY cusip and the final group parent ID for better performance.
SELECT
id,
sub.parent_id,
sub.cusip,
timestamp,
quantity,
sum(sub.quantity) OVER(
PARTITION BY cusip, parent_id
) sum_quantity,
MIN(sub.timestamp) OVER(
PARTITION BY cusip, parent_id
) min_timestamp
FROM
(
SELECT
base_sub.*,
CASE
WHEN base_sub.self_parent_id IS NOT NULL THEN
base_sub.self_parent_id
ELSE
LAG(base_sub.self_parent_id) IGNORE NULLS OVER(
PARTITION BY cusip
ORDER BY
timestamp, id
)
END parent_id
FROM
(
SELECT
c.*,
CASE
WHEN nvl(abs((to_date(previous_timestamp, 'yyyy/mm/dd hh24:mi:ss')
- to_date(timestamp, 'yyyy/mm/dd hh24:mi:ss')) * 86400), 31) > 30 THEN
id
ELSE
NULL
END self_parent_id
FROM
(
SELECT
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
LAG(my_table.timestamp) OVER(
PARTITION BY my_table.cusip
ORDER BY
my_table.timestamp, my_table.id
) previous_timestamp
FROM
my_table
) c
) base_sub
) sub
(The input rows and the resulting output were shown as screenshots in the original post.)
The following may be helpful to you.
It groups rows into 30-second periods starting from a given time, here '2012-01-01 00:00:00'. DATEDIFF counts the number of seconds between the timestamp value and the starting time; that is then divided by 30 to get the grouping column.
SELECT MIN(TimeColumn) AS TimeGroup, SUM(Quantity) AS TotalQuantity FROM YourTable
GROUP BY (DATEDIFF(ss, TimeColumn, '2012-01-01') / 30)
Here the minimum timestamp of each group is output as TimeGroup. But you can use the maximum instead, or the grouping column value can even be converted back to a time for display, as sketched below.
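A sketch of that conversion in the same SQL Server syntax (YourTable, TimeColumn and Quantity as above); the bucket number is multiplied back into seconds and added onto the starting time (note the argument order DATEDIFF(ss, start, end)):
SELECT DATEADD(ss, (DATEDIFF(ss, '2012-01-01', TimeColumn) / 30) * 30, '2012-01-01') AS TimeGroup,
SUM(Quantity) AS TotalQuantity
FROM YourTable
GROUP BY DATEDIFF(ss, '2012-01-01', TimeColumn) / 30;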
Looking at the above comments, I'm assuming Chris's first scenario is the one you want (all 3 get grouped even though values 1 and 3 are not within 30 seconds of each other, but are each within 30 seconds of value 2). I'm also going to assume that each row in your table has some unique ID called id. You can do the following:
Create a new grouping, determining whether the preceding row in your partition is more than 30 seconds behind the current row (i.e. determine if you need a new 30-second grouping, or can continue the previous one). We'll call that parent_id.
Sum quantity over parent_id (plus any other aggregations)
The code could look like this
select
sub.parent_id,
sub.cusip,
min(sub.timestamp) min_timestamp,
sum(sub.quantity) quantity
from
(
select
base_sub.*,
case
when base_sub.self_parent_id is not null
then base_sub.self_parent_id
else lag(base_sub.self_parent_id) ignore nulls over (
partition by
base_sub.cusip
order by
base_sub.timestamp,
base_sub.id
)
end parent_id
from
(
select
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
lag(my_table.timestamp) over (
partition by
my_table.cusip
order by
my_table.timestamp,
my_table.id
) previous_timestamp,
case
when datediff(
second,
nvl(previous_timestamp, to_date('1900/01/01', 'yyyy/mm/dd')),
my_table.timestamp) > 30
then my_table.id
else null
end self_parent_id
from
my_table
) base_sub
) sub
group by
sub.parent_id,
sub.cusip

How to do self join on min/max

I am new to SQL queries.
Table is defined as
( symbol varchar,
high int,
low int,
today date,
Primary key (symbol, today)
)
I need to find, for each symbol in a given date range, max(high) and min(low) and the corresponding dates for max(high) and min(low).
It is okay to take the first max date and min date in the given table.
In a given date range some dates may be missing. If the start date is not present then the next available date should be used, and if the last date is not present then the earlier available date should be used.
Data is for one year and around 5000 symbols.
I tried something like this
SELECT a.symbol,
a.maxValue,
a.maxdate,
b.minValue,
b.mindate
FROM (
SELECT table1.symbol, max_a.maxValue, max_a.maxdate
FROM table1
INNER JOIN (
SELECT table1.symbol,
max(table1.high) AS maxValue,
table1.TODAY AS maxdate
FROM table1
GROUP BY table1.symbol
) AS max_a
ON max_a.symbol = table1.symbol
AND table1.today = max_a.maxdate
) AS a
INNER JOIN (
SELECT symbol,
min_b.minValue,
min_b.mindate
FROM table1
INNER JOIN (
SELECT symbol,
min(low) AS minValue,
table1.TODAY AS mindate
FROM table1
GROUP BY table1.symbol
) AS min_b
ON min_b.symbol = table1.symbol
AND table1.today = min_b.mindate
) AS b
ON a.symbol = b.symbol
The first INNER query pre-qualifies, for each symbol, what the low and high values are within the date range provided. After that, it joins back to the original table again (with the same date range criteria), but also adds the qualifier that EITHER the low OR the high matches the MIN() or MAX() from the PreQuery. If so, the row is allowed into the result set.
Now, the result columns. Not knowing which version of SQL you are using, I have the first 3 columns as the "final" values. The 3 columns after that come from the record that qualified by either of the qualifiers. As stocks go up and down all the time, it's possible for the high and/or low values to occur more than once within the same time period. This will include ALL the entries that match the MIN() / MAX() criteria.
select
PreQuery.Symbol,
PreQuery.LowForSymbol,
PreQuery.HighForSymbol,
tFinal.Today as DateOfMatch,
tFinal.Low as DateMatchLow,
tFinal.High as DateMatchHigh
from
( select
t1.symbol,
min( t1.low ) as LowForSymbol,
max( t1.high ) as HighForSymbol
from
table1 t1
where
t1.today between YourFromDateParameter and YourToDateParameter
group by
t1.symbol ) PreQuery
JOIN table1 tFinal
on PreQuery.Symbol = tFinal.Symbol
AND tFinal.today between YourFromDateParameter and YourToDateParameter
AND ( tFinal.Low = LowForSymbol
OR tFinal.High = HighForSymbol )
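If instead you want exactly one row per symbol with both dates on it, a window-function shape avoids the double join; a sketch (YourFromDateParameter / YourToDateParameter are placeholders as above; ties are broken by taking the earliest date):
select
symbol,
min(low) as LowForSymbol,
min(case when low = min_low then today end) as DateOfLow,
max(high) as HighForSymbol,
min(case when high = max_high then today end) as DateOfHigh
from
( select
t1.symbol,
t1.today,
t1.low,
t1.high,
min(t1.low) over (partition by t1.symbol) as min_low,
max(t1.high) over (partition by t1.symbol) as max_high
from
table1 t1
where
t1.today between YourFromDateParameter and YourToDateParameter ) t
group by
symbol;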