PostgreSQL Query with overlapping dates, multiple arrays - sql

Edit: I have edited this question to make it more understandable. Excuse me for any confusion.
I have a temporary table with columns
zone_name, nodeid, nodelabel, nodegainedservice, nodelostservice
Zone1, 3, Windows-SRV1, "2012-11-27 13:10:30+08", "2012-11-27 13:00:40+08"
Zone1, 5, Windows-SRV2, "2012-12-20 13:10:30+08", "2012-12-18 13:00:40+08"
....
....
There are many zones and many nodes, and the same nodes gain and lose service many times.
nodegainedservice means the node has come alive; nodelostservice means the node has gone down.
How could I write a query to fetch each zone's availability over a period?
e.g., Zone1 has Windows-SRV1 and Windows-SRV2. Find how many times, and for how long, Zone1 was down. These servers are replication servers: the zone goes down when all the servers in the zone are down at the same time, and comes back up when any of them comes alive.
Please use the sample data below:
zonename nodeid nodelabel noderegainedservice nodelostservice
Zone1 27 Windows-SRV1 2013-02-21 10:04:56+08 2013-02-21 09:48:48+08
Zone1 27 Windows-SRV1 2013-02-21 10:14:01+08 2013-02-21 10:09:27+08
Zone1 27 Windows-SRV1 2013-02-22 10:26:29+08 2013-02-22 10:24:20+08
Zone1 27 Windows-SRV1 2013-02-22 11:27:24+08 2013-02-22 11:25:15+08
Zone1 27 Windows-SRV1 2013-02-28 16:24:59+08 2013-02-28 15:52:59+08
Zone1 27 Windows-SRV1 2013-02-28 16:56:19+08 2013-02-28 16:40:18+08
Zone1 39 Windows-SRV2 2013-02-21 13:15:53+08 2013-02-21 12:26:04+08
Zone1 39 Windows-SRV2 2013-02-23 13:23:10+08 2013-02-22 10:21:14+08
Zone1 39 Windows-SRV2 2013-02-24 13:35:23+08 2013-02-23 13:33:32+08
Zone1 39 Windows-SRV2 2013-02-26 15:17:25+08 2013-02-25 14:25:51+08
Zone1 39 Windows-SRV2 2013-02-28 18:49:56+08 2013-02-28 15:43:01+08
Zone1 13 Windows-SRV3 2013-02-22 17:23:59+08 2013-02-22 10:19:13+08
Zone1 13 Windows-SRV3 2013-02-28 16:54:27+08 2013-02-28 16:13:48+08
The expected zone_outages output is as follows:
zonename duration from_time to_time
zone1 00:02:09 2013-02-22 10:24:20+08 2013-02-22 10:26:29+08
zone1 00:02:09 2013-02-22 11:25:15+08 2013-02-22 11:27:24+08
zone1 00:11:11 2013-02-28 16:13:48+08 2013-02-28 16:24:59+08
zone1 00:14:09 2013-02-28 16:40:18+08 2013-02-28 16:54:27+08
Note: There could be entries like this, where both timestamps are missing:
Zone2 24 Windows-SRV12 \n \n
In this case Zone2's Windows-SRV12 has never gone down, and Zone2's availability will be 100%.

Have you considered PG 9.2's range type instead of two separate timestamp fields?
http://www.postgresql.org/docs/9.2/static/rangetypes.html
Something like:
CREATE TABLE availability (
zone_name varchar, nodeid int, nodelabel varchar, during tsrange
);
INSERT INTO availability
VALUES ('zone1', 3, 'srv1', '[2013-01-01 14:30, 2013-01-01 15:30)');
Unless I'm mistaken, you'd then be able to work with unions, intersections and such, which should make your work simpler. There are likely also a few aggregate functions I'm unfamiliar with that cater to ranges.
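For instance, a quick sketch against the availability table above: && tests overlap, * computes the intersection, and + the union of two overlapping or adjacent ranges.
select during && '[2013-01-01 15:00, 2013-01-01 16:00)'::tsrange as overlaps,
during * '[2013-01-01 15:00, 2013-01-01 16:00)'::tsrange as intersection
from availability;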
If needed, additionally look into WITH queries and window functions for more complex queries:
http://www.postgresql.org/docs/9.2/static/tutorial-window.html
http://www.postgresql.org/docs/9.2/static/functions-window.html
Some testing reveals that sum() doesn't work with tsrange types.
That being said, here is the SQL schema used in the follow-up queries:
drop table if exists nodes;
create table nodes (
zone int not null,
node int not null,
uptime tsrange
);
-- this requires the btree_gist extension:
-- alter table nodes add exclude using gist (uptime with &&, zone with =, node with =);
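If you want that exclusion constraint enforced (it rejects overlapping uptime ranges for the same zone and node), enabling the extension is a one-liner; a sketch:
create extension if not exists btree_gist;
alter table nodes add exclude using gist (uptime with &&, zone with =, node with =);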
The data (slight variation from your sample):
insert into nodes values
(1, 1, '[2013-02-20 00:00:00, 2013-02-21 09:40:00)'),
(1, 1, '[2013-02-21 09:48:48, 2013-02-21 10:04:56)'),
(1, 1, '[2013-02-21 10:09:27, 2013-02-21 10:14:01)'),
(1, 1, '[2013-02-22 10:24:20, 2013-02-22 10:26:29)'),
(1, 1, '[2013-02-22 11:25:15, 2013-02-22 11:27:24)'),
(1, 1, '[2013-02-28 15:52:59, 2013-02-28 16:24:59)'),
(1, 1, '[2013-02-28 16:40:18, 2013-02-28 16:56:19)'),
(1, 1, '[2013-02-28 17:00:00, infinity)'),
(1, 2, '[2013-02-20 00:00:01, 2013-02-21 12:15:00)'),
(1, 2, '[2013-02-21 12:26:04, 2013-02-21 13:15:53)'),
(1, 2, '[2013-02-22 10:21:14, 2013-02-23 13:23:10)'),
(1, 2, '[2013-02-23 13:33:32, 2013-02-24 13:35:23)'),
(1, 2, '[2013-02-25 14:25:51, 2013-02-26 15:17:25)'),
(1, 2, '[2013-02-28 15:43:01, 2013-02-28 18:49:56)'),
(2, 3, '[2013-02-20 00:00:01, 2013-02-22 09:01:00)'),
(2, 3, '[2013-02-22 10:19:13, 2013-02-22 17:23:59)'),
(2, 3, '[2013-02-28 16:13:48, 2013-02-28 16:54:27)');
Raw data in order (for clarity):
select *
from nodes
order by zone, uptime, node;
Yields:
zone | node | uptime
------+------+-----------------------------------------------
1 | 1 | ["2013-02-20 00:00:00","2013-02-21 09:40:00")
1 | 2 | ["2013-02-20 00:00:01","2013-02-21 12:15:00")
1 | 1 | ["2013-02-21 09:48:48","2013-02-21 10:04:56")
1 | 1 | ["2013-02-21 10:09:27","2013-02-21 10:14:01")
1 | 2 | ["2013-02-21 12:26:04","2013-02-21 13:15:53")
1 | 2 | ["2013-02-22 10:21:14","2013-02-23 13:23:10")
1 | 1 | ["2013-02-22 10:24:20","2013-02-22 10:26:29")
1 | 1 | ["2013-02-22 11:25:15","2013-02-22 11:27:24")
1 | 2 | ["2013-02-23 13:33:32","2013-02-24 13:35:23")
1 | 2 | ["2013-02-25 14:25:51","2013-02-26 15:17:25")
1 | 2 | ["2013-02-28 15:43:01","2013-02-28 18:49:56")
1 | 1 | ["2013-02-28 15:52:59","2013-02-28 16:24:59")
1 | 1 | ["2013-02-28 16:40:18","2013-02-28 16:56:19")
1 | 1 | ["2013-02-28 17:00:00",infinity)
2 | 3 | ["2013-02-20 00:00:01","2013-02-22 09:01:00")
2 | 3 | ["2013-02-22 10:19:13","2013-02-22 17:23:59")
2 | 3 | ["2013-02-28 16:13:48","2013-02-28 16:54:27")
(17 rows)
Nodes available at 2013-02-21 09:20:00:
with upnodes as (
select zone, node, uptime
from nodes
where '2013-02-21 09:20:00'::timestamp <@ uptime
)
select *
from upnodes
order by zone, uptime, node;
Yields:
zone | node | uptime
------+------+-----------------------------------------------
1 | 1 | ["2013-02-20 00:00:00","2013-02-21 09:40:00")
1 | 2 | ["2013-02-20 00:00:01","2013-02-21 12:15:00")
2 | 3 | ["2013-02-20 00:00:01","2013-02-22 09:01:00")
(3 rows)
Nodes available from 2013-02-21 00:00:00 incl to 2013-02-24 00:00:00 excl:
with upnodes as (
select zone, node, uptime
from nodes
where '[2013-02-21 00:00:00, 2013-02-24 00:00:00)'::tsrange && uptime
)
select * from upnodes
order by zone, uptime, node;
Yields:
zone | node | uptime
------+------+-----------------------------------------------
1 | 1 | ["2013-02-20 00:00:00","2013-02-21 09:40:00")
1 | 2 | ["2013-02-20 00:00:01","2013-02-21 12:15:00")
1 | 1 | ["2013-02-21 09:48:48","2013-02-21 10:04:56")
1 | 1 | ["2013-02-21 10:09:27","2013-02-21 10:14:01")
1 | 2 | ["2013-02-21 12:26:04","2013-02-21 13:15:53")
1 | 2 | ["2013-02-22 10:21:14","2013-02-23 13:23:10")
1 | 1 | ["2013-02-22 10:24:20","2013-02-22 10:26:29")
1 | 1 | ["2013-02-22 11:25:15","2013-02-22 11:27:24")
1 | 2 | ["2013-02-23 13:33:32","2013-02-24 13:35:23")
2 | 3 | ["2013-02-20 00:00:01","2013-02-22 09:01:00")
2 | 3 | ["2013-02-22 10:19:13","2013-02-22 17:23:59")
(11 rows)
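Note that ranges straddling the window are returned whole, so durations computed from them include time outside the window. If that matters, the ranges can be clamped via the intersection operator (*); a sketch:
with upnodes as (
select zone, node,
uptime * '[2013-02-21 00:00:00, 2013-02-24 00:00:00)'::tsrange as uptime
from nodes
where '[2013-02-21 00:00:00, 2013-02-24 00:00:00)'::tsrange && uptime
)
select * from upnodes
order by zone, uptime, node;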
Zones available from 2013-02-21 00:00:00 incl to 2013-02-24 00:00:00 excl:
with upnodes as (
select zone, node, uptime
from nodes
where '[2013-02-21 00:00:00, 2013-02-24 00:00:00)'::tsrange && uptime
),
upzones_max as (
select u1.zone, tsrange(lower(u1.uptime), max(upper(u2.uptime))) as uptime
from upnodes as u1
join upnodes as u2 on u2.zone = u1.zone and u2.uptime && u1.uptime
group by u1.zone, lower(u1.uptime)
),
upzones as (
select u1.zone, tsrange(min(lower(u2.uptime)), upper(u1.uptime)) as uptime
from upzones_max as u1
join upzones_max as u2 on u2.zone = u1.zone and u2.uptime && u1.uptime
group by u1.zone, upper(u1.uptime)
)
select zone, uptime, upper(uptime) - lower(uptime) as duration
from upzones
order by zone, uptime;
Yields:
zone | uptime | duration
------+-----------------------------------------------+-----------------
1 | ["2013-02-20 00:00:00","2013-02-21 12:15:00") | 1 day 12:15:00
1 | ["2013-02-21 12:26:04","2013-02-21 13:15:53") | 00:49:49
1 | ["2013-02-22 10:21:14","2013-02-23 13:23:10") | 1 day 03:01:56
1 | ["2013-02-23 13:33:32","2013-02-24 13:35:23") | 1 day 00:01:51
2 | ["2013-02-20 00:00:01","2013-02-22 09:01:00") | 2 days 09:00:59
2 | ["2013-02-22 10:19:13","2013-02-22 17:23:59") | 07:04:46
(6 rows)
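sum() has no tsrange overload, but once the ranges are merged the durations are ordinary intervals; reusing the upzones CTE from the query above unchanged, a per-zone total is then (a sketch):
select zone,
count(*) as up_periods,
sum(upper(uptime) - lower(uptime)) as total_uptime
from upzones
group by zone
order by zone;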
There might be a better way to write the latter query if you write (or find) a custom aggregate function that sums overlapping range types -- the non-trivial issue I ran into was isolating an adequate group by clause; I ended up settling on two nested group by clauses.
The queries could also be rewritten to accommodate your current schema, either by replacing the uptime field by an expression such as tsrange(start_date, end_date), or by writing a view that does so.
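For instance, a sketch of such a view over the question's table (the original columns are timestamptz, so tstzrange is the matching range type; note each row there describes a downtime, not an uptime):
create view node_downtime as
select zone_name, nodeid, nodelabel,
tstzrange(nodelostservice, nodegainedservice) as downtime
from temptable;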

DROP table if exists temptable;
CREATE TABLE temptable
(
zone_name character varying(255),
nodeid integer,
nodelabel character varying(255),
nodegainedservice timestamp with time zone,
nodelostservice timestamp with time zone
);
INSERT INTO tempTable (zone_name, nodeid, nodelabel, nodegainedservice, nodelostservice) VALUES
('Zone1', 27, 'Windows-SRV1', '2013-02-21 10:04:56+08', '2013-02-21 09:48:48+08'),
('Zone1', 27, 'Windows-SRV1', '2013-02-21 10:14:01+08', '2013-02-21 10:09:27+08'),
('Zone1', 27, 'Windows-SRV1', '2013-02-22 10:26:29+08', '2013-02-22 10:24:20+08'),
('Zone1', 27, 'Windows-SRV1', '2013-02-22 11:27:24+08', '2013-02-22 11:25:15+08'),
('Zone1', 27, 'Windows-SRV1', '2013-02-28 16:24:59+08', '2013-02-28 15:52:59+08'),
('Zone1', 27, 'Windows-SRV1', '2013-02-28 16:56:19+08', '2013-02-28 16:40:18+08'),
('Zone1', 39, 'Windows-SRV2', '2013-02-21 13:15:53+08', '2013-02-21 12:26:04+08'),
('Zone1', 39, 'Windows-SRV2', '2013-02-23 13:23:10+08', '2013-02-22 10:21:14+08'),
('Zone1', 39, 'Windows-SRV2', '2013-02-24 13:35:23+08', '2013-02-23 13:33:32+08'),
('Zone1', 39, 'Windows-SRV2', '2013-02-26 15:17:25+08', '2013-02-25 14:25:51+08'),
('Zone1', 39, 'Windows-SRV2', '2013-02-28 18:49:56+08', '2013-02-28 15:43:01+08'),
('Zone2', 13, 'Windows-SRV3', '2013-02-22 17:23:59+08', '2013-02-22 10:19:13+08'),
('Zone2', 13, 'Windows-SRV3', '2013-02-28 16:54:27+08', '2013-02-28 16:13:48+08'),
('Zone2', 14, 'Windows-SRV4', '2013-02-22 11:02:56+08', '2013-02-22 10:01:48+08');
with downodes as (
select zone_name, nodeid, nodelostservice, nodegainedservice
from temptable
WHERE (nodelostservice, nodegainedservice) OVERLAPS ('Wed Feb 20 00:00:00 +0800 2013'::TIMESTAMP, 'Fri Mar 01 00:00:00 +0800 2013'::TIMESTAMP)
),
downzones_max as (
select downodes1.zone_name, downodes1.nodeid, downodes1.nodelostservice, min(downodes2.nodegainedservice) as nodegainedservice
from downodes as downodes1
join downodes as downodes2 on downodes2.zone_name = downodes1.zone_name and ((downodes2.nodelostservice, downodes2.nodegainedservice) OVERLAPS (downodes1.nodelostservice, downodes1.nodegainedservice))
group by downodes1.zone_name, downodes1.nodeid, downodes1.nodelostservice
),
downzones as(
select downodes1.zone_name, downodes1.nodeid, max(downodes2.nodelostservice) as nodelostservice, downodes1.nodegainedservice
from downzones_max as downodes1
join donezones_max as downodes2 on downodes2.zone_name = downodes1.zone_name and ((downodes2.nodelostservice, downodes2.nodegainedservice) OVERLAPS (downodes1.nodelostservice, downodes1.nodegainedservice))
group by downodes1.zone_name, downodes1.nodeid, downodes1.nodegainedservice
),
zone_outages as(
SELECT
zone_name,
nodelostservice,
nodegainedservice,
nodegainedservice - nodelostservice AS duration,
CAST('1' AS INTEGER) as outage_counter
FROM downzones
GROUP BY zone_name, nodelostservice, nodegainedservice
HAVING COUNT(*) > 1
ORDER BY zone_name, nodelostservice)
select
zone_name,
EXTRACT(epoch from (SUM(duration) / (greatest(1, SUM(outage_counter))))) AS average_duration_seconds,
SUM(outage_counter) AS outage_count
FROM zone_outages
GROUP BY zone_name
ORDER BY zone_name;
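To report availability as a percentage instead, one option (a sketch: keep the CTE chain above as-is and swap in this final select; the nine-day figure matches the 2013-02-20 to 2013-03-01 window used in downodes):
select
zone_name,
100 * (1 - extract(epoch from SUM(duration)) / extract(epoch from interval '9 days')) AS availability_pct
FROM zone_outages
GROUP BY zone_name
ORDER BY zone_name;
Zones with no recorded outages won't show up here at all; to list them at 100%, left join from a distinct set of zone names.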

Related

Postgresql compare two rows recursively

I want to write a query where I can track the downgraded versions for each id.
So here is the table:
id version ts
1 3 2021-09-01 10:47:50+00
1 5 2021-09-05 10:47:50+00
1 1 2021-09-11 10:47:50+00
2 2 2021-09-11 10:47:50+00
2 6 2021-09-15 10:47:50+00
3 2 2021-09-01 10:47:50+00
3 4 2021-09-05 10:47:50+00
3 6 2021-09-15 10:47:50+00
3 1 2021-09-16 10:47:50+00
I want to print out something like this:
id:1 downgraded their version from 5 to 1 at 2021-09-11 10:47:50+00
id:3 downgraded their version from 6 to 1 at 2021-09-16 10:47:50+00
So when I run the query the output should be:
id version downgraded_to ts
1 5 1 2021-09-11 10:47:50+00
3 6 1 2021-09-16 10:47:50+00
but I'm completely lost here.
Does it make sense to handle this situation in Postgresql? Is it possible to do it?
You may use the lead analytic function to get the next version and compare it with the current version, assuming that the version is of a numeric type.
with next_vers as (
select t.*, lead(version) over(partition by id order by ts asc) as next_version
from(values
(1, 3, timestamp '2021-09-01 10:47:50'),
(1, 5, timestamp '2021-09-05 10:47:50'),
(1, 1, timestamp '2021-09-11 10:47:50'),
(2, 2, timestamp '2021-09-11 10:47:50'),
(2, 6, timestamp '2021-09-15 10:47:50'),
(3, 2, timestamp '2021-09-01 10:47:50'),
(3, 4, timestamp '2021-09-05 10:47:50'),
(3, 6, timestamp '2021-09-15 10:47:50'),
(3, 1, timestamp '2021-09-16 10:47:50')
) as t(id, version, ts)
)
select *
from next_vers
where version > next_version;
Yields:
id | version | ts | next_version
-: | ------: | :------------------ | -----------:
1 | 5 | 2021-09-05 10:47:50 | 1
3 | 6 | 2021-09-15 10:47:50 | 1
db<>fiddle here
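Note that ts in this output is the timestamp of the higher version, while the requested output wants the time of the downgrade itself; carrying lead(ts) alongside fixes that. A sketch, with mytable standing in for your actual table:
with next_vers as (
select t.*,
lead(version) over(partition by id order by ts asc) as next_version,
lead(ts) over(partition by id order by ts asc) as next_ts
from mytable t
)
select id, version, next_version as downgraded_to, next_ts as ts
from next_vers
where version > next_version;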

Is there a way to extract data from a map(varchar, varchar) column in SQL?

The data is stored as map(varchar, varchar) and looks like this:
Date Info ID
2020-06-10 {"Price":"102.45", "Time":"09:31", "Symbol":"AAPL"} 10
2020-06-10 {"Price":"10.28", "Time":"12:31", "Symbol":"MSFT"} 10
2020-06-11 {"Price":"12.45", "Time":"09:48", "Symbol":"T"} 10
Is there a way to split up the info column and return a table where each entry has its own column?
Something like this:
Date Price Time Symbol ID
2020-06-10 102.45 09:31 AAPL 10
2020-06-10 10.28 12:31 MSFT 10
Note, there is the potential for the time column to not appear in every entry. For example, an entry can look like this:
Date Info ID
2020-06-10 {"Price":"10.28", "Symbol":"MSFT"} 10
In this case, I would like it to just be filled with a NaN value.
Thanks
You can use the subscript operator ([]) or the element_at function to access the values in the map. The difference between the two is that [] will fail with an error if the key is missing from the map.
WITH data(dt, info, id) AS (VALUES
(DATE '2020-06-10', map_from_entries(ARRAY[('Price', '102.45'), ('Time', '09:31'), ('Symbol','AAPL')]), 10),
(DATE '2020-06-10', map_from_entries(ARRAY[('Price', '10.28'), ('Time', '12:31'), ('Symbol','MSFT')]), 10),
(DATE '2020-06-11', map_from_entries(ARRAY[('Price', '12.45'), ('Time', '09:48'), ('Symbol','T')]), 10),
(DATE '2020-06-12', map_from_entries(ARRAY[('Price', '20.99'), ('Symbol','X')]), 10))
SELECT
dt AS "date",
element_at(info, 'Price') AS price,
element_at(info, 'Time') AS time,
element_at(info, 'Symbol') AS symbol,
id
FROM data
Yields:
date | price | time | symbol | id
------------+--------+-------+--------+----
2020-06-10 | 102.45 | 09:31 | AAPL | 10
2020-06-10 | 10.28 | 12:31 | MSFT | 10
2020-06-11 | 12.45 | 09:48 | T | 10
2020-06-12 | 20.99 | NULL | X | 10
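For contrast, the subscript operator errors out as soon as it hits a row whose map lacks the key; a sketch reusing the same data CTE from above:
SELECT
dt AS "date",
info['Time'] AS time, -- [] raises an error on the 2020-06-12 row, whose map has no 'Time' key
id
FROM data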
This answers the original version of the question.
If that is really a string, you can use regular expressions:
select t.*,
regexp_extract(info, '"Price":"([^"]*)"', 1) as price,
regexp_extract(info, '"Symbol":"([^"]*)"', 1) as symbol,
regexp_extract(info, '"Time":"([^"]*)"', 1) as time
from t;

Running total of positive and negative numbers where the sum cannot go below zero

This is an SQL question.
I have a column of numbers which can be positive or negative, and I'm trying to figure out a way to have a running sum of the column, but where the total cannot go below zero.
Date | Number | Desired | Actual
2020-01-01 | 8 | 8 | 8
2020-01-02 | 11 | 19 | 19
2020-01-03 | 30 | 49 | 49
2020-01-04 | -10 | 39 | 39
2020-01-05 | -12 | 27 | 27
2020-01-06 | -9 | 18 | 18
2020-01-07 | -26 | 0 | -8
2020-01-08 | 5 | 5 | -3
2020-01-09 | -23 | 0 | -26
2020-01-10 | 12 | 12 | -14
2020-01-11 | 14 | 26 | 0
I have tried a number of different window functions on this, but haven't found a way to prevent the running total from going into negative numbers.
Any help would be greatly appreciated.
EDIT - Added a date column to indicate the ordering
Unfortunately, there is no way to do this without cycling through the records one-by-one. That, in turn, requires something like a recursive CTE.
with recursive t as (
select t.*, row_number() over (order by date) as seqnum
from mytable t
),
cte as (
select NULL::int as number, 0 as desired, 0 as seqnum
union all
select t.number,
(case when cte.desired + t.number < 0 then 0
else cte.desired + t.number
end),
cte.seqnum + 1
from cte join
t
on t.seqnum = cte.seqnum + 1
)
select cte.*
from cte
where cte.number is not null;
I would recommend this approach only if your data is rather small. But then again, if you have to do this, there are not many alternatives other than going through the table row-by-agonizing-row.
Here is a db<>fiddle (using Postgres).
You can use a CASE expression and the SIGN function to do so…
CASE SIGN(my computed expression) WHEN -1 THEN 0 ELSE my computed expression END AS Actual
This can be done via a USER DEFINED TABLE FUNCTION (UDTF) to "manage" the state you want to carry:
CREATE OR REPLACE FUNCTION non_neg_sum(val float) RETURNS TABLE (out_sum float)
LANGUAGE JAVASCRIPT AS
'{
processRow: function (row, rowWriter) {
this.sum += row.VAL;
if(this.sum < 0)
this.sum = 0;
rowWriter.writeRow({OUT_SUM: this.sum})
},
initialize: function() {
this.sum = 0;
}
}';
And used like so:
WITH input AS
(
SELECT *
FROM VALUES ('2020-01-01', 8, 8),
('2020-01-02', 11, 19 ),
('2020-01-03', 30, 49 ),
('2020-01-04',-10, 39 ),
('2020-01-05',-12, 27 ),
('2020-01-06', -9, 18 ),
('2020-01-07',-26, 0 ),
('2020-01-08', 5, 5 ),
('2020-01-09',-23, 0 ),
('2020-01-10', 12, 12 ),
('2020-01-11', 14, 26 ) d(day,num,wanted)
)
SELECT d.*
,sum(d.num)over(order by day) AS simple_sum
,j.*
FROM input AS d,
TABLE(non_neg_sum(d.num::float) OVER (ORDER BY d.day)) j
ORDER BY day
;
gives the results:
DAY NUM WANTED SIMPLE_SUM OUT_SUM
2020-01-01 8 8 8 8
2020-01-02 11 19 19 19
2020-01-03 30 49 49 49
2020-01-04 -10 39 39 39
2020-01-05 -12 27 27 27
2020-01-06 -9 18 18 18
2020-01-07 -26 0 -8 0
2020-01-08 5 5 -3 5
2020-01-09 -23 0 -26 0
2020-01-10 12 12 -14 12
2020-01-11 14 26 0 26
Another UDF solution:
select d, x, conditional_sum(x) from values
('2020-01-01', 8),
('2020-01-02', 11),
('2020-01-03', 30),
('2020-01-04', -10),
('2020-01-05', -12),
('2020-01-06', -9),
('2020-01-07', -26),
('2020-01-08', 5),
('2020-01-09', -23),
('2020-01-10', 12),
('2020-01-11', 14)
t(d,x)
order by d;
where conditional_sum is defined as:
create or replace function conditional_sum(X float)
returns float
language javascript
volatile
as
$$
if (!('sum' in this)) this.sum = 0
return this.sum = (X+this.sum)<0 ? 0 : this.sum+X
$$;
Demo:
WITH input AS
( SELECT *
FROM (VALUES
('2020-01-01', 8, 8),
('2020-01-02', 11, 19 ),
('2020-01-03', 30, 49 ),
('2020-01-04',-10, 39 ),
('2020-01-05',-12, 27 ),
('2020-01-06', -9, 18 ),
('2020-01-07',-26, 0 ),
('2020-01-08', 5, 5 ),
('2020-01-09',-23, 0 ),
('2020-01-10', 12, 12 ),
('2020-01-11', 14, 26 ),
('2020-01-12', 3, 26 )) AS d (day,num,wanted)
)
SELECT *, sum(num)over(order by day) AS CUM_SUM,
CASE SIGN(sum(num)over(order by day))
WHEN -1 THEN 0
ELSE sum(num)over(order by day)
END AS Actual
FROM input
ORDER BY day;
Return :
day num wanted CUM_SUM Actual
---------- ----------- ----------- ----------- -----------
2020-01-01 8 8 8 8
2020-01-02 11 19 19 19
2020-01-03 30 49 49 49
2020-01-04 -10 39 39 39
2020-01-05 -12 27 27 27
2020-01-06 -9 18 18 18
2020-01-07 -26 0 -8 0
2020-01-08 5 5 -3 0
2020-01-09 -23 0 -26 0
2020-01-10 12 12 -14 0
2020-01-11 14 26 0 0
2020-01-12 3 26 3 3
I added one more row to your test values… to demonstrate that the final conditional sum is 3.

Determine the end time continuously based on the average duration

I'm trying to predict the start / end time of an order's processes in SQL. I have determined the average duration for processes from the past.
The processes run in several parallel rows (RNr) and rows are independent of each other. Each row can have 1-30 processes (PNr) that have different durations. The duration of a process may vary and is known only as an average duration.
After one process is completed, the next automatically starts.
So PNr 1 finish = PNr 2 start.
The start time of the first process in each row is known at the beginning and is the same for each row.
When some processes are completed, the times are known and should be used to calculate the more accurate prediction of upcoming processes.
How can I predict the time when a process will be started or stopped?
I used a large subquery to get this table.
RNr PNr Duration_avg_h Start Finish
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14
1 2 262 2019-06-06 16:33:14 NULL
1 3 51 NULL NULL
1 4 504 NULL NULL
1 5 29 NULL NULL
2 1 1 2019-06-06 16:32:11 NULL
2 2 124 NULL NULL
2 3 45 NULL NULL
2 4 89 NULL NULL
2 5 19 NULL NULL
2 6 1565 NULL NULL
2 7 24 NULL NULL
Now I want to find the values for the prediction.
SELECT
RNr,
PNr,
Duration_avg_h,
Start,
Finish,
Predicted_Start = CASE
WHEN Start IS NULL
THEN DATEADD(HH,LAG(Duration_avg_h, 1,NULL) OVER (ORDER BY RNr,PNr), LAG(Start, 1,NULL) OVER (ORDER BY RNr,PNr))
ELSE Start END,
Predicted_Finish = CASE
WHEN Finish IS NULL
THEN DATEADD(HH,Duration_avg_h,Start)
ELSE Finish END,
SUM(Duration_avg_h) over (PARTITION BY RNr ORDER BY RNr, PNr ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Duration_row_h
FROM (...)
ORDER BY RNr, PNr
I tried LAG(), but with that I only get values one row away. I also came to no conclusion with "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW".
RNr PNr Duration_avg_h Start Finish Predicted_Start Predicted_Finish Duration_row_h
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14 2019-06-06 16:32:11 2019-06-06 16:33:14 1
1 2 262 2019-06-06 16:33:14 NULL 2019-06-06 16:33:14 2019-06-17 14:33:14 263
1 3 51 NULL NULL 2019-06-17 14:33:14 NULL 314
1 4 504 NULL NULL NULL NULL 818
1 5 29 NULL NULL NULL NULL 847
2 1 1 2019-06-06 16:32:11 NULL 2019-06-06 16:32:11 2019-06-06 17:32:11 1
2 2 124 NULL NULL 2019-06-06 17:32:11 NULL 125
2 3 45 NULL NULL NULL NULL 170
2 4 89 NULL NULL NULL NULL 259
2 5 19 NULL NULL NULL NULL 278
So can somebody help me fill the columns Predicted_Start and Predicted_Finish?
LAG only works if all your rows have values. For this use case you need to cascade the results from one row to another. One way of doing this is with a self join to get running totals
--Sample Data
DECLARE @dataset TABLE
(
RNr INT
,PNr INT
,Duration_avg_h INT
,START DATETIME
,Finish DATETIME
)
INSERT INTO @dataset
(
RNr
,PNr
,Duration_avg_h
,START
,Finish
)
VALUES
(1, 1, 1, '2019-06-06 16:32:11',NULL)
,(1, 2, 262, NULL,NULL)
,(1, 3, 51, NULL,NULL)
,(1, 4, 504, NULL,NULL)
,(1, 5, 29, NULL,NULL)
,(2, 1, 1, '2019-06-06 16:32:11', NULL)
,(2, 2, 124, NULL,NULL)
,(2, 3, 45, NULL,NULL)
,(2, 4, 89, NULL,NULL)
,(2, 5, 19, NULL,NULL)
,(2, 6, 1565, NULL,NULL)
,(2, 7, 24, NULL,NULL)
SELECT
d.RNr,
d.PNr,
d.Duration_avg_h,
d.Start,
d.Finish,
--SUM() gives us the total time up to and including this step
--take off the current step and you get the total time of all the previous steps
--this can give us our start time, or when the previous step ended.
SUM(running_total.Duration_avg_h) - d.Duration_avg_h AS running_total_time,
--MIN() gives us the lowest start time we have pre process.
MIN(running_total.Start) AS min_start,
ISNULL(
d.Start
,DATEADD(HH,SUM(running_total.Duration_avg_h) - d.Duration_avg_h,MIN(running_total.Start) )
) AS Predicted_Start,
ISNULL(
d.Finish
,DATEADD(HH,SUM(running_total.Duration_avg_h),MIN(running_total.Start) )
) AS Predicted_Finish
FROM @dataset AS d
LEFT JOIN @dataset AS running_total
ON d.RNr = running_total.RNr
AND
--the running total for all steps.
running_total.PNr <= d.PNr
GROUP BY
d.RNr,
d.PNr,
d.Duration_avg_h,
d.Start,
d.Finish
ORDER BY
RNr,
PNr
This code will not work once you have actual finish times unless you update the Duration_avg_h to be the actual hours taken.
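On SQL Server 2012 and later, the same cascading totals can also be had without the self join by using windowed aggregates; a sketch over the same @dataset:
SELECT
d.RNr,
d.PNr,
d.Duration_avg_h,
d.Start,
d.Finish,
ISNULL(d.Start,
DATEADD(HH,
SUM(d.Duration_avg_h) OVER (PARTITION BY d.RNr ORDER BY d.PNr
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - d.Duration_avg_h, -- previous steps only
MIN(d.Start) OVER (PARTITION BY d.RNr))) AS Predicted_Start,
ISNULL(d.Finish,
DATEADD(HH,
SUM(d.Duration_avg_h) OVER (PARTITION BY d.RNr ORDER BY d.PNr
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), -- including the current step
MIN(d.Start) OVER (PARTITION BY d.RNr))) AS Predicted_Finish
FROM @dataset AS d
ORDER BY d.RNr, d.PNr;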
Jonathan, thanks for your help.
Your idea of using "MIN (running_total.Start) AS min_start," brought me to the idea of using "MAX (d.Start) OVER (PARTITION BY RNr)". And this resulted in the following query:
--Sample Data
DECLARE @dataset TABLE
(
RNr INT
,PNr INT
,Duration_avg_h INT
,START DATETIME
,Finish DATETIME
)
INSERT INTO @dataset
(
RNr
,PNr
,Duration_avg_h
,START
,Finish
)
VALUES
(1, 1, 1, '2019-06-06 16:32:11','2019-06-06 16:33:14')
,(1, 2, 262, '2019-06-06 16:33:14','2019-08-22 17:30:00')
,(1, 3, 51, '2019-08-22 17:30:00',NULL)
,(1, 4, 504, NULL,NULL)
,(1, 5, 29, NULL,NULL)
,(2, 1, 1, '2019-06-06 16:32:11', NULL)
,(2, 2, 124, NULL,NULL)
,(2, 3, 45, NULL,NULL)
,(2, 4, 89, NULL,NULL)
,(2, 5, 19, NULL,NULL)
,(2, 6, 1565, NULL,NULL)
,(2, 7, 24, NULL,NULL)
SELECT RNr,
PNr,
Duration_avg_h,
Start,
Finish,
--Start_max,
--Finish_bit,
--Duration_row_h,
CASE WHEN Start IS NOT NULL THEN Start ELSE DATEADD(HH,(Duration_row_h - MAX(Duration_row_h*Finish_bit) OVER (PARTITION BY RNr) - Duration_avg_h), Start_max) END as Predicted_Start,
CASE WHEN Finish IS NOT NULL THEN Finish ELSE DATEADD(HH,(Duration_row_h - MAX(Duration_row_h*Finish_bit) OVER (PARTITION BY RNr)), Start_max) END as Predicted_Finish
FROM ( SELECT
RNr,
PNr,
Duration_avg_h,
--Convert to a short DATETIME format
CONVERT(DATETIME2(0),Start) as Start,
CONVERT(DATETIME2(0),Finish) as Finish,
--Get MAX start time for each row
Start_max = MAX (CONVERT(DATETIME2(0),d.Start)) OVER (PARTITION BY RNr),
--If process is finished then 1
Finish_bit = (CASE WHEN d.Finish IS NULL THEN 0 ELSE 1 END),
--continuously count the Duration of all processes in the row
SUM(Duration_avg_h) over (PARTITION BY RNr ORDER BY RNr, PNr ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Duration_row_h
FROM @dataset AS d
) AS e
ORDER BY
RNr,
PNr
This query takes into account changes in start and stop times, and from those calculates the prediction for the upcoming processes.
RNr PNr Duration_avg_h Start Finish Predicted_Start Predicted_Finish
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14 2019-06-06 16:32:11 2019-06-06 16:33:14
1 2 262 2019-06-06 16:33:14 2019-08-22 17:30:00 2019-06-06 16:33:14 2019-08-22 17:30:00
1 3 51 2019-08-22 17:30:00 NULL 2019-08-22 17:30:00 2019-08-24 20:30:00
1 4 504 NULL NULL 2019-08-24 20:30:00 2019-09-14 20:30:00
1 5 29 NULL NULL 2019-09-14 20:30:00 2019-09-16 01:30:00
2 1 1 2019-06-06 16:32:11 NULL 2019-06-06 16:32:11 2019-06-06 17:32:11
2 2 124 NULL NULL 2019-06-06 17:32:11 2019-06-11 21:32:11
2 3 45 NULL NULL 2019-06-11 21:32:11 2019-06-13 18:32:11
2 4 89 NULL NULL 2019-06-13 18:32:11 2019-06-17 11:32:11
2 5 19 NULL NULL 2019-06-17 11:32:11 2019-06-18 06:32:11
2 6 1565 NULL NULL 2019-06-18 06:32:11 2019-08-22 11:32:11
2 7 24 NULL NULL 2019-08-22 11:32:11 2019-08-23 11:32:11
I think this approach is still complicated. Does anyone know a simpler query?

SQL Server 2012 - Group data by varying timeslots

I have some data to analyze which is at half-hour granularity, but I would like to group it by 2, 3, 6, and 12 hours, as well as 2 days and 1 week, to make more meaningful comparisons.
|DateTime | Value |
|01 Jan 2013 00:00 | 1 |
|01 Jan 2013 00:30 | 1 |
|01 Jan 2013 01:00 | 1 |
|01 Jan 2013 01:30 | 1 |
|01 Jan 2013 02:00 | 2 |
|01 Jan 2013 02:30 | 2 |
|01 Jan 2013 03:00 | 2 |
|01 Jan 2013 03:30 | 2 |
E.g. the 2-hour grouped result will be:
|DateTime | Value |
|01 Jan 2013 00:00 | 4 |
|01 Jan 2013 02:00 | 8 |
To get the 2-hourly grouped result, I thought of this code:
CASE
WHEN DatePart(HOUR, DateTime) % 2 = 0 THEN
CAST(CAST(DatePart(HOUR, DateTime) AS varchar) + ':00' AS TIME)
ELSE
CAST(CAST(DatePart(HOUR, DateTime) AS int) - 1 AS varchar) + ':00'
END AS Time
This seems to work alright, but I can't think of a way to generalize this to 3, 6, and 12 hours.
For 6 or 12 hours I could just use CASE statements and achieve the result, but is there any way to generalize it so that I can achieve 2-, 3-, 6-, and 12-hour granularity, plus 2-day and week-level granularity? By generalize, I mean I could pass a variable with the desired granularity to the same query, rather than building different queries with different CASE statements.
Is this possible? Please provide some pointers.
Thanks a lot!
I think you can use
Declare @Resolution int = 3 -- resolution in hours
Select
DateAdd(Hour,
DateDiff(Hour, 0, DateTime) / @Resolution * @Resolution, -- integer arithmetic
0) as bucket,
Sum(Value)
From
table
Group By
DateAdd(Hour,
DateDiff(Hour, 0, DateTime) / @Resolution * @Resolution, -- integer arithmetic
0)
Order By
bucket
This calculates the number of hours since a known fixed date, rounds down to the resolution size you're interested in, then adds them back on to the fixed date.
It will miss buckets out, though, if you have no data in them.
Example Fiddle
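The same trick generalizes to the 2-day and 1-week cases by switching the datepart to days; since day zero (1900-01-01) falls on a Monday, a divisor of 7 yields Monday-based weeks. A sketch, using the same DateTime and Value columns and table placeholder as above:
Declare @Days int = 2 -- 2 for 2-day buckets, 7 for Monday-based weeks
Select
DateAdd(Day,
DateDiff(Day, 0, DateTime) / @Days * @Days, -- integer arithmetic
0) as bucket,
Sum(Value)
From
table
Group By
DateAdd(Day,
DateDiff(Day, 0, DateTime) / @Days * @Days,
0)
Order By
bucket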