I have a table of time spans that overlap each other. I want to generate a table that covers the same time spans but doesn't overlap.
For example, say I have a table like this:
Start,End
1, 4
3, 5
7, 8
2, 4
I want a new table like this:
Start,End
1, 5
7, 8
What is the SQL query to do this?
Tested on spark-sql version 1.5.2.
(and with small changes - on Teradata, Oracle, PostgreSQL and SQL Server)
To guarantee the correctness of this solution, the order by clauses in the two analytic functions must be identical and deterministic. If you have an Id column, use order by `Start`,`Id` instead of order by `Start`,`End`.
select min(`Start`) as `Start`
,max(`End`) as `End`
from (select `Start`,`End`
,count(is_gap) over
(
order by `Start`,`End`
rows unbounded preceding
) + 1 as range_seq
from (select `Start`,`End`
,case
when max(`End`) over
(
order by `Start`,`End`
rows between unbounded preceding
and 1 preceding
) < `Start`
then 1
end is_gap
from mytable
) t
) t
group by range_seq
order by `Start`
+-------+-----+
| Start | End |
+-------+-----+
| 1 | 5 |
+-------+-----+
| 7 | 8 |
+-------+-----+
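For readers who want to trace the logic outside the database, here is a minimal Python sketch of the same idea: walk the spans ordered by start, keep a running maximum of the end values seen so far, and open a new range whenever that maximum is smaller than the current start. The function name merge_spans is made up for this illustration and is not part of the answer above.

```python
def merge_spans(spans):
    """Merge overlapping (start, end) spans, mirroring the SQL logic."""
    merged = []
    running_max_end = None  # plays the role of max(End) over all preceding rows
    for start, end in sorted(spans):  # same as: order by Start, End
        if running_max_end is None or running_max_end < start:
            # is_gap = 1: no earlier span reaches this one, so a new range starts
            merged.append([start, end])
            running_max_end = end
        else:
            # same range_seq: the group's End is the largest End seen in it
            merged[-1][1] = max(merged[-1][1], end)
            running_max_end = max(running_max_end, end)
    return [tuple(r) for r in merged]


print(merge_spans([(1, 4), (3, 5), (7, 8), (2, 4)]))
```

As in the SQL, spans that merely touch (for example 1-4 and 4-5) are merged, because the gap test uses a strict less-than.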
I want to fetch the difference in "Data" column between two consecutive rows. For example, need Row2-Row1 ( 1902.4-1899.66) , Row 3-Row 2 and so on. The difference should be stored in a new column.
+----------+---------+-------+-----------------------+
| Name     | Data    | meter | Time                  |
+----------+---------+-------+-----------------------+
| Boiler-1 | 1899.66 | 1     | 5/16/2019 12:00:00 AM |
| Boiler-1 | 1902.4  | 1     | 5/16/2019 12:15:00 AM |
| Boiler-1 | 1908.1  | 1     | 5/16/2019 12:15:00 AM |
| Boiler-1 | 1911.7  | 6     | 5/16/2019 12:15:00 AM |
| Boiler-1 | 1926.4  | 6     | 5/16/2019 12:15:00 AM |
+----------+---------+-------+-----------------------+
The thing is, the table shown above is not a stored table: it is the result of a SELECT query that joins two different tables, along the lines of "select name, data, unitId, Timestamp from table t1 join table t2.....". So is there any way for me to calculate the difference in the "data" column between consecutive rows without first storing this result in a table?
I use SQL Server 2008, so the LEAD/LAG functions cannot be used.
The equivalent in SQL Server 2008 uses apply -- and it can be expensive:
with t as (
<your query here>
)
select t.*,
(t.data - tprev.data) as diff
from t outer apply
(select top (1) tprev.*
from t tprev
where tprev.name = t.name and
tprev.boiler = t.boiler and
tprev.time < t.time
order by tprev.time desc
) tprev;
This assumes that you want the previous row when the name and boiler are the same. You can adjust the correlation clause if you have different groupings in mind.
I am not claiming that this is the best approach; it is just another option for SQL Server versions before 2012. From SQL Server 2012 onward it is easy to do the same with the built-in LEAD and LAG functions. Anyway, for small and medium data sets, you can consider the script below as well :)
Note: This is just an idea for you.
WITH CTE(Name,Data)
AS
(
SELECT 'Boiler-1' ,1899.66 UNION ALL
SELECT 'Boiler-1',1902.4 UNION ALL
SELECT 'Boiler-1',1908.1 UNION ALL
SELECT 'Boiler-1',1911.7 UNION ALL
SELECT 'Boiler-1',1926.4
--Replace above select statement with your query
)
--The first row has no previous row, so ISNULL makes its Diff equal to its own Data
SELECT A.Name,A.Data,A.Data-ISNULL(B.Data,0) AS [Diff]
FROM
(
--ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) just numbers the rows in the order
--they happen to be returned. That order is not guaranteed, so use a real
--ORDER BY column (for example a timestamp) if you have one.
SELECT *,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) RN FROM CTE
)A
LEFT JOIN
(
SELECT *,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) RN FROM CTE
) B
--Join each row (A) to the previous row (B) so that Diff = current - previous
ON A.RN = B.RN+1
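To make the row-pairing idea concrete, here is a small Python sketch that pairs each value with its predecessor, which is what joining row number n to row number n - 1 amounts to. It is only an illustration: the function name consecutive_diffs is invented here, the first row is given None rather than a substituted 0, and Decimal is used so the printed differences are exact.

```python
from decimal import Decimal


def consecutive_diffs(values):
    """Return current - previous for each value; the first value has no predecessor."""
    if not values:
        return []
    diffs = [None]  # row 1 has nothing to subtract
    for prev, cur in zip(values, values[1:]):
        diffs.append(cur - prev)
    return diffs


data = [Decimal(s) for s in ("1899.66", "1902.4", "1908.1", "1911.7", "1926.4")]
for value, diff in zip(data, consecutive_diffs(data)):
    print(value, diff)
```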
Using PostgreSQL v9.4.5 from Shell I created a database called moments in psql by running create database moments. I then created a moments table:
CREATE TABLE moments
(
id SERIAL4 PRIMARY KEY,
moment_type BIGINT NOT NULL,
flag BIGINT NOT NULL,
time TIMESTAMP NOT NULL,
UNIQUE(moment_type, time)
);
INSERT INTO moments (moment_type, flag, time) VALUES (1, 7, '2016-10-29 12:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (1, -30, '2016-10-29 13:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (3, 5, '2016-10-29 14:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (2, 9, '2016-10-29 18:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (2, -20, '2016-10-29 17:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (3, 10, '2016-10-29 16:00:00');
I run select * from moments to view the table:
Moments Table
id | moment_type | flag | time
----+-------------+------+---------------------
1 | 1 | 7 | 2016-10-29 12:00:00
2 | 1 | -30 | 2016-10-29 13:00:00
3 | 3 | 5 | 2016-10-29 14:00:00
4 | 2 | 9 | 2016-10-29 18:00:00
5 | 2 | -20 | 2016-10-29 17:00:00
6 | 3 | 10 | 2016-10-29 16:00:00
I then try to write an SQL query that produces the following output, whereby for each pair of duplicate moment_type values it returns the difference between the flag value of the moment_type having the most recent timestamp value, and the flag value of the second most recent timestamp value, and lists the results in ascending order by moment_type.
Expected SQL Query Output
moment_type | flag |
------------+------+
1 | -37 | (i.e. -30 - 7)
2 | 29 | (i.e. 9 - -20)
3 | 5 | (i.e. 10 - 5)
The SQL query that I came up with is as follows. It uses a WITH query to define multiple Common Table Expression (CTE) subqueries for use as temporary tables in the larger SELECT query at the end. I also use an SQL function to calculate the difference between two of the subquery outputs (alternatively, I think I could have just used DIFFERENCE(most_recent_flag, second_most_recent_flag) AS flag instead of the function):
CREATE FUNCTION difference(most_recent_flag, second_most_recent_flag) RETURNS numeric AS $$
SELECT $1 - $2;
$$ LANGUAGE SQL;
-- get two flags that have the most recent timestamps
WITH two_most_recent_flags AS (
SELECT moments.flag
FROM moments
ORDER BY moments.time DESC
LIMIT 2
),
-- get one flag that has the most recent timestamp
most_recent_flag AS (
SELECT *
FROM two_most_recent_flags
ORDER BY flag DESC
LIMIT 1
),
-- get one flag that has the second most recent timestamp
second_most_recent_flag AS (
SELECT *
FROM two_most_recent_flags
ORDER BY flag ASC
LIMIT 1
)
SELECT DISTINCT ON (moments.moment_type)
moments.moment_type,
difference(most_recent_flag, second_most_recent_flag) AS flag
FROM moments
ORDER BY moment_type ASC
LIMIT 2;
But when I run the above SQL query in PostgreSQL, it returns the following error:
ERROR: column "most_recent_flag" does not exist
LINE 21: difference(most_recent_flag, second_most_recent_flag) AS fla...
Question
What techniques can I use and how may I apply them to overcome this error, and calculate and display the differences in the flag column to achieve the Expected SQL Query Output?
Note: Perhaps the Window Function may be used somehow as it performs calculations across table rows
Use the lag() window function:
select moment_type, difference
from (
select *, flag- lag(flag) over w difference
from moments
window w as (partition by moment_type order by time)
) s
where difference is not null
order by moment_type
moment_type | difference
-------------+------------
1 | -37
2 | 29
3 | 5
(3 rows)
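If it helps to see what lag() with that window does, here is a rough Python equivalent: sort by moment_type and time, then within each moment_type subtract the previous flag from the current one and skip the first row of each group (where lag() would return NULL). The function name lag_differences is hypothetical; the rows are the ones inserted in the question.

```python
from itertools import groupby
from operator import itemgetter

# (moment_type, flag, time) rows from the question
rows = [
    (1, 7, "2016-10-29 12:00:00"),
    (1, -30, "2016-10-29 13:00:00"),
    (3, 5, "2016-10-29 14:00:00"),
    (2, 9, "2016-10-29 18:00:00"),
    (2, -20, "2016-10-29 17:00:00"),
    (3, 10, "2016-10-29 16:00:00"),
]


def lag_differences(rows):
    """Emulate flag - lag(flag) over (partition by moment_type order by time)."""
    out = []
    # ISO-formatted timestamps sort correctly as plain strings
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for moment_type, group in groupby(ordered, key=itemgetter(0)):
        prev_flag = None
        for _, flag, _ in group:
            if prev_flag is not None:  # same as: where difference is not null
                out.append((moment_type, flag - prev_flag))
            prev_flag = flag
    return out


print(lag_differences(rows))
```

Like the SQL, this yields one row per consecutive pair, so a moment_type with more than two rows would produce more than one difference.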
One method is to use conditional aggregation. The window function row_number() can be used to identify the first and last time values:
select m.moment_type,
(max(case when seqnum_desc = 1 then flag end) -
min(case when seqnum_asc = 1 then flag end)
)
from (select m.*,
row_number() over (partition by m.moment_type order by m.time) as seqnum_asc,
row_number() over (partition by m.moment_type order by m.time desc) as seqnum_desc
from moments m
) m
group by m.moment_type;
I would like to determine the number of consecutive absences as per the following table. Initial research suggests I may be able to achieve this using a window function. For the data provided, the longest streak is four consecutive occurrences. Please can you advise how I can set a running absence total as a separate column.
create table events (eventdate date, absence int);
insert into events values ('2014-10-01', 0);
insert into events values ('2014-10-08', 1);
insert into events values ('2014-10-15', 1);
insert into events values ('2014-10-22', 0);
insert into events values ('2014-11-05', 0);
insert into events values ('2014-11-12', 1);
insert into events values ('2014-11-19', 1);
insert into events values ('2014-11-26', 1);
insert into events values ('2014-12-03', 1);
insert into events values ('2014-12-10', 0);
Based on Gordon Linoff's answer here, you could do:
SELECT TOP 1
MIN(eventdate) AS spanStart ,
MAX(eventdate) AS spanEnd,
COUNT(*) AS spanLength
FROM ( SELECT e.* ,
( ROW_NUMBER() OVER ( ORDER BY eventdate )
- ROW_NUMBER() OVER ( PARTITION BY absence ORDER BY eventdate ) ) AS grp
FROM #events e
) t
GROUP BY grp ,
absence
HAVING absence = 1
ORDER BY COUNT(*) DESC;
Which returns:
spanStart | spanEnd | spanLength
---------------------------------------
2014-11-12 |2014-12-03 | 4
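The trick in that query is that the overall row number minus the row number within each absence value stays constant for as long as the absence value does not change. A small Python sketch of the same grouping, using the sample data from the question (the function name longest_absence_streak is made up for this illustration):

```python
from collections import defaultdict

events = [
    ("2014-10-01", 0), ("2014-10-08", 1), ("2014-10-15", 1),
    ("2014-10-22", 0), ("2014-11-05", 0), ("2014-11-12", 1),
    ("2014-11-19", 1), ("2014-11-26", 1), ("2014-12-03", 1),
    ("2014-12-10", 0),
]


def longest_absence_streak(events):
    """Return (spanStart, spanEnd, spanLength) of the longest run of absence = 1."""
    per_value_rn = defaultdict(int)  # ROW_NUMBER() OVER (PARTITION BY absence ...)
    groups = defaultdict(list)
    for overall_rn, (eventdate, absence) in enumerate(sorted(events), start=1):
        per_value_rn[absence] += 1
        grp = overall_rn - per_value_rn[absence]  # constant within one streak
        groups[(grp, absence)].append(eventdate)
    streaks = [
        (dates[0], dates[-1], len(dates))
        for (grp, absence), dates in groups.items()
        if absence == 1  # HAVING absence = 1
    ]
    return max(streaks, key=lambda s: s[2]) if streaks else None


print(longest_absence_streak(events))
```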
You don't specify which RDBMS you are using, but the following works with PostgreSQL's window functions and should be translatable to similar SQL engines:
SELECT eventdate,
absence,
-- XXX We take advantage of the fact that absence is an int (1 or 0)
-- otherwise we'd COUNT(1) OVER (...) and only conditionally
-- display the count if absence = 1
SUM(absence) OVER (PARTITION BY span ORDER BY eventdate)
AS consecutive_absences
FROM (SELECT spanstarts.*,
SUM(newspan) OVER (ORDER BY eventdate) AS span
FROM (SELECT events.*,
CASE LAG(absence) OVER (ORDER BY eventdate)
WHEN absence THEN NULL
ELSE 1 END AS newspan
FROM events)
spanstarts
) eventsspans
ORDER BY eventdate;
which gives you:
eventdate | absence | consecutive_absences
------------+---------+----------------------
2014-10-01 | 0 | 0
2014-10-08 | 1 | 1
2014-10-15 | 1 | 2
2014-10-22 | 0 | 0
2014-11-05 | 0 | 0
2014-11-12 | 1 | 1
2014-11-19 | 1 | 2
2014-11-26 | 1 | 3
2014-12-03 | 1 | 4
2014-12-10 | 0 | 0
There is an excellent dissection of the above approach on the pgsql-general mailing list. The short of it is:
1. The innermost query (spanstarts) uses LAG to find the start of new spans of absences, whether a span of 1's or a span of 0's.
2. The next query (eventsspans) identifies those spans by summing the number of new spans that have come before us. So, we find span 1, then span 2, then 3, etc.
3. The outer query then counts the number of absences in each span.
As the SQL comment says, we cheat a little bit on #3, taking advantage of its data type, but the net effect is the same.
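The three steps can also be traced in a few lines of Python. This is only a sketch of the logic, not a replacement for the query; the function name running_absences is invented here and the data is the sample from the question.

```python
events = [
    ("2014-10-01", 0), ("2014-10-08", 1), ("2014-10-15", 1),
    ("2014-10-22", 0), ("2014-11-05", 0), ("2014-11-12", 1),
    ("2014-11-19", 1), ("2014-11-26", 1), ("2014-12-03", 1),
    ("2014-12-10", 0),
]


def running_absences(events):
    """Return (eventdate, absence, span, consecutive_absences) per row."""
    result = []
    prev_absence = None
    span = 0
    count_in_span = 0
    for eventdate, absence in sorted(events):  # ORDER BY eventdate
        if absence != prev_absence:
            # step 1: LAG(absence) differs, so a new span starts here
            span += 1            # step 2: running sum of the newspan flags
            count_in_span = 0
        count_in_span += absence  # step 3: SUM(absence) within the span so far
        result.append((eventdate, absence, span, count_in_span))
        prev_absence = absence
    return result


for row in running_absences(events):
    print(row)
```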
I don't know what your DBMS is but this is from SQLServer. Hopefully it is of some help : )
-------------------------------------------------------------------------------------------
Query:
--tableRN is used to get the rownumber
;with tableRN as (SELECT a.*, ROW_NUMBER() OVER (ORDER BY a.event) as rn, COUNT(*) as maxRN
FROM absence a GROUP BY a.event, a.absence),
--cte is a recursive CTE that returns the...
--absence value, the level (number of times 1 appeared in a row),
--rn (row number) and total (total count)
cte (absence, level, rn, total) AS (
SELECT 0, 0, 1, 0
UNION ALL
SELECT r.absence,
CASE WHEN c.absence = 1 AND r.absence = 1 THEN level + 1
ELSE 0
END,
c.rn + 1,
CASE WHEN c.level = 1 THEN total + 1
ELSE total
END
FROM cte c JOIN tableRN r ON c.rn + 1 = r.rn)
--This gets you the total count of times there
--were consecutive absences (twice or more in a row).
SELECT MAX(c.total) AS Count FROM cte c
-------------------------------------------------------------------------------------------
Results:
|Count|
+-----+
| 2 |
Create a new column called consecutive_absence_count with default 0.
You may write a SQL procedure for the insert: fetch the latest record, retrieve its absence value, and check whether the new record to be inserted is a presence or an absence.
If the latest and the new record have consecutive dates and both are absences (absence = 1 in the sample data), increment consecutive_absence_count; otherwise set it to 0.
I have a table which has an offset and a qty column.
I now want to create a view from that which has an entry for each precise position.
Table:
offset | qty | more_data
-------+---------+-------------
1 | 3 | 'qwer'
2 | 2 | 'asdf'
View:
position | more_data
---------+------------
1 | 'qwer'
2 | 'qwer'
3 | 'qwer'
2 | 'asdf'
3 | 'asdf'
Is that even possible?
I would need to do that for Oracle (8! - 11), MS SQL (2005-) and PostgreSQL (8-)
Based on your input/output:
with t(offset, qty) as (
select 1, 3 from dual
)
select offset + level - 1 position
from t
connect by rownum <= qty
POSITION
--------
1
2
3
For Postgres:
select offst, generate_series(offst, offst + qty - 1) as position
from the_table
order by offst, position;
SQLFiddle: http://sqlfiddle.com/#!10/e70d9/4
I don't have anything as ancient as 8.0, 8.1 or 8.2 around but it should work on those pre-historic versions as well.
Note that offset is a reserved word in Postgres. You should find a different name for that column (the query above uses offst).
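The expected view in the question implies that each row expands to the positions offset through offset + qty - 1. A tiny Python sketch of that expansion, with a hypothetical function name expand_positions and the two sample rows from the question:

```python
def expand_positions(rows):
    """Expand (offset, qty, more_data) rows into one (position, more_data) per position."""
    out = []
    for offset, qty, more_data in rows:
        # positions run from offset to offset + qty - 1 inclusive
        for position in range(offset, offset + qty):
            out.append((position, more_data))
    return out


table = [(1, 3, "qwer"), (2, 2, "asdf")]
for position, more_data in expand_positions(table):
    print(position, more_data)
```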
In Oracle, to answer the specific question (i.e. a table with just the one row):
select rn posn from (
select offset-1+rownum rn from the_table
connect by level between offset and qty
);
In reality, your table will have multiple rows, so you will need to restrict the inner query to 1 object row, otherwise I think you will get huge, incorrect output. If you can provide more details about the table/data a more complete answer could be given.
This answer shows how to produce High/Low/Open/Close values from a ticker:
Retrieve aggregates for arbitrary time intervals
I am trying to implement a solution based on this (PG 9.2), but am having difficulty in getting the correct value for first_value().
So far, I have tried two queries:
SELECT
cstamp,
price,
date_trunc('hour',cstamp) AS h,
floor(EXTRACT(minute FROM cstamp) / 5) AS m5,
min(price) OVER w,
max(price) OVER w,
first_value(price) OVER w,
last_value(price) OVER w
FROM trades
Where date_trunc('hour',cstamp) = timestamp '2013-03-29 09:00:00'
WINDOW w AS (
PARTITION BY date_trunc('hour',cstamp), floor(extract(minute FROM cstamp) / 5)
ORDER BY date_trunc('hour',cstamp) ASC, floor(extract(minute FROM cstamp) / 5) ASC
)
ORDER BY cstamp;
Here's a piece of the result:
cstamp price h m5 min max first last
"2013-03-29 09:19:14";77.00000;"2013-03-29 09:00:00";3;77.00000;77.00000;77.00000;77.00000
"2013-03-29 09:26:18";77.00000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.80000;77.00000
"2013-03-29 09:29:41";77.80000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.80000;77.00000
"2013-03-29 09:29:51";77.00000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.80000;77.00000
"2013-03-29 09:30:04";77.00000;"2013-03-29 09:00:00";6;73.99004;77.80000;73.99004;73.99004
As you can see, 77.8 is not what I believe is the correct value for first_value(), which should be 77.0.
I thought this might be due to the ambiguous ORDER BY in the WINDOW, so I changed this to
ORDER BY cstamp ASC
but this appears to upset the PARTITION as well:
cstamp price h m5 min max first last
"2013-03-29 09:19:14";77.00000;"2013-03-29 09:00:00";3;77.00000;77.00000;77.00000;77.00000
"2013-03-29 09:26:18";77.00000;"2013-03-29 09:00:00";5;77.00000;77.00000;77.00000;77.00000
"2013-03-29 09:29:41";77.80000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.00000;77.80000
"2013-03-29 09:29:51";77.00000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.00000;77.00000
"2013-03-29 09:30:04";77.00000;"2013-03-29 09:00:00";6;77.00000;77.00000;77.00000;77.00000
since the values for max and last now vary within the partition.
What am I doing wrong? Could someone help me better to understand the relation between PARTITION and ORDER within a WINDOW?
Although I have an answer, here's a trimmed-down pg_dump which will allow anyone to recreate the table. The only thing that's different is the table name.
CREATE TABLE wtest (
cstamp timestamp without time zone,
price numeric(10,5)
);
COPY wtest (cstamp, price) FROM stdin;
2013-03-29 09:04:54 77.80000
2013-03-29 09:04:50 76.98000
2013-03-29 09:29:51 77.00000
2013-03-29 09:29:41 77.80000
2013-03-29 09:26:18 77.00000
2013-03-29 09:19:14 77.00000
2013-03-29 09:19:10 77.00000
2013-03-29 09:33:50 76.00000
2013-03-29 09:33:46 76.10000
2013-03-29 09:33:15 77.79000
2013-03-29 09:30:08 77.80000
2013-03-29 09:30:04 77.00000
\.
SQL Fiddle
All the functions you used act on the window frame, not on the partition. If the frame clause is omitted and the window has an ORDER BY, the frame ends at the current row (more precisely, at its last peer). To make the window frame span the whole partition, declare it in the frame clause (range ...):
SELECT
cstamp,
price,
date_trunc('hour',cstamp) AS h,
floor(EXTRACT(minute FROM cstamp) / 5) AS m5,
min(price) OVER w,
max(price) OVER w,
first_value(price) OVER w,
last_value(price) OVER w
FROM trades
Where date_trunc('hour',cstamp) = timestamp '2013-03-29 09:00:00'
WINDOW w AS (
PARTITION BY date_trunc('hour',cstamp) , floor(extract(minute FROM cstamp) / 5)
ORDER BY cstamp
range between unbounded preceding and unbounded following
)
ORDER BY cstamp;
Here's a quick query to illustrate the behaviour:
select
v,
first_value(v) over w1 f1,
first_value(v) over w2 f2,
first_value(v) over w3 f3,
last_value (v) over w1 l1,
last_value (v) over w2 l2,
last_value (v) over w3 l3,
max (v) over w1 m1,
max (v) over w2 m2,
max (v) over w3 m3,
max (v) over () m4
from (values(1),(2),(3),(4)) t(v)
window
w1 as (order by v),
w2 as (order by v rows between unbounded preceding and current row),
w3 as (order by v rows between unbounded preceding and unbounded following)
The output of the above query can be seen here (SQLFiddle here):
| V | F1 | F2 | F3 | L1 | L2 | L3 | M1 | M2 | M3 | M4 |
|---|----|----|----|----|----|----|----|----|----|----|
| 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 4 | 4 |
| 2 | 1 | 1 | 1 | 2 | 2 | 4 | 2 | 2 | 4 | 4 |
| 3 | 1 | 1 | 1 | 3 | 3 | 4 | 3 | 3 | 4 | 4 |
| 4 | 1 | 1 | 1 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
Few people think of the implicit frames that are applied to window functions that take an ORDER BY clause. In this case, windows default to the frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which behaves like ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW here because the values of v are distinct. Think about it this way:
On the row with v = 1 the ordered window's frame spans v IN (1)
On the row with v = 2 the ordered window's frame spans v IN (1, 2)
On the row with v = 3 the ordered window's frame spans v IN (1, 2, 3)
On the row with v = 4 the ordered window's frame spans v IN (1, 2, 3, 4)
If you want to prevent that behaviour, you have two options:
Use an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause for ordered window functions
Use no ORDER BY clause in those window functions that allow for omitting them (as MAX(v) OVER())
More details are explained in this article about LEAD(), LAG(), FIRST_VALUE() and LAST_VALUE()
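To see the two kinds of frame side by side, here is a minimal Python emulation for the four values used above. It assumes the values are already ordered and distinct, so the default frame is simply "everything up to the current row". The function name frame_stats is made up for the illustration.

```python
def frame_stats(values):
    """For each v return (v, f1, f3, l1, l3, m1, m3) as in the table above."""
    out = []
    for i, v in enumerate(values):
        up_to_current = values[: i + 1]  # frame of w1 / w2
        whole_partition = values         # frame of w3
        out.append((
            v,
            up_to_current[0], whole_partition[0],      # first_value
            up_to_current[-1], whole_partition[-1],    # last_value
            max(up_to_current), max(whole_partition),  # max
        ))
    return out


for row in frame_stats([1, 2, 3, 4]):
    print(row)
```

first_value is the same under both frames because both start at the beginning of the partition; only last_value and max change with the frame end.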
The result of max() as a window function is based on the frame definition.
The default frame definition (with ORDER BY) is from the start of the partition up to the last peer of the current row (including the current row and possibly more rows ranking equally according to ORDER BY). In the absence of ORDER BY (like in my answer you are referring to), or if ORDER BY treats every row in the partition as equal (like in your first example), all rows in the partition are peers, and max() produces the same result for every row in the partition, effectively considering all rows of the partition.
Per documentation:
The default framing option is RANGE UNBOUNDED PRECEDING, which is the
same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY,
this sets the frame to be all rows from the partition start
up through the current row's last peer. Without ORDER BY, all rows of the
partition are included in the window frame, since all rows become
peers of the current row.
Bold emphasis mine.
The simple solution would be to omit the ORDER BY in the window definition - just like I demonstrated in the example you are referring to.
All the gory details about frame specifications are in the chapter Window Function Calls in the manual.
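The peer rule can be emulated in a few lines of Python. The sketch below takes the three rows of the 09:25-09:29 partition from the question and computes max() and last_value() under the default frame, once ordered by cstamp and once ordered by an expression that is constant within the partition. The helper names range_frame and windowed_max_last are hypothetical, and the stable sort only stands in for the database's row order, which is arbitrary among peers.

```python
def range_frame(rows, key, i):
    """Default frame: partition start through the last peer of row i (rows sorted by key)."""
    last_peer = max(j for j, r in enumerate(rows) if key(r) == key(rows[i]))
    return rows[: last_peer + 1]


def windowed_max_last(rows, key):
    """Return (max(price), last_value(price)) per row under the default frame."""
    ordered = sorted(rows, key=key)
    out = []
    for i in range(len(ordered)):
        prices = [price for _, price in range_frame(ordered, key, i)]
        out.append((max(prices), prices[-1]))
    return out


partition = [("09:26:18", 77.0), ("09:29:41", 77.8), ("09:29:51", 77.0)]

# ORDER BY cstamp: no ties, so the frame grows row by row
print(windowed_max_last(partition, key=lambda r: r[0]))
# ORDER BY a constant: every row is a peer, so the frame is the whole partition
print(windowed_max_last(partition, key=lambda r: 0))
```

With the constant ordering every row sees the same max, as in the first result set in the question; with ORDER BY cstamp the max and last values change from row to row, as in the second.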