Merge continuous rows with Postgresql - sql

I have a slots table like this :
   Column   |            Type             |
------------+-----------------------------+
 id         | integer                     |
 begin_at   | timestamp without time zone |
 end_at     | timestamp without time zone |
 user_id    | integer                     |
and I would like to select merged rows for contiguous time ranges. Let's say I have (simplified) data like:
(1, 5:15, 5:30, 1)
(2, 5:15, 5:30, 2)
(3, 5:30, 5:45, 2)
(4, 5:45, 6:00, 2)
(5, 8:15, 8:30, 2)
(6, 8:30, 8:45, 2)
I would like to know if it's possible to select rows formatted like :
(5:15, 5:30, 1)
(5:15, 6:00, 2) // <======= rows id 2,3 and 4 merged
(8:15, 8:45, 2) // <======= rows id 5 and 6 merged
EDIT:
Here's the SQLfiddle
I'm using Postgresql, version 9.3!
Thank you!

Here is one method for solving this problem. Create a flag that marks whether a record does not overlap or touch the previous record of the same user; such a record is the start of a group. Then take the cumulative sum of this flag and use that for grouping:
select user_id, min(begin_at) as begin_at, max(end_at) as end_at
from (select s.*, sum(startflag) over (partition by user_id order by begin_at) as grp
      from (select s.*,
                   (case when lag(end_at) over (partition by user_id order by begin_at) >= begin_at
                         then 0 else 1
                    end) as startflag
            from slots s
           ) s
     ) s
group by user_id, grp;
Here is a SQL Fiddle.
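To see the flag and running-sum idea at work, here is a minimal sketch that runs an equivalent query through Python's built-in sqlite3 module on the sample rows from the question. It assumes a SQLite build with window functions (3.25 or later) and stores the times as zero-padded text so they compare correctly; the aliases first_begin and last_end are made up for this illustration and are not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slots (id INTEGER, begin_at TEXT, end_at TEXT, user_id INTEGER)"
)
# Sample rows from the question; times are zero-padded text so string
# comparison matches chronological order.
conn.executemany(
    "INSERT INTO slots VALUES (?, ?, ?, ?)",
    [
        (1, "05:15", "05:30", 1),
        (2, "05:15", "05:30", 2),
        (3, "05:30", "05:45", 2),
        (4, "05:45", "06:00", 2),
        (5, "08:15", "08:30", 2),
        (6, "08:30", "08:45", 2),
    ],
)

query = """
SELECT user_id, MIN(begin_at) AS first_begin, MAX(end_at) AS last_end
FROM (
    SELECT f.*,
           SUM(startflag) OVER (PARTITION BY user_id ORDER BY begin_at) AS grp
    FROM (
        SELECT s.*,
               -- 1 when the previous slot of the same user ends before this one begins
               CASE WHEN LAG(end_at) OVER (PARTITION BY user_id ORDER BY begin_at) >= begin_at
                    THEN 0 ELSE 1 END AS startflag
        FROM slots s
    ) f
) g
GROUP BY user_id, grp
ORDER BY user_id, first_begin
"""
merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
```

The first row of each user has no previous row, so LAG returns NULL, the comparison is NULL, and the CASE falls through to 1, which opens the first group.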

Gordon Linoff already provided the answer (I upvoted).
I've used the same approach, but wanted to deal with tsrange type.
So I came up with this construct:
SELECT min(id) b_id, min(begin_at) b_at, max(end_at) e_at, grp, user_id
FROM (
  SELECT t.*, sum(g) OVER (PARTITION BY user_id ORDER BY id) grp
  FROM (
    SELECT s.*, (NOT r -|- lag(r,1,r)
                 OVER (PARTITION BY user_id ORDER BY id))::int g
    FROM (SELECT id,begin_at,end_at,user_id,
                 tsrange(begin_at,end_at,'[)') r FROM slots) s
  ) t
) u
GROUP BY grp, user_id
ORDER BY user_id, grp;
Unfortunately, at the top level one has to use min(begin_at) and max(end_at), as there are no aggregate functions for the range union operator +.
I create ranges with exclusive upper bounds, which allows me to use the "is adjacent to" (-|-) operator. I compare the current tsrange with the one on the previous row, defaulting to the current one when there is no previous row. Then I negate the comparison and cast it to integer, which gives 1 whenever a new group starts.
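The adjacency test and the running sum can be mimicked outside the database. The following is only a rough Python sketch of the idea, not PostgreSQL's implementation: the helper adjacent is a hypothetical stand-in for -|- on non-empty half-open ranges, and the rows are the user 2 slots from the question, ordered by id.

```python
from itertools import accumulate


def adjacent(a, b):
    """Stand-in for -|- on non-empty half-open ranges [lo, hi):
    true when one range ends exactly where the other begins."""
    return a[1] == b[0] or b[1] == a[0]


# (id, begin_at, end_at) for user 2, ordered by id
rows = [
    (2, "05:15", "05:30"),
    (3, "05:30", "05:45"),
    (4, "05:45", "06:00"),
    (5, "08:15", "08:30"),
    (6, "08:30", "08:45"),
]

ranges = [(begin, end) for _, begin, end in rows]
# lag(r, 1, r): the first row has no predecessor and falls back to itself
previous = [ranges[0]] + ranges[:-1]
# (NOT r -|- previous)::int, i.e. 1 where a new group starts
flags = [int(not adjacent(cur, prev)) for cur, prev in zip(ranges, previous)]
# sum(g) OVER (ORDER BY id): running total of the flags
groups = list(accumulate(flags))

print(flags)
print(groups)
```

A range is never adjacent to itself, which is why defaulting the lag to the current row yields 1 on the first row and opens the first group.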

Related

How do I do conditional logic between rows of a bigquery table?

I'm trying to write a query that goes through a table row by row comparing the current row with the next. Then based on a condition being true will perform a calculation which is then output in a column on the same table and a null value if false.
Consider the example above:
Row 8703 will be referred to as Row 1
Row 8704 will be referred to as Row 2
I would like to, if possible, compare Row 1 bookedEnd with Row 2 bookedStart. If they are of equal value (which in this case they are) I would like to subtract Row 2 actualStartdate from Row 1 actualEnddate and output the value in minutes in a separate column named 'difference' on Row 2.
If they are not of equal value (which is true for all other rows in the example above) I would like to output a null value.
For the above table the extra column named difference would have the row values of:
8701 - Null
8702 - Null
8703 - Null
8704 - 12
8705 - Null
Since you are writing to "Row 2", I use the LAG() function so you are comparing on the row you are writing.
with data as (select * from `project.dataset.table`),
lagged as (
select
*,
lag(bookedEnd,1) over(partition by roomID order by Row asc) as prev_bookedEnd,
lag(actualEnddate,1) over(partition by roomID order by Row asc) as prev_actualEnddate
from data
)
select
* except (prev_bookedEnd,prev_actualEnddate),
case when prev_bookedEnd = bookedStart then timestamp_diff(prev_actualEndDate,actualStartdate, minute) else null end as difference
from lagged
What you will want to do in this scenario is use the lead function
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#lead
it would look similar to
SELECT bookedEnd
, CASE WHEN bookedEnd = LEAD(bookedStart) OVER (PARTITION BY roomid ORDER BY Row) then XXXX END as actualStartdate
, CASE WHEN bookedEnd = LEAD(bookedStart) OVER (PARTITION BY roomid ORDER BY Row) then XXXX END as difference
SELECT
*,
IF( LAG(bookedEnd) OVER (PARTITION BY roomId ORDER BY bookedStart) = bookedStart,
TIMESTAMP_DIFF( actualStartdate,
LAG(actualEnddate) OVER (PARTITION BY roomId ORDER BY bookedStart),
MINUTE
),
NULL
) AS difference
FROM `project.dataset.table`
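For readers who want to check the row-to-previous-row logic without a BigQuery project, here is a small Python sketch of what the LAG-based queries compute. The rows are hypothetical (the question's table is not reproduced here) and cover a single room, already ordered by bookedStart; the helper at and the sample values are made up for illustration.

```python
from datetime import datetime


def at(hour, minute):
    # The date itself is arbitrary; only the time of day matters here.
    return datetime(2020, 1, 1, hour, minute)


# Hypothetical rows for one room, ordered by bookedStart
rows = [
    {"Row": 1, "bookedStart": at(9, 0), "bookedEnd": at(10, 0),
     "actualStartdate": at(9, 5), "actualEnddate": at(9, 58)},
    {"Row": 2, "bookedStart": at(10, 0), "bookedEnd": at(11, 0),
     "actualStartdate": at(10, 10), "actualEnddate": at(11, 0)},
    {"Row": 3, "bookedStart": at(13, 0), "bookedEnd": at(14, 0),
     "actualStartdate": at(13, 0), "actualEnddate": at(14, 0)},
]

differences = []
previous = None  # plays the role of LAG(...): the preceding row, if any
for row in rows:
    if previous is not None and previous["bookedEnd"] == row["bookedStart"]:
        delta = row["actualStartdate"] - previous["actualEnddate"]
        differences.append(int(delta.total_seconds() // 60))  # whole minutes
    else:
        differences.append(None)  # no previous row, or bookings not back to back
    previous = row

print(differences)
```

Only the second row follows a booking that ends exactly when it starts, so it is the only one that gets a value.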

How to select k-th record per field in a single SQL query

Please help me with the following problem. I have already spent a week trying to put all the logic into one SQL query but still have no elegant result. I hope the SQL experts can give me a hint.
I have a table with 4 fields: date, expire_month, expire_year and value. The primary key is defined on the first 3 fields. Thus, for a given date, several values are present with different expire_month and expire_year. I need to choose one of those values for every date present in the table.
For example, when I execute a query:
SELECT date, expire_month, expire_year, value FROM futures
WHERE date = '1989-12-01' ORDER BY expire_year, expire_month;
I get a list of values for the same date sorted by expiry (months are coded with letters):
1989-12-01 Z 1989 408.25
1989-12-01 H 1990 408.25
1989-12-01 K 1990 389
1989-12-01 N 1990 359.75
1989-12-01 U 1990 364.5
1989-12-01 Z 1990 375
The correct single value for that date is the k-th record from the top. For example, if k is 2 then the «correct single» record would be:
1989-12-01 H 1990 408.25
How can I select these «correct single» values for every date in my table?
You can do it with rank():
select t.date, t.expire_month, t.expire_year, t.value from (
select *,
rank() over(partition by date order by expire_year, expire_month) rn
from futures
) t
where t.rn = 2
The column rn in the subquery is the rank of each row within its date partition, ordered by expiry. Change 2 to the rank you want.
While forpas's answer is the better one (though I think I'd use row_number() instead of rank() here), window functions are a fairly recent addition to SQLite (added in 3.25). If you're stuck on an older version and can't upgrade, here's an alternative:
SELECT date, expire_month, expire_year, value
FROM futures AS f
WHERE (date, expire_month, expire_year) =
(SELECT f2.date, f2.expire_month, f2.expire_year
FROM futures AS f2
WHERE f.date = f2.date
ORDER BY f2.expire_year, f2.expire_month
LIMIT 1 OFFSET 1)
ORDER BY date;
The OFFSET value is 1 less than the Kth row - so 1 for the second row, 2 for the third row, etc.
It executes a correlated subquery for every row in the table, though, which isn't ideal. Hopefully your composite primary key columns are in the order date, expire_year, expire_month, which will help a lot by eliminating the need for additional sorting in it.
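As a quick check of the OFFSET approach, this sketch runs the query through Python's sqlite3 module with k passed as a parameter. It assumes a SQLite build with row-value support (3.15 or later). The first three rows come from the question; the 1989-12-04 rows and the helper name kth_per_date are hypothetical, added only so that more than one date is present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE futures (
           date TEXT, expire_month TEXT, expire_year INTEGER, value REAL,
           PRIMARY KEY (date, expire_year, expire_month))"""
)
conn.executemany(
    "INSERT INTO futures VALUES (?, ?, ?, ?)",
    [
        ("1989-12-01", "Z", 1989, 408.25),
        ("1989-12-01", "H", 1990, 408.25),
        ("1989-12-01", "K", 1990, 389.0),
        ("1989-12-04", "Z", 1989, 410.0),  # hypothetical row
        ("1989-12-04", "H", 1990, 409.5),  # hypothetical row
    ],
)


def kth_per_date(conn, k):
    """Return the k-th record (1-based, by expiry) for every date that has one."""
    sql = """
    SELECT date, expire_month, expire_year, value
    FROM futures AS f
    WHERE (date, expire_month, expire_year) =
          (SELECT f2.date, f2.expire_month, f2.expire_year
           FROM futures AS f2
           WHERE f2.date = f.date
           ORDER BY f2.expire_year, f2.expire_month
           LIMIT 1 OFFSET ?)
    ORDER BY date
    """
    return conn.execute(sql, (k - 1,)).fetchall()


second = kth_per_date(conn, 2)
third = kth_per_date(conn, 3)
print(second)
print(third)
```

Dates with fewer than k records simply drop out, because the subquery returns no row for them and the comparison is never true.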
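You can try the following query (Oracle syntax). Because ROWNUM is assigned before ORDER BY is applied, the rows are sorted in an inner query and numbered one level up:
select * from
(
select rownum seq, t.* from
(
SELECT date1, expire_month, expire_year, value FROM testtable
WHERE date1 = to_date('1989-12-01','yyyy-mm-dd')
ORDER BY expire_year, expire_month
) t
)
where seq=2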

SQL Server iterating through time series data

I am using SQL Server and wondering if it is possible to iterate through time series data until specific condition is met and based on that label my data in other table?
For example, let's say I have a table like this:
Id Date Some_kind_of_event
+--+----------+------------------
1 |2018-01-01|dsdf...
1 |2018-01-06|sdfs...
1 |2018-01-29|fsdfs...
2 |2018-05-10|sdfs...
2 |2018-05-11|fgdf...
2 |2018-05-12|asda...
3 |2018-02-15|sgsd...
3 |2018-02-16|rgw...
3 |2018-02-17|sgs...
3 |2018-02-28|sgs...
What I want to get, is to calculate for each key the difference between two adjacent events and find out if there exists difference > 10 days between these two adjacent events. In case yes, I want to stop iterating for that specific key and put label 'inactive', otherwise 'active' in my other table. After we finish with one key, we start with another.
So for example id = 1 would get label 'inactive' because there exists two dates which have difference bigger that 10 days. The final result would be like that:
Id Label
+--+----------+
1 |inactive
2 |active
3 |inactive
Any ideas how to do that? Is it possible to do it with SQL?
When working with a DBMS you need to get away from the idea of thinking iteratively. Instead you need to try and think in sets. "Instead of thinking about what you want to do to a row, think about what you want to do to a column."
If I understand correctly, is this what you're after?
CREATE TABLE SomeEvent (ID int, EventDate date, EventName varchar(10));
INSERT INTO SomeEvent
VALUES (1,'20180101','dsdf...'),
(1,'20180106','sdfs...'),
(1,'20180129','fsdfs..'),
(2,'20180510','sdfs...'),
(2,'20180511','fgdf...'),
(2,'20180512','asda...'),
(3,'20180215','sgsd...'),
(3,'20180216','rgw....'),
(3,'20180217','sgs....'),
(3,'20180228','sgs....');
GO
WITH Gaps AS(
SELECT *,
DATEDIFF(DAY,LAG(EventDate) OVER (PARTITION BY ID ORDER BY EventDate),EventDate) AS EventGap
FROM SomeEvent)
SELECT ID,
CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;
GO
DROP TABLE SomeEvent;
GO
This assumes you are using SQL Server 2012+, as it uses the LAG function, and SQL Server 2008 has less than 12 months of any kind of support.
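The same set-based reasoning (compute every gap per key, then look only at the largest one) can be sketched in a few lines of Python using the sample data from the question. This only illustrates the logic and does not replace the T-SQL above; the variable names are made up for the example.

```python
from collections import defaultdict
from datetime import date

# (Id, Date) pairs from the question
events = [
    (1, date(2018, 1, 1)), (1, date(2018, 1, 6)), (1, date(2018, 1, 29)),
    (2, date(2018, 5, 10)), (2, date(2018, 5, 11)), (2, date(2018, 5, 12)),
    (3, date(2018, 2, 15)), (3, date(2018, 2, 16)), (3, date(2018, 2, 17)),
    (3, date(2018, 2, 28)),
]

dates_by_id = defaultdict(list)
for key, day in events:
    dates_by_id[key].append(day)

labels = {}
for key, days in dates_by_id.items():
    days.sort()
    # Gap in days between each event and the one before it (what LAG provides)
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    # A key with a single event has no gaps and is treated as active here
    labels[key] = "inactive" if max(gaps, default=0) > 10 else "active"

print(labels)
```

No early exit per key is needed: taking the maximum gap answers the question "does any gap exceed 10 days" in one pass.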
Try this. Note, replace #MyTable with your actual table.
WITH Diffs AS (
SELECT
Id
,DATEDIFF(DAY,[Date],LEAD([Date]) OVER (PARTITION BY [Id] ORDER BY [Date])) Diff
FROM #MyTable)
SELECT
Id
,CASE WHEN MAX(Diff) > 10 THEN 'Inactive' ELSE 'Active' END AS Label
FROM Diffs
GROUP BY Id
Just to share another approach (without a CTE).
SELECT
ID
, CASE WHEN SUM(TotalDays) = (MAX(CNT) - 1) THEN 'Active' ELSE 'Inactive' END Label
FROM (
SELECT
ID
, EventDate
, CASE WHEN DATEDIFF(DAY, EventDate, LEAD(EventDate) OVER(PARTITION BY ID ORDER BY EventDate)) <= 10 THEN 1 ELSE 0 END TotalDays
, COUNT(ID) OVER(PARTITION BY ID) CNT
FROM EventsTable
) D
GROUP BY ID
The method counts how many records each ID has (CNT) and sets TotalDays from the difference in days between the current date and the next one: 1 if the gap is at most 10 days, otherwise 0 (this includes the last row of each ID, which has no next date).
Then it compares: if the sum of TotalDays equals the number of records for that ID minus one, every gap was within the limit and the label is Active; otherwise it is Inactive.
This is just another approach that doesn't use a CTE.

Group by one column and substring of its own

I am unable to write a SQL query for my problem.
I have a table with 2 columns: item code and expiration date.
Itemcode. Expiration
Abc123. 2014-08-08
Abc234. 2014-07-07
Cfg345. 2014-06-06
Cfg567. 2014-07-08
The output should be grouped by the first 3 characters of the item code and show the minimum expiration date, like below:
Abc. 2014-07-07. Abc234
Cfg. 2014-06-06. Cfg345
Thanks
EDITED:
The query goes like this which actually is joining multiple tables to fetch the itemcode and expiration.
select substr(y.itemcode,1,3),
min(x.expiration_date) expiry,
y.itemcode
from X x, Y y
where y.id = x.id
and x.number in
(select number from xyz
where id = x.id
and codec in ('C', 'M', 'T', 'H')
)
group by substr(y.itemcode,1,3), y.itemcode
I am not familiar with "m". Here is an ANSI standard SQL solution:
select substring(itemcode, 1, 3), expiration, itemcode
from (select t.*,
row_number() over (partition by substring(itemcode, 1, 3)
order by expiration asc
) as seqnum
from table t
) t
where seqnum = 1;
Most databases support this functionality. Some might have slightly different names (such as substr() or left() for the substring operation).
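What the row_number() query selects is the row with the earliest expiration within each three-character prefix. A minimal Python sketch of that selection on the question's sample data follows; the dates are kept as ISO strings, which sort chronologically, and the dictionary name earliest is made up for the example.

```python
# (itemcode, expiration) rows from the question; ISO dates compare correctly as text
rows = [
    ("Abc123", "2014-08-08"),
    ("Abc234", "2014-07-07"),
    ("Cfg345", "2014-06-06"),
    ("Cfg567", "2014-07-08"),
]

earliest = {}  # prefix -> (expiration, itemcode) of the earliest-expiring item
for itemcode, expiration in rows:
    prefix = itemcode[:3]
    if prefix not in earliest or expiration < earliest[prefix][0]:
        earliest[prefix] = (expiration, itemcode)

for prefix, (expiration, itemcode) in sorted(earliest.items()):
    print(prefix, expiration, itemcode)
```

Keeping the whole row per prefix, rather than aggregating only the date, is what lets the full item code appear next to the minimum expiration.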

SQL query group by nearby timestamp

I have a table with a timestamp column. I would like to be able to group by an identifier column (e.g. cusip), sum over another column (e.g. quantity), but only for rows that are within 30 seconds of each other, i.e. not in fixed 30 second bucket intervals. Given the data:
cusip| quantity| timestamp
============|=========|=============
BE0000310194| 100| 16:20:49.000
BE0000314238| 50| 16:38:38.110
BE0000314238| 50| 16:46:21.323
BE0000314238| 50| 16:46:35.323
I would like to write a query that returns:
cusip| quantity
============|=========
BE0000310194| 100
BE0000314238| 50
BE0000314238| 100
Edit:
In addition, it would greatly simplify things if I could also get the MIN(timestamp) out of the query.
Starting from Sean G's solution, I removed the GROUP BY over the whole table and readjusted a few parts for Oracle SQL.
First, after finding the previous timestamp, assign each row a self_parent_id: a row gets its own id when it has no previous timestamp or is more than 30 seconds after it, and NULL otherwise.
Then each row with a NULL self_parent_id takes the nearest preceding non-null self_parent_id, so that all rows of a cusip that lie within 30 seconds of each other fall under one group.
Since there is a CUSIP column, I assumed the dataset would be large market transaction data. Instead of using GROUP BY on the whole table, partition by CUSIP and the final parent_id for better performance.
SELECT
id,
sub.parent_id,
sub.cusip,
timestamp,
quantity,
sum(sub.quantity) OVER(
PARTITION BY cusip, parent_id
) sum_quantity,
MIN(sub.timestamp) OVER(
PARTITION BY cusip, parent_id
) min_timestamp
FROM
(
SELECT
base_sub.*,
CASE
WHEN base_sub.self_parent_id IS NOT NULL THEN
base_sub.self_parent_id
ELSE
LAG(base_sub.self_parent_id) IGNORE NULLS OVER(
PARTITION BY cusip
ORDER BY
timestamp, id
)
END parent_id
FROM
(
SELECT
c.*,
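CASE
-- assumes timestamp and previous_timestamp are TIMESTAMP columns, so their difference is an INTERVAL DAY TO SECOND
WHEN c.previous_timestamp IS NULL
OR EXTRACT(DAY FROM (c.timestamp - c.previous_timestamp)) * 86400
+ EXTRACT(HOUR FROM (c.timestamp - c.previous_timestamp)) * 3600
+ EXTRACT(MINUTE FROM (c.timestamp - c.previous_timestamp)) * 60
+ EXTRACT(SECOND FROM (c.timestamp - c.previous_timestamp)) > 30 THEN
id
ELSE
NULL
END self_parent_id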
FROM
(
SELECT
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
LAG(my_table.timestamp) OVER(
PARTITION BY my_table.cusip
ORDER BY
my_table.timestamp, my_table.id
) previous_timestamp
FROM
my_table
) c
) base_sub
) sub
The following may be helpful to you.
It groups rows into 30-second periods starting from a given time, here '2012-01-01 00:00:00'. DATEDIFF counts the number of seconds between the starting time and the timestamp value, which is then divided by 30 (integer division) to get the grouping column.
SELECT MIN(TimeColumn) AS TimeGroup, SUM(Quantity) AS TotalQuantity FROM YourTable
GROUP BY (DATEDIFF(ss, '2012-01-01', TimeColumn) / 30)
Here the minimum timestamp of each group is output as TimeGroup, but you could use the maximum instead, or convert the grouping column value back to a time for display.
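Note that fixed 30-second buckets behave differently from "within 30 seconds of each other". The short Python sketch below applies the divide-by-30 idea to the last two rows of the question, counting seconds from midnight instead of from 2012-01-01 and dropping the fractional seconds; both are simplifications made for this illustration.

```python
from collections import defaultdict


def seconds_since_midnight(hhmmss):
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (timestamp, quantity) for the last two BE0000314238 rows, 14 seconds apart
rows = [("16:46:21", 50), ("16:46:35", 50)]

buckets = defaultdict(int)
for ts, quantity in rows:
    bucket = seconds_since_midnight(ts) // 30  # same idea as DATEDIFF(...) / 30
    buckets[bucket] += quantity

totals = dict(buckets)
print(totals)
```

The two rows land in different buckets even though they are only 14 seconds apart, because a bucket boundary falls between them; the result the question asks for would sum them to 100.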
Looking at the above comments, I'm assuming Chris's first scenario is the one you want (all 3 get grouped even though values 1 and 3 are not within 30 seconds of each other, but are each within 30 seconds of value 2). I'm also going to assume that each row in your table has some unique ID called 'id'. You can do the following:
Create a new grouping, determining if the preceding row in your partition is more than 30 seconds behind the current row (e.g. determine if you need a new 30 second grouping, or to continue the previous). We'll call that parent_id.
Sum quantity over parent_id (plus any other aggregations)
The code could look like this
select
sub.parent_id,
sub.cusip,
min(sub.timestamp) min_timestamp,
sum(sub.quantity) quantity
from
(
select
base_sub.*,
case
when base_sub.self_parent_id is not null
then base_sub.self_parent_id
else lag(base_sub.self_parent_id) ignore nulls over (
partition by
base_sub.cusip
order by
base_sub.timestamp,
base_sub.id
)
end parent_id
from
(
select
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
lag(my_table.timestamp) over (
partition by
my_table.cusip
order by
my_table.timestamp,
my_table.id
) previous_timestamp,
case
when datediff(
second,
nvl(previous_timestamp, to_date('1900/01/01', 'yyyy/mm/dd')),
my_table.timestamp) > 30
then my_table.id
else null
end self_parent_id
from
my_table
) base_sub
) sub
group by
sub.parent_id,
sub.cusip
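The two steps described above (start a new group whenever the gap to the previous row of the same cusip exceeds 30 seconds, then aggregate per group) can be sketched in Python on the question's sample data. This is only a model of the intended result, not a translation of the SQL; the function name group_nearby and the max_gap parameter are hypothetical.

```python
from datetime import datetime, timedelta


def parse(text):
    # Time of day only; strptime fills in a default date
    return datetime.strptime(text, "%H:%M:%S.%f")


# (cusip, quantity, timestamp) rows from the question
rows = [
    ("BE0000310194", 100, parse("16:20:49.000")),
    ("BE0000314238", 50, parse("16:38:38.110")),
    ("BE0000314238", 50, parse("16:46:21.323")),
    ("BE0000314238", 50, parse("16:46:35.323")),
]


def group_nearby(rows, max_gap=timedelta(seconds=30)):
    groups = []     # each entry: [cusip, min_timestamp, total_quantity]
    last_seen = {}  # cusip -> (timestamp of its previous row, index of that row's group)
    for cusip, quantity, ts in sorted(rows, key=lambda r: (r[0], r[2])):
        prev = last_seen.get(cusip)
        if prev is not None and ts - prev[0] <= max_gap:
            index = prev[1]              # continue the current group
            groups[index][2] += quantity
        else:
            groups.append([cusip, ts, quantity])  # gap too large: start a new group
            index = len(groups) - 1
        last_seen[cusip] = (ts, index)
    return [tuple(group) for group in groups]


result = group_nearby(rows)
for cusip, min_timestamp, quantity in result:
    print(cusip, min_timestamp.time(), quantity)
```

Because each row is compared with the row immediately before it, a chain of rows that are each within 30 seconds of their neighbour stays in one group even when the first and last are further apart, which matches the first scenario assumed above.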