How do I do conditional logic between rows of a bigquery table? - sql

I'm trying to write a query that goes through a table row by row comparing the current row with the next. Then based on a condition being true will perform a calculation which is then output in a column on the same table and a null value if false.
Consider the example above:
Row 8703 will be referred to as Row 1
Row 8704 will be referred to as Row 2
I would like to, if possible, compare Row 1 bookedEnd with Row 2 bookedStart. If they are of equal value (which in this case they are) I would like to subtract Row 2 actualStartdate from Row 1 actualEnddate and output the value in minutes in a separate column named 'difference' on Row 2.
If they are not of equal value (which is true for all other rows in the example above) I would like to output a null value.
For the above table the extra column named difference would have the row values of:
8701 - Null
8702 - Null
8703 - Null
8704 - 12
8705 - Null

Since you are writing to "Row 2", I use the LAG() function so you are comparing on the row you are writing.
with data as (select * from `project.dataset.table`),
lagged as (
select
*,
lag(bookedEnd,1) over(partition by roomID order by Row asc) as prev_bookedEnd,
lag(actualEnddate,1) over(partition by roomID order by Row asc) as prev_actualEnddate
from data
)
select
* except (prev_bookedEnd,prev_actualEnddate),
case when prev_bookedEnd = bookedStart then timestamp_diff(prev_actualEndDate,actualStartdate, minute) else null end as difference
from lagged
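To make the LAG() approach concrete, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) stands in for BigQuery only so the example is self-contained. The table name bookings, the row_id column and the three sample rows are made up for illustration, and the strftime arithmetic is a substitute for BigQuery's TIMESTAMP_DIFF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (row_id INTEGER, roomID TEXT, bookedStart TEXT, "
    "bookedEnd TEXT, actualStartdate TEXT, actualEnddate TEXT)"
)
# Hypothetical sample rows: 8704 starts exactly when 8703 was booked to end.
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?, ?, ?, ?)",
    [
        (8703, "A", "2020-01-01 09:00:00", "2020-01-01 10:00:00",
         "2020-01-01 09:02:00", "2020-01-01 10:12:00"),
        (8704, "A", "2020-01-01 10:00:00", "2020-01-01 11:00:00",
         "2020-01-01 10:00:00", "2020-01-01 11:05:00"),
        (8705, "A", "2020-01-01 12:00:00", "2020-01-01 13:00:00",
         "2020-01-01 12:01:00", "2020-01-01 13:00:00"),
    ],
)

query = """
WITH lagged AS (
    SELECT *,
           LAG(bookedEnd)     OVER (PARTITION BY roomID ORDER BY row_id) AS prev_bookedEnd,
           LAG(actualEnddate) OVER (PARTITION BY roomID ORDER BY row_id) AS prev_actualEnddate
    FROM bookings
)
SELECT row_id,
       -- minutes between the previous row's actual end and this row's actual start,
       -- only when the previous booked end equals this booked start; NULL otherwise
       CASE WHEN prev_bookedEnd = bookedStart
            THEN (CAST(strftime('%s', prev_actualEnddate) AS INTEGER)
                  - CAST(strftime('%s', actualStartdate) AS INTEGER)) / 60
       END AS difference
FROM lagged
ORDER BY row_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The first row of each room has no previous row, so LAG() returns NULL, the comparison is not true, and the CASE falls through to NULL without any special handling.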

What you will want to do in this scenario is use the LEAD function:
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#lead
It would look similar to:
SELECT bookedEnd
, CASE WHEN bookedEnd = LEAD(bookedStart) OVER (PARTITION BY roomid ORDER BY Row) then XXXX END as actualStartdate
, CASE WHEN bookedEnd = LEAD(bookedStart) OVER (PARTITION BY roomid ORDER BY Row) then XXXX END as difference

SELECT
*,
IF( LAG(bookedEnd) OVER (PARTITION BY roomId ORDER BY bookedStart) = bookedStart,
TIMESTAMP_DIFF( actualStartdate,
LAG(actualEnddate) OVER (PARTITION BY roomId ORDER BY bookedStart),
MINUTE
),
NULL
) AS difference
FROM `project.dataset.table`

Related

Get value of same column from next row if current column value is null

I have a table and I want to select one column such that, if its record is not found (because I have joins with other tables) or exists but is null, the value of the same column is taken from the next row. I tried to use the ISNULL and COALESCE functions but I am unable to get the value of the next row.
Any help or link would be appreciated.
Here is my query so far
Select
(select top 1 OpenPrice from #tbltempData where Dated=D.Dated) [Open],
ISNULL((select top 1 ClosePrice from #tbltempData where Dated= DATEADD(hour,@Interval-1, D.Dated)),
(select top 1 ClosePrice from #tbltempData where Dated= DATEADD(hour,0, D.Dated))) [Close],
[Min],[Max],Dated
from #tbltempData2 D
Order BY Dated Asc
The Open column contains null values.
Details: my sample data has records for the date '28/06/2019', and the first record is at 9 am. I am grouping the data into 2-hour intervals, so after grouping, the first group for that date starts at 8 am. Since the sample data has no value for that time, I get null values. To avoid this, I want to get the OpenPrice value from the 9 am record of the same date, because that time falls in the same group.
If you want the "next row" to always be at or after the current time:
[Open] = (
select top 1 OpenPrice
from #tbltempData t
where DATEDIFF(day,t.Dated,D.Dated) = 0 -- make sure the price for same day
AND t.Dated>=D.Dated
ORDER BY t.Dated ASC
)
In case you want the "next row" to be the closest available time slot:
[Open] = (
select top 1 OpenPrice
from #tbltempData t
where DATEDIFF(day,t.Dated,D.Dated) = 0 -- make sure the price for same day
ORDER BY ABS(DATEDIFF(minute,t.Dated,D.Dated)) ASC
)
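As a rough, runnable illustration of the first variant (take the first price at or after the bucket time on the same day), here is a sketch using Python's built-in sqlite3 module. SQLite has no TOP, so LIMIT 1 is used instead. The table names prices and buckets and all sample values are assumptions made up for the example, not the asker's real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (Dated TEXT, OpenPrice REAL)")
conn.execute("CREATE TABLE buckets (Dated TEXT)")
# Hypothetical data: the first price of the day is at 09:00,
# but the 2-hour grouping produces a bucket that starts at 08:00.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [
        ("2019-06-28 09:00:00", 10.0),
        ("2019-06-28 10:00:00", 11.0),
        ("2019-06-28 11:00:00", 12.0),
    ],
)
conn.executemany(
    "INSERT INTO buckets VALUES (?)",
    [("2019-06-28 08:00:00",), ("2019-06-28 10:00:00",)],
)

query = """
SELECT b.Dated,
       (SELECT p.OpenPrice
        FROM prices p
        WHERE date(p.Dated) = date(b.Dated)   -- same day only
          AND p.Dated >= b.Dated              -- at or after the bucket start
        ORDER BY p.Dated
        LIMIT 1) AS OpenPrice
FROM buckets b
ORDER BY b.Dated
"""
open_prices = conn.execute(query).fetchall()
print(open_prices)
```

The 08:00 bucket has no exact match, so the subquery falls forward to the 09:00 price instead of returning NULL.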
I think a correlated subquery does what you want:
select d.*,
(select top (1) ClosePrice
from #tbltempData td
where td.Dated <= D.Dated
order by td.Dated desc
) as ClosePrice
from #tbltempData2 d
order by dated Asc

SQL Server iterating through time series data

I am using SQL Server and wondering if it is possible to iterate through time series data until specific condition is met and based on that label my data in other table?
For example, let's say I have a table like this:
Id Date Some_kind_of_event
+--+----------+------------------
1 |2018-01-01|dsdf...
1 |2018-01-06|sdfs...
1 |2018-01-29|fsdfs...
2 |2018-05-10|sdfs...
2 |2018-05-11|fgdf...
2 |2018-05-12|asda...
3 |2018-02-15|sgsd...
3 |2018-02-16|rgw...
3 |2018-02-17|sgs...
3 |2018-02-28|sgs...
What I want is to calculate, for each key, the difference between every two adjacent events and find out whether a difference of more than 10 days exists between any two adjacent events. If it does, I want to stop iterating for that specific key and put the label 'inactive', otherwise 'active', in my other table. After we finish with one key, we start with another.
So for example id = 1 would get the label 'inactive' because there exist two adjacent dates whose difference is bigger than 10 days. The final result would look like this:
Id Label
+--+----------+
1 |inactive
2 |active
3 |inactive
Any ideas how to do that? Is it possible to do it with SQL?
When working with a DBMS you need to get away from the idea of thinking iteratively. Instead you need to try and think in sets. "Instead of thinking about what you want to do to a row, think about what you want to do to a column."
If I understand correctly, is this what you're after?
CREATE TABLE SomeEvent (ID int, EventDate date, EventName varchar(10));
INSERT INTO SomeEvent
VALUES (1,'20180101','dsdf...'),
(1,'20180106','sdfs...'),
(1,'20180129','fsdfs..'),
(2,'20180510','sdfs...'),
(2,'20180511','fgdf...'),
(2,'20180512','asda...'),
(3,'20180215','sgsd...'),
(3,'20180216','rgw....'),
(3,'20180217','sgs....'),
(3,'20180228','sgs....');
GO
WITH Gaps AS(
SELECT *,
DATEDIFF(DAY,LAG(EventDate) OVER (PARTITION BY ID ORDER BY EventDate),EventDate) AS EventGap
FROM SomeEvent)
SELECT ID,
CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;
GO
DROP TABLE SomeEvent;
GO
This assumes you are using SQL Server 2012+, as it uses the LAG function, and SQL Server 2008 has less than 12 months of any kind of support.
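For anyone who wants to try the set-based idea without a SQL Server instance, here is a small sketch of the same LAG-then-MAX pattern using Python's sqlite3 module (SQLite 3.25+). It is an adaptation rather than the T-SQL above: julianday() subtraction stands in for DATEDIFF(DAY, ...), and the table name events is an assumption. The rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2018-01-01"), (1, "2018-01-06"), (1, "2018-01-29"),
        (2, "2018-05-10"), (2, "2018-05-11"), (2, "2018-05-12"),
        (3, "2018-02-15"), (3, "2018-02-16"), (3, "2018-02-17"), (3, "2018-02-28"),
    ],
)

query = """
WITH gaps AS (
    SELECT id,
           -- days since the previous event of the same id (NULL for the first event)
           julianday(event_date)
             - julianday(LAG(event_date) OVER (PARTITION BY id ORDER BY event_date)) AS gap
    FROM events
)
SELECT id,
       CASE WHEN MAX(gap) > 10 THEN 'inactive' ELSE 'active' END AS label
FROM gaps
GROUP BY id
ORDER BY id
"""
labels = conn.execute(query).fetchall()
print(labels)
```

There is no explicit "stop iterating" step: MAX(gap) looks at every gap for a key at once, which gives the same label as stopping at the first gap over 10 days.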
Try this. Note, replace #MyTable with your actual table.
WITH Diffs AS (
SELECT
Id
-- partition by Id so a gap is never measured across two different keys
,DATEDIFF(DAY,[Date],LEAD([Date]) OVER (PARTITION BY [Id] ORDER BY [Date])) Diff
FROM #MyTable)
SELECT
Id
,CASE WHEN MAX(Diff) > 10 THEN 'Inactive' ELSE 'Active' END AS Label
FROM Diffs
GROUP BY Id
Just to share another approach (without a CTE).
SELECT
ID
, CASE WHEN SUM(TotalDays) = (MAX(CNT) - 1) THEN 'Active' ELSE 'Inactive' END Label
FROM (
SELECT
ID
, EventDate
, CASE WHEN DATEDIFF(DAY, EventDate, LEAD(EventDate) OVER(PARTITION BY ID ORDER BY EventDate)) <= 10 THEN 1 ELSE 0 END TotalDays
, COUNT(ID) OVER(PARTITION BY ID) CNT
FROM EventsTable
) D
GROUP BY ID
The method counts how many records each ID has and derives TotalDays from the difference in days between the current date and the next date: if the difference is at most 10 days the row gets 1, otherwise 0. The last row of each ID has no next date, so it also gets 0.
Then it compares: if the sum of TotalDays equals the number of records for the ID minus one, the label is Active, otherwise Inactive.

TSQL syntax to feed results into subquery

I'm after some help on how best to write a query that does the following. I think I need a subquery but I don't know how to use the data returned in the row to feed back into the subquery without hardcoding values? A subquery may not be the right thing here?
Ideally I only want 1 variable ...WHERE t_Date = '2018-01-01'
Desired Output:
The COUNT Criteria column has the following rules
Date < current row
Area = current row
Name = current row
Value = 1
For example, the first row indicates there are 2 records with Date < '2018-01-01' AND Area = 'Area6' AND Name = 'Name1' AND Value = 1
Example Data:
SQLFiddle: http://sqlfiddle.com/#!18/92ba3/4
Effectively I only want to return the first 2 rows but summarise the historic data into a column based on the output in that column.
The right way to do this is to use the cumulative sum functionality in ANSI SQL and SQL Server since 2012:
select t.*,
sum(case when t.value = 1 then 1 else 0 end) over (partition by t_area, t_name order by t_date)
from t;
This actually includes the current row. If you have only one row per date (for the area/name combo), then you can just subtract it or use a windowing clause:
select t.*,
sum(case when t.value = 1 then 1 else 0 end) over
(partition by t_area, t_name
order by t_date
rows between unbounded preceding and 1 preceding
)
from t;
Use a self join to find records in the same table that are related to a particular record:
SELECT t1.t_Date, t1.t_Area, t1.t_Name, t1.t_Value,
COUNT(t2.t_Name) AS COUNTCriteria
FROM Table1 as t1
LEFT OUTER JOIN Table1 as t2
ON t1.t_Area=t2.t_Area
AND t1.t_Name=t2.T_Name
AND t2.t_Date<t1.t_Date
AND t2.t_Value=1
GROUP BY t1.t_Date, t1.t_Area, t1.t_Name, t1.t_Value
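Here is a hedged sketch comparing the two approaches above (windowed running sum and self join) on a small made-up data set, using Python's sqlite3 module (SQLite 3.25+). The table name t and all rows are hypothetical, since the original example data is not shown; only the column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (t_Date TEXT, t_Area TEXT, t_Name TEXT, t_Value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("2017-12-01", "Area6", "Name1", 1),
        ("2017-12-15", "Area6", "Name1", 1),
        ("2017-12-20", "Area6", "Name1", 0),
        ("2018-01-01", "Area6", "Name1", 1),
        ("2017-12-10", "Area7", "Name2", 1),
        ("2018-01-01", "Area7", "Name2", 0),
    ],
)

# Window version: count earlier rows with t_Value = 1, excluding the current row.
# The date filter sits OUTSIDE the subquery so the window still sees the history.
window_query = """
SELECT t_Area, t_Name, prior_count
FROM (
    SELECT t.*,
           COALESCE(SUM(CASE WHEN t_Value = 1 THEN 1 ELSE 0 END) OVER (
               PARTITION BY t_Area, t_Name
               ORDER BY t_Date
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ), 0) AS prior_count
    FROM t
) AS w
WHERE t_Date = '2018-01-01'
ORDER BY t_Area, t_Name
"""

# Self-join version: join each row to its earlier matching rows and count them.
join_query = """
SELECT t1.t_Area, t1.t_Name, COUNT(t2.t_Name) AS prior_count
FROM t AS t1
LEFT JOIN t AS t2
  ON t1.t_Area = t2.t_Area
 AND t1.t_Name = t2.t_Name
 AND t2.t_Date < t1.t_Date
 AND t2.t_Value = 1
WHERE t1.t_Date = '2018-01-01'
GROUP BY t1.t_Date, t1.t_Area, t1.t_Name, t1.t_Value
ORDER BY t1.t_Area, t1.t_Name
"""

window_rows = conn.execute(window_query).fetchall()
join_rows = conn.execute(join_query).fetchall()
print(window_rows, join_rows)
```

One thing to watch with the window version: if the WHERE t_Date = ... filter were placed inside the subquery, the window would only see the filtered rows and every count would come out as 0.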

SQL query group by nearby timestamp

I have a table with a timestamp column. I would like to be able to group by an identifier column (e.g. cusip), sum over another column (e.g. quantity), but only for rows that are within 30 seconds of each other, i.e. not in fixed 30 second bucket intervals. Given the data:
cusip| quantity| timestamp
============|=========|=============
BE0000310194| 100| 16:20:49.000
BE0000314238| 50| 16:38:38.110
BE0000314238| 50| 16:46:21.323
BE0000314238| 50| 16:46:35.323
I would like to write a query that returns:
cusip| quantity
============|=========
BE0000310194| 100
BE0000314238| 50
BE0000314238| 100
Edit:
In addition, it would greatly simplify things if I could also get the MIN(timestamp) out of the query.
Starting from Sean G's solution, I have removed the GROUP BY over the complete table and readjusted a few parts for Oracle SQL.
First, after finding the previous timestamp, assign a self parent id: a row gets its own id when it has no previous timestamp or is more than 30 seconds after it, and NULL otherwise.
Then each row without a self parent id takes the nearest preceding non-null self parent id, so that all rows within 30 seconds of their predecessor fall under one group.
Since there is a CUSIP column, I assumed the dataset would be large market transaction data. Instead of using GROUP BY on the complete table, the query partitions by CUSIP and the final parent id for better performance.
SELECT
id,
sub.parent_id,
sub.cusip,
timestamp,
quantity,
sum(sub.quantity) OVER(
PARTITION BY cusip, parent_id
) sum_quantity,
MIN(sub.timestamp) OVER(
PARTITION BY cusip, parent_id
) min_timestamp
FROM
(
SELECT
base_sub.*,
CASE
WHEN base_sub.self_parent_id IS NOT NULL THEN
base_sub.self_parent_id
ELSE
LAG(base_sub.self_parent_id) IGNORE NULLS OVER(
PARTITION BY cusip
ORDER BY
timestamp, id
)
END parent_id
FROM
(
SELECT
c.*,
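-- assumes timestamp is a TIMESTAMP column; EXTRACT(SECOND ...) alone returns only the
-- seconds field of the interval, so the gap is converted to total seconds here
CASE
WHEN previous_timestamp IS NULL
OR EXTRACT(DAY FROM (timestamp - previous_timestamp)) * 86400
+ EXTRACT(HOUR FROM (timestamp - previous_timestamp)) * 3600
+ EXTRACT(MINUTE FROM (timestamp - previous_timestamp)) * 60
+ EXTRACT(SECOND FROM (timestamp - previous_timestamp)) > 30 THEN
id
ELSE
NULL
END self_parent_id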
FROM
(
SELECT
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
LAG(my_table.timestamp) OVER(
PARTITION BY my_table.cusip
ORDER BY
my_table.timestamp, my_table.id
) previous_timestamp
FROM
my_table
) c
) base_sub
) sub
The following may be helpful to you.
It groups rows into 30-second periods starting from a given time, here '2012-01-01 00:00:00'. DATEDIFF counts the number of seconds between the starting time and the timestamp value, which is then divided by 30 to get the grouping column. Note that these are fixed 30-second buckets.
SELECT MIN(TimeColumn) AS TimeGroup, SUM(Quantity) AS TotalQuantity FROM YourTable
GROUP BY (DATEDIFF(ss, '2012-01-01', TimeColumn) / 30)
Here the minimum timestamp of each group is output as TimeGroup, but you can use the maximum instead, or even convert the grouping column value back to a time for display.
Looking at the above comments, I'm assuming Chris's first scenario is the one you want (all 3 get grouped even though values 1 and 3 are not within 30 seconds of each other, but are each within 30 seconds of value 2). I'm also going to assume that each row in your table has some unique ID called 'id'. You can do the following:
Create a new grouping by determining whether the preceding row in your partition is more than 30 seconds behind the current row (i.e. determine whether you need to start a new 30-second grouping or continue the previous one). We'll call that parent_id.
Sum quantity over parent_id (plus any other aggregations).
The code could look like this:
select
sub.parent_id,
sub.cusip,
min(sub.timestamp) min_timestamp,
sum(sub.quantity) quantity
from
(
select
base_sub.*,
case
when base_sub.self_parent_id is not null
then base_sub.self_parent_id
else lag(base_sub.self_parent_id) ignore nulls over (
partition by
base_sub.cusip
order by
base_sub.timestamp,
base_sub.id
)
end parent_id
from
(
select
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
lag(my_table.timestamp) over (
partition by
my_table.cusip
order by
my_table.timestamp,
my_table.id
) previous_timestamp,
case
when datediff(
second,
nvl(previous_timestamp, to_date('1900/01/01', 'yyyy/mm/dd')),
my_table.timestamp) > 30
then my_table.id
else null
end self_parent_id
from
my_table
) base_sub
) sub
group by
sub.parent_id,
sub.cusip
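For a quick way to experiment with the grouping idea, here is a sketch using Python's sqlite3 module (SQLite 3.25+) and the four rows from the question. SQLite has no LAG ... IGNORE NULLS, so this swaps in a different but related technique: flag each row that starts a new group, then take a running sum of the flags as the group number. The table name trades and the column name ts are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (cusip TEXT, quantity INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [
        ("BE0000310194", 100, "16:20:49.000"),
        ("BE0000314238", 50, "16:38:38.110"),
        ("BE0000314238", 50, "16:46:21.323"),
        ("BE0000314238", 50, "16:46:35.323"),
    ],
)

query = """
WITH flagged AS (
    SELECT cusip, quantity, ts,
           -- 0 when the previous row of the same cusip is at most 30 seconds earlier,
           -- 1 otherwise (including the first row, where LAG() is NULL)
           CASE WHEN (julianday(ts)
                      - julianday(LAG(ts) OVER (PARTITION BY cusip ORDER BY ts)))
                     * 86400.0 <= 30
                THEN 0 ELSE 1 END AS is_new_group
    FROM trades
),
grouped AS (
    SELECT *,
           -- running sum of the flags numbers the groups within each cusip
           SUM(is_new_group) OVER (PARTITION BY cusip ORDER BY ts) AS grp
    FROM flagged
)
SELECT cusip, MIN(ts) AS min_ts, SUM(quantity) AS quantity
FROM grouped
GROUP BY cusip, grp
ORDER BY cusip, min_ts
"""
groups = conn.execute(query).fetchall()
print(groups)
```

Like the query above, this chains rows together: a row joins the current group whenever it is within 30 seconds of the row before it, so a group can span more than 30 seconds in total.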

SQL if breaking number pattern, mark record?

I have the following query:
SELECT AccountNumber, RptPeriod
FROM dbo.Report
ORDER BY AccountNumber, RptPeriod
I get the following results:
123 200801
123 200802
123 200803
234 200801
344 200801
344 200803
I need to mark the records where the RptPeriod doesn't run consecutively for the account. For example, 344 200803 would have an X next to it since it goes from 200801 to 200803.
This is for about 19321 rows and I want it on a per-company basis: between different companies I don't care what the numbers are, I just want to show where there are breaks in the number pattern within the same company.
Any Ideas??
Thanks!
OK, this is kind of ugly (double join + anti-join) but it gets the work done, AND is pure portable SQL:
SELECT *
FROM dbo.Report R1
, dbo.Report R2
WHERE R1.AccountNumber = R2.AccountNumber
AND R2.RptPeriod - R1.RptPeriod > 1
-- subsequent NOT EXISTS ensures that R1,R2 rows found are "next to each other",
-- e.g. no row exists between them in the ordering above
AND NOT EXISTS
(SELECT 1 FROM dbo.Report R3
WHERE R1.AccountNumber = R3.AccountNumber
AND R2.AccountNumber = R3.AccountNumber
AND R1.RptPeriod < R3.RptPeriod
AND R3.RptPeriod < R2.RptPeriod
)
Something like this should do it:
-- cte lists all items by AccountNumber and RptPeriod, assigning an ascending integer
-- to each RptPeriod and restarting at 1 for each new AccountNumber
;WITH cte (AccountNumber, RptPeriod, Ranking)
as (select
AccountNumber
,RptPeriod
,row_number() over (partition by AccountNumber order by AccountNumber, RptPeriod) Ranking
from dbo.Report)
-- and then we join each row with each preceding row based on that "Ranking" number
select
This.AccountNumber
,This.RptPeriod
,case
when Prior.RptPeriod is null then '' -- Catches the first row in a set
when Prior.RptPeriod = This.RptPeriod - 1 then '' -- Preceding row's RptPeriod is one less than this row's RptPeriod
else 'x' -- Preceding row's RptPeriod is not exactly one less than this row's RptPeriod
end UhOh
from cte This
left outer join cte Prior
on Prior.AccountNumber = This.AccountNumber
and Prior.Ranking = This.Ranking - 1
(Edited to add comments)
WITH T
AS (SELECT *,
/*Each island of contiguous data will have
a unique AccountNumber,Grp combination*/
RptPeriod - ROW_NUMBER() OVER (PARTITION BY AccountNumber
ORDER BY RptPeriod ) Grp,
/*RowNumber will be used to identify first record
per company, this should not be given an 'X'. */
ROW_NUMBER() OVER (PARTITION BY AccountNumber
ORDER BY RptPeriod ) AS RN
FROM Report)
SELECT AccountNumber,
RptPeriod,
/*Check whether first in group but not first over all*/
CASE
WHEN ROW_NUMBER() OVER (PARTITION BY AccountNumber, Grp
ORDER BY RptPeriod) = 1
AND RN > 1 THEN 'X'
END AS Flag
FROM T
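The "islands" trick above can be checked quickly with the data from the question. This is a sketch using Python's sqlite3 module (SQLite 3.25+) rather than SQL Server; the table name report and the integer column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (AccountNumber INTEGER, RptPeriod INTEGER)")
conn.executemany(
    "INSERT INTO report VALUES (?, ?)",
    [
        (123, 200801), (123, 200802), (123, 200803),
        (234, 200801),
        (344, 200801), (344, 200803),
    ],
)

query = """
WITH t AS (
    SELECT AccountNumber,
           RptPeriod,
           -- consecutive periods share the same (RptPeriod - row number) value
           RptPeriod - ROW_NUMBER() OVER (PARTITION BY AccountNumber
                                          ORDER BY RptPeriod) AS grp,
           ROW_NUMBER() OVER (PARTITION BY AccountNumber
                              ORDER BY RptPeriod) AS rn
    FROM report
)
SELECT AccountNumber,
       RptPeriod,
       -- first row of an island, but not the first row of the account
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY AccountNumber, grp
                                    ORDER BY RptPeriod) = 1
                 AND rn > 1 THEN 'X'
       END AS flag
FROM t
ORDER BY AccountNumber, RptPeriod
"""
flags = conn.execute(query).fetchall()
print(flags)
```

Because the periods are treated as plain integers, a step from 200812 to 200901 would also be flagged; handling the year rollover needs extra logic.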
SELECT DISTINCT r.*
FROM report r
LEFT JOIN report r2
ON r2.accountnumber = r.accountnumber
AND r2.rptperiod = r.rptperiod + 1 -- r2 is the period immediately after r; assumes integer periods, ignores year rollover
JOIN report r3
ON r3.accountNumber = r.accountNumber
AND r3.rptperiod > r.rptPeriod
WHERE r2.rptPeriod IS NULL
I'm not sure of SQL Server's date logic syntax, so the "immediately after" condition is written as plain integer arithmetic, but hopefully you get the idea. The result is every record r that has no immediately following rptPeriod (r2 is NULL) while at least one greater rptPeriod exists (r3), i.e. the last record before each gap. The query isn't super straightforward I guess, but if you have an index on the two columns, it'll probably be the most efficient way to get your data.
Basically, you number rows within every account, then, using the row numbers, compare the RptPeriod values for the neighbouring rows.
It is assumed here that RptPeriod is the year and month encoded, for which case the year transition check has been added.
;WITH Report_sorted AS (
SELECT
AccountNumber,
RptPeriod,
rownum = ROW_NUMBER() OVER (PARTITION BY AccountNumber ORDER BY RptPeriod)
FROM dbo.Report
)
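SELECT
r1.AccountNumber,
r1.RptPeriod,
-- r2 is the preceding row; a December -> January step raises the yyyymm value by 89,
-- so subtract 88 to normalise it to 1
CASE ISNULL(CASE WHEN r1.RptPeriod / 100 > r2.RptPeriod / 100 THEN -88 ELSE 0 END
+ r1.RptPeriod - r2.RptPeriod, 1)
WHEN 1 THEN ''
ELSE 'X'
END AS Chk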
FROM Report_sorted r1
LEFT JOIN Report_sorted r2
ON r1.AccountNumber = r2.AccountNumber AND r1.rownum = r2.rownum + 1
It could be complicated further with an additional check for gaps spanning a year and more, if you need that.