Postgres SQL - Prioritize and delineate project status timeline

I'm attempting to create a clear, linear status history for a project. I've been able to take the historical project date details and clear out any unnecessary overlap in status dates, but now I need to prioritize and delineate the general status of a project. My data is structured as below, thanks to this fiddle:
Project #   hist_status   status_start_date   status_end_date
A74308      In Progress   11/17/2020          6/8/2021
A74308      Pause         4/2/2021            6/21/2021
A74308      Completed     6/8/2021            6/8/2021
A74308      In Progress   6/21/2021           9/20/2021
A74308      Pause         9/20/2021           1/30/2022
A74308      In Progress   1/30/2022           2/8/2023
A74308      Completed     4/5/2022            4/5/2022
A74308      Completed     8/16/2022           8/16/2022
A74308      Pause         8/16/2022           2/8/2023
A project will have multiple workstreams but as long as one of them is "In Progress" then the project is "In Progress." Similarly, if one workstream is "Paused" and the others are "Completed" then the Project is "Paused." If all workstreams are "Completed" then the project is "Completed." Just to nail it down, if one workstream is "In Progress" and another is "Paused" with the final being "Completed" then the project is "In Progress."
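One way to encode that priority (my sketch, not from the original post; the exact rank mapping is an assumption, though the post's final CTE appears to carry a similar Status_Rank) is a number where a smaller value wins, so the project-level status per day is simply MIN(status_rank):

SELECT hist_status,
       CASE hist_status
            WHEN 'In Progress' THEN 1   -- highest priority
            WHEN 'Pause'       THEN 2
            WHEN 'Completed'   THEN 3   -- lowest priority
       END AS status_rank
FROM final;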
So, the result I'm trying to get is below:
Project #   hist_status   status_start_date   status_end_date
A74308      In Progress   11/17/2020          6/8/2021
A74308      Pause         6/9/2021            6/20/2021
A74308      In Progress   6/21/2021           9/20/2021
A74308      Pause         9/21/2021           1/29/2022
A74308      In Progress   1/30/2022           2/8/2023
This post seems to accomplish what I'm trying to do but I haven't been able to replicate the results in Postgres. I've attempted using subqueries and essentially a DATEDIFF to delineate, for instance, where the first "Pause" should begin, but can't nail down a solution.

I may have overfitted this solution, but it works for this example. Edit: I did indeed overfit the solution, but only to an individual project #; I had to adjust it when I opened the query up to all projects.
-- Snippet: assumes a preceding WITH ... final AS ( ... ) CTE. Note that
-- "project_#" must be double-quoted in Postgres, since # is not valid in an
-- unquoted identifier.
F AS (
    SELECT "project_#",
           row_number() OVER (PARTITION BY "project_#" ORDER BY Status_Start_Date) AS P_ID,
           CASE WHEN Status_Rank = 1 THEN Status_End_Date
                WHEN Status_End_Date < LAG(Status_End_Date - INTERVAL '1 day', 1) OVER (PARTITION BY "project_#" ORDER BY Status_Start_Date) THEN NULL
                WHEN LAG(Status_End_Date - INTERVAL '1 day', 1) OVER (PARTITION BY "project_#" ORDER BY Status_Start_Date) < LAG(Status_End_Date - INTERVAL '1 day', 2) OVER (PARTITION BY "project_#" ORDER BY Status_Start_Date) THEN NULL
                ELSE Status_End_Date END AS FLAG
    FROM final
), PS AS (
    SELECT fi."project_#", Hist_Status, Status_Rank, fi.P_ID,
           CASE WHEN LAG(Status_End_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) IS NULL THEN Status_Start_Date
                WHEN Status_Rank = 1 AND LAG(Status_Rank, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) = 2 THEN LAG(Status_End_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date)::DATE
                WHEN Status_Rank = 1 AND LAG(Status_Rank, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) = 3 THEN Status_Start_Date
                WHEN Status_End_Date < LAG(Status_End_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) THEN NULL
                WHEN LAG(Status_End_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) < LAG(Status_End_Date + INTERVAL '1 day', 2) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) THEN NULL
                WHEN Status_End_Date = LAG(Status_End_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) AND Status_Start_Date > LAG(Status_Start_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) THEN NULL
                ELSE LAG(Status_End_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date)::DATE END AS Status_Start_Date,
           CASE WHEN Status_End_Date < LAG(Status_End_Date - INTERVAL '1 day', 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) THEN NULL
                WHEN Status_Rank = 3 THEN Status_End_Date
                WHEN Status_End_Date = LAG(Status_End_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) AND Status_Start_Date > LAG(Status_Start_Date, 1) OVER (PARTITION BY fi."project_#" ORDER BY Status_Start_Date) THEN NULL
                ELSE (Status_End_Date - INTERVAL '1 day')::DATE END AS Status_End_Date
    FROM final fi
    LEFT JOIN F
      ON fi."project_#" = F."project_#"
     AND fi.P_ID = F.P_ID
    WHERE F.FLAG IS NOT NULL
)
SELECT "project_#", Hist_Status, Status_Start_Date, Status_End_Date,
       DATE_TRUNC('WEEK', Status_Start_Date)::DATE AS Week_Start_Date,
       DATE_TRUNC('WEEK', Status_End_Date)::DATE AS Week_End_Date
FROM PS
WHERE Status_Start_Date IS NOT NULL
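For comparison, a shorter pattern I'd sketch for this kind of problem (not the poster's method; table and column names are taken from the question, and the priority CASE is the assumption described above): expand each interval into days with generate_series, keep the highest-priority status per day, then collapse consecutive days with the usual gaps-and-islands trick.

WITH days AS (   -- one row per project, day and status
    SELECT "project_#", hist_status,
           generate_series(status_start_date, status_end_date, '1 day')::date AS d,
           CASE hist_status WHEN 'In Progress' THEN 1
                            WHEN 'Pause'       THEN 2
                            WHEN 'Completed'   THEN 3 END AS rnk
    FROM final
), daily AS (    -- the highest-priority status wins each day
    SELECT DISTINCT ON ("project_#", d) "project_#", d, hist_status
    FROM days
    ORDER BY "project_#", d, rnk
), chg AS (      -- flag days on which the status changes
    SELECT *, CASE WHEN hist_status IS DISTINCT FROM
                        LAG(hist_status) OVER (PARTITION BY "project_#" ORDER BY d)
                   THEN 1 ELSE 0 END AS is_new
    FROM daily
), grp AS (      -- running sum of flags = island id
    SELECT *, SUM(is_new) OVER (PARTITION BY "project_#" ORDER BY d) AS g
    FROM chg
)
SELECT "project_#", hist_status,
       MIN(d) AS status_start_date, MAX(d) AS status_end_date
FROM grp
GROUP BY "project_#", hist_status, g
ORDER BY "project_#", status_start_date;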


Analyse each identifier individually in single query on PostgreSQL

Description
I have a PostgreSQL table that looks like this:
identifier         state  card_presence  progress  timestamp
V000000000000123   0      true           1000      2022-12-01 12:45:02
V000000000000123   2      true           1022      2022-12-01 12:45:03
V000000000000123   3      true           1024      2022-12-01 12:48:03
V000000000000124   2      true           974       2022-12-01 12:43:00
V000000000000124   6      true           982       2022-12-01 12:55:00
I have to analyze this data quite frequently, at roughly 60-second intervals. The first stage of analysis is a complex query which processes the data in multiple steps. At the moment I execute the query for each identifier individually.
Basically, the query does roughly what is described in: Time intervals analysis in BigQuery.
The query looks like:
with real_data as (
SELECT
(CASE WHEN card_presence != false THEN state ELSE -1 END) as state,
progress,
lead(timestamp) over(order by timestamp) - interval '1 second' as next_timestamp,
timestamp
FROM telemetry_tacho
WHERE driver_identifier = 'V100000165676000' AND state IS NOT NULL AND timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 weeks'
), sample_by_second as (
SELECT
state,
progress,
ts,
date_trunc('minute', ts) ts_minute
FROM
real_data,
generate_series(
timestamp,
coalesce(
next_timestamp,
date_trunc('minute', timestamp + interval '60 seconds')
),
interval '1 second'
) ts
), sample_by_second_with_weight as (
SELECT
state,
MIN(progress) as min_progress,
MAX(progress) as max_progress,
ts_minute,
count(*) weight
FROM sample_by_second
GROUP BY state, ts_minute
), sample_by_minute as (
SELECT
ts_minute,
(array_agg(state ORDER BY weight DESC))[1] as state,
MIN(min_progress) as min_progress,
MAX(max_progress) as max_progress
FROM sample_by_second_with_weight
GROUP BY ts_minute
), add_previous_state as (
SELECT
ts_minute,
state,
min_progress,
max_progress,
lag(state) OVER (ORDER BY ts_minute) as prev_state
FROM sample_by_minute
), add_group_indication as (
SELECT
ts_minute,
state,
min_progress,
max_progress,
SUM(CASE
WHEN state = 0 AND prev_state = -1 THEN 0
WHEN state = -1 AND prev_state = 0 THEN 0
WHEN state != prev_state THEN 1
ELSE 0
END) over (order by ts_minute) as group_id
FROM add_previous_state
), computed as (
select
group_id,
min(ts_minute) as ts_minute_min,
max(ts_minute) as ts_minute_max,
min(state) as state,
MIN(min_progress) as min_progress,
MAX(max_progress) as max_progress,
min(ts_minute) as start_timestamp,
max(ts_minute) + interval '1 minute' end_timestamp,
60 * count(*) as duration
from add_group_indication
group by group_id
), include_surrounding_states as (
select
*,
lag(state) over(order by start_timestamp) prev_state,
lead(state) over(order by start_timestamp) next_state
from computed
), filter_out_invalid_states as (
select
state,
min_progress,
max_progress,
start_timestamp,
end_timestamp,
lag(state) over(order by start_timestamp) prev_state,
lead(state) over(order by start_timestamp) next_state
from include_surrounding_states
where not (state = 2 AND prev_state = 3 AND next_state = 3 AND duration = 60)
), recalculate_group_id as (
select
SUM(CASE WHEN state != prev_state THEN 1 ELSE 0 END) over (order by start_timestamp) as group_id,
state,
min_progress,
max_progress,
COALESCE(start_timestamp, CURRENT_TIMESTAMP - INTERVAL '2 weeks') as start_timestamp, -- Add period start timestamp for the first entry
COALESCE(end_timestamp, CURRENT_TIMESTAMP) as end_timestamp
from filter_out_invalid_states
), final_data as (
SELECT
MAX(state) AS state,
MIN(min_progress) AS min_progress,
MAX(max_progress) AS max_progress,
MAX(max_progress) - MIN(min_progress) AS progress_diff,
EXTRACT('epoch' FROM min(start_timestamp))::integer AS start_timestamp,
EXTRACT('epoch' FROM max(end_timestamp))::integer AS end_timestamp,
EXTRACT('epoch' FROM (max(end_timestamp) - min(start_timestamp))::interval)::integer AS duration
FROM recalculate_group_id
GROUP BY group_id
ORDER BY start_timestamp ASC
)
select * from final_data;
Sample data
Input
"identifier","card_presence","state","progress","timestamp"
"0000000000000123",TRUE,0,100000,"2022-12-01 00:00:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 10:00:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-01 10:05:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 15:00:02+00"
"0000000000000123",TRUE,3,100000,"2022-12-01 15:45:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 20:15:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 20:15:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 05:14:45+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 05:15:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 05:15:01+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 06:10:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 07:11:20+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 07:11:28+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 07:13:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 08:01:06+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 08:30:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 08:30:10+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 09:45:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 10:30:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 15:00:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 15:45:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 16:45:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 01:45:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 02:25:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 05:18:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-03 06:15:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 07:00:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-03 11:30:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 12:15:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-03 13:15:00+00"
Output
"state","min_progress","max_progress","progress_diff","start_timestamp","end_timestamp","duration"
0,100000,100000,0,1669852800,1669889100,36300
3,100000,100000,0,1669889100,1669906800,17700
0,100000,100000,0,1669906800,1669909500,2700
3,100000,100000,0,1669909500,1669925700,16200
0,100000,100000,0,1669925700,1669958100,32400
3,100000,100000,0,1669958100,1669974300,16200
0,100000,100000,0,1669974300,1669977000,2700
3,100000,100000,0,1669977000,1669993200,16200
0,100000,100000,0,1669993200,1669995900,2700
3,100000,100000,0,1669995900,1669999500,3600
0,100000,100000,0,1669999500,1670031900,32400
3,100000,100000,0,1670031900,1670048100,16200
0,100000,100000,0,1670048100,1670050800,2700
3,100000,100000,0,1670050800,1670067000,16200
0,100000,100000,0,1670067000,1670069700,2700
3,100000,100000,0,1670069700,1670073300,3600
0,100000,100000,0,1670073300,1670073420,120
Question
The query takes some time to run for each device, and constantly querying and analysing the data for each identifier separately is quite time-consuming. So I thought it might be possible to pre-process the data for all devices periodically and store the analysed results in a separate table or materialized view.
Running the query periodically and saving the results to a separate table or materialized view isn't that hard, but is it possible to do it for all identifier values that exist in the table at once?
I believe the query could be updated to do that, but I fail to grasp how.
Without delving into your analysis logic, I may suggest this:
extract the list of distinct driver_identifier-s, or have it stored in a materialized view too;
select from this list and LATERAL join it with your query.
Your query needs a small change too: replace driver_identifier = 'V100000165676000' with driver_identifier = dil.drid to correlate it with the identifier list.
with driver_identifier_list(drid) as
(
select distinct driver_identifier from telemetry_tacho
)
select l.*
from driver_identifier_list as dil
cross join lateral
(
-- your query (where driver_identifier = dil.drid) here
) as l;
Effectively this is a loop that runs your query for every driver_identifier value. However, the view(s) would have to be refreshed on every telemetry_tacho mutation, which makes the effectiveness of the materialized-view approach questionable.
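To make the periodic pre-processing concrete, a minimal sketch (my wiring, not part of the original answer; telemetry_analysis is a hypothetical name):

-- Materialize the per-identifier analysis; the inner query is the pattern
-- above with the full analysis query substituted for the comment.
CREATE MATERIALIZED VIEW telemetry_analysis AS
WITH driver_identifier_list(drid) AS (
    SELECT DISTINCT driver_identifier FROM telemetry_tacho
)
SELECT dil.drid, l.*
FROM driver_identifier_list AS dil
CROSS JOIN LATERAL (
    -- your query (where driver_identifier = dil.drid) here
) AS l;

-- Refresh on a schedule (e.g. every 60 s via cron or pg_cron); CONCURRENTLY
-- avoids blocking readers but requires a unique index on the view.
REFRESH MATERIALIZED VIEW CONCURRENTLY telemetry_analysis;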

DBT - use DBT modeling to insert rows in a table like date dimension table in Azure Synapse

I need a reference for inserting rows into a table using dbt models. A sample example to consider is a date dimension table, where we want to insert rows for future years.
dbt is built to handle the inserts for you, since it generally works as a transformation layer on data already in your warehouse.
As an example of how to build a date dimension table, the GitLab data team has a public repo which includes an example of how to build one using the dbt-utils package's date-spine macro.
The simplest version would just be:
date_dim.sql
WITH date_spine AS (
{{ dbt_utils.date_spine(
start_date="to_date('01/01/2000', 'mm/dd/yyyy')",
datepart="day",
end_date="to_date('12/01/2050', 'mm/dd/yyyy')"
)
}}
)
select * from date_spine
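If the goal is specifically for dbt to create and insert the rows as a physical table, a config line at the top of the model should be all that's needed (a minimal sketch; it assumes the dbt-utils package is installed via packages.yml):

-- date_dim.sql, materialized as a table rather than the default view
{{ config(materialized='table') }}

WITH date_spine AS (
    {{ dbt_utils.date_spine(
        start_date="to_date('01/01/2000', 'mm/dd/yyyy')",
        datepart="day",
        end_date="to_date('12/01/2050', 'mm/dd/yyyy')"
    ) }}
)
select * from date_spine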
And the link to the gitlab example:
date_details_source.sql
WITH date_spine AS (
{{ dbt_utils.date_spine(
start_date="to_date('11/01/2009', 'mm/dd/yyyy')",
datepart="day",
end_date="dateadd(year, 40, current_date)"
)
}}
), calculated as (
SELECT
date_day,
date_day AS date_actual,
DAYNAME(date_day) AS day_name,
DATE_PART('month', date_day) AS month_actual,
DATE_PART('year', date_day) AS year_actual,
DATE_PART(quarter, date_day) AS quarter_actual,
DATE_PART(dayofweek, date_day) + 1 AS day_of_week,
CASE WHEN day_name = 'Sun' THEN date_day
ELSE DATEADD('day', -1, DATE_TRUNC('week', date_day)) END AS first_day_of_week,
CASE WHEN day_name = 'Sun' THEN WEEK(date_day) + 1
ELSE WEEK(date_day) END AS week_of_year_temp, --remove this column
CASE WHEN day_name = 'Sun' AND LEAD(week_of_year_temp) OVER (ORDER BY date_day) = '1'
THEN '1'
ELSE week_of_year_temp END AS week_of_year,
DATE_PART('day', date_day) AS day_of_month,
ROW_NUMBER() OVER (PARTITION BY year_actual, quarter_actual ORDER BY date_day) AS day_of_quarter,
ROW_NUMBER() OVER (PARTITION BY year_actual ORDER BY date_day) AS day_of_year,
CASE WHEN month_actual < 2
THEN year_actual
ELSE (year_actual+1) END AS fiscal_year,
CASE WHEN month_actual < 2 THEN '4'
WHEN month_actual < 5 THEN '1'
WHEN month_actual < 8 THEN '2'
WHEN month_actual < 11 THEN '3'
ELSE '4' END AS fiscal_quarter,
ROW_NUMBER() OVER (PARTITION BY fiscal_year, fiscal_quarter ORDER BY date_day) AS day_of_fiscal_quarter,
ROW_NUMBER() OVER (PARTITION BY fiscal_year ORDER BY date_day) AS day_of_fiscal_year,
TO_CHAR(date_day, 'MMMM') AS month_name,
TRUNC(date_day, 'Month') AS first_day_of_month,
LAST_VALUE(date_day) OVER (PARTITION BY year_actual, month_actual ORDER BY date_day) AS last_day_of_month,
FIRST_VALUE(date_day) OVER (PARTITION BY year_actual ORDER BY date_day) AS first_day_of_year,
LAST_VALUE(date_day) OVER (PARTITION BY year_actual ORDER BY date_day) AS last_day_of_year,
FIRST_VALUE(date_day) OVER (PARTITION BY year_actual, quarter_actual ORDER BY date_day) AS first_day_of_quarter,
LAST_VALUE(date_day) OVER (PARTITION BY year_actual, quarter_actual ORDER BY date_day) AS last_day_of_quarter,
FIRST_VALUE(date_day) OVER (PARTITION BY fiscal_year, fiscal_quarter ORDER BY date_day) AS first_day_of_fiscal_quarter,
LAST_VALUE(date_day) OVER (PARTITION BY fiscal_year, fiscal_quarter ORDER BY date_day) AS last_day_of_fiscal_quarter,
FIRST_VALUE(date_day) OVER (PARTITION BY fiscal_year ORDER BY date_day) AS first_day_of_fiscal_year,
LAST_VALUE(date_day) OVER (PARTITION BY fiscal_year ORDER BY date_day) AS last_day_of_fiscal_year,
DATEDIFF('week', first_day_of_fiscal_year, date_actual) +1 AS week_of_fiscal_year,
CASE WHEN EXTRACT('month', date_day) = 1 THEN 12
ELSE EXTRACT('month', date_day) - 1 END AS month_of_fiscal_year,
LAST_VALUE(date_day) OVER (PARTITION BY first_day_of_week ORDER BY date_day) AS last_day_of_week,
(year_actual || '-Q' || EXTRACT(QUARTER FROM date_day)) AS quarter_name,
(fiscal_year || '-' || DECODE(fiscal_quarter,
1, 'Q1',
2, 'Q2',
3, 'Q3',
4, 'Q4')) AS fiscal_quarter_name,
('FY' || SUBSTR(fiscal_quarter_name, 3, 7)) AS fiscal_quarter_name_fy,
DENSE_RANK() OVER (ORDER BY fiscal_quarter_name) AS fiscal_quarter_number_absolute,
fiscal_year || '-' || MONTHNAME(date_day) AS fiscal_month_name,
('FY' || SUBSTR(fiscal_month_name, 3, 8)) AS fiscal_month_name_fy,
(CASE WHEN MONTH(date_day) = 1 AND DAYOFMONTH(date_day) = 1 THEN 'New Year''s Day'
WHEN MONTH(date_day) = 12 AND DAYOFMONTH(date_day) = 25 THEN 'Christmas Day'
WHEN MONTH(date_day) = 12 AND DAYOFMONTH(date_day) = 26 THEN 'Boxing Day'
ELSE NULL END)::VARCHAR AS holiday_desc,
(CASE WHEN HOLIDAY_DESC IS NULL THEN 0
ELSE 1 END)::BOOLEAN AS is_holiday,
DATE_TRUNC('month', last_day_of_fiscal_quarter) AS last_month_of_fiscal_quarter,
IFF(DATE_TRUNC('month', last_day_of_fiscal_quarter) = date_actual, TRUE, FALSE) AS is_first_day_of_last_month_of_fiscal_quarter,
DATE_TRUNC('month', last_day_of_fiscal_year) AS last_month_of_fiscal_year,
IFF(DATE_TRUNC('month', last_day_of_fiscal_year) = date_actual, TRUE, FALSE) AS is_first_day_of_last_month_of_fiscal_year,
DATEADD('day',7,DATEADD('month',1,first_day_of_month)) AS snapshot_date_fpa,
DATEADD('day',44,DATEADD('month',1,first_day_of_month)) AS snapshot_date_billings
FROM date_spine
), final AS (
SELECT
date_day,
date_actual,
day_name,
month_actual,
year_actual,
quarter_actual,
day_of_week,
first_day_of_week,
week_of_year,
day_of_month,
day_of_quarter,
day_of_year,
fiscal_year,
fiscal_quarter,
day_of_fiscal_quarter,
day_of_fiscal_year,
month_name,
first_day_of_month,
last_day_of_month,
first_day_of_year,
last_day_of_year,
first_day_of_quarter,
last_day_of_quarter,
first_day_of_fiscal_quarter,
last_day_of_fiscal_quarter,
first_day_of_fiscal_year,
last_day_of_fiscal_year,
week_of_fiscal_year,
month_of_fiscal_year,
last_day_of_week,
quarter_name,
fiscal_quarter_name,
fiscal_quarter_name_fy,
fiscal_quarter_number_absolute,
fiscal_month_name,
fiscal_month_name_fy,
holiday_desc,
is_holiday,
last_month_of_fiscal_quarter,
is_first_day_of_last_month_of_fiscal_quarter,
last_month_of_fiscal_year,
is_first_day_of_last_month_of_fiscal_year,
snapshot_date_fpa,
snapshot_date_billings
FROM calculated
)
SELECT * FROM final
**I believe the GitLab team uses Snowflake, so if you're using another platform you may need to change a few functions.**
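For instance, a few of the Snowflake functions above have no direct Postgres equivalent; some substitutions I'd expect to need (a non-exhaustive, hedged list):

-- Possible Postgres replacements for Snowflake-isms in the model above:
-- DAYNAME(date_day)        -> TO_CHAR(date_day, 'Dy')
-- WEEK(date_day)           -> EXTRACT(WEEK FROM date_day)
-- DATEADD('day', -1, d)    -> d - INTERVAL '1 day'
-- IFF(cond, a, b)          -> CASE WHEN cond THEN a ELSE b END
-- DECODE(x, 1, 'Q1', ...)  -> CASE x WHEN 1 THEN 'Q1' ... END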

Computing session start and end using SQL window functions

I've a table of game logs containing a handDate, like this:
ID   handDate
1    2019-06-30 16:14:02.000
2    2019-07-12 06:18:02.000
3    ...
I'd like to compute game sessions from this table (start and end), given that:
A new session starts after 1 hour of inactivity.
A session can span 2 days.
So I'd like results like this:
day          session_start             session_end
2019-06-30   2019-06-15 16:14:02.000   2019-06-15 16:54:02.000
2019-07-02   2019-07-02 16:18:02.000   2019-07-02 17:18:02.000
2019-07-02   2019-07-02 23:18:02.000   2019-07-03 03:18:02.000
2019-07-03   2019-07-03 06:18:02.000   2019-07-03 08:28:02.000
Currently I'm playing with the following code, but cannot achieve what I want:
SELECT *
FROM (
SELECT *,
strftime( '%s', handDate) - strftime( '%s', prev_event) AS inactivity
FROM (
SELECT handDate,
date( handDate) as day,
FIRST_VALUE( handDate) OVER (PARTITION BY date( handDate) ORDER BY handDate) AS first_event,
MIN(handDate) OVER (PARTITION BY date( handDate) ORDER BY handDate),
MAX(handDate) OVER (PARTITION BY date( handDate) ORDER BY handDate),
LAG( handDate) OVER (PARTITION BY date( handDate) ORDER BY handDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS prev_event,
LEAD( handDate) OVER (PARTITION BY date( handDate) ORDER BY handDate) AS next_event
FROM hands
) last
) final
I'm using SQLite.
I found the following solution:
SELECT day,
sessionId,
MIN(handDate) as sessionStart,
MAX(handDate) as sessionEnd
FROM(
SELECT day,
handDate,
sum(is_new_session) over (
order by handDate rows between unbounded preceding and current row
) as sessionId
FROM (
SELECT *,
CASE
WHEN prev_event IS NULL
OR strftime('%s', handDate) - strftime('%s', prev_event) > 3600 THEN true
ELSE false
END AS is_new_session
FROM (
SELECT handDate,
date(handDate) as day,
LAG(handDate) OVER (
PARTITION BY date(handDate)
ORDER BY handDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS prev_event
FROM hands
)
)
)
GROUP BY sessionId
An alternative approach, with its own test data:
DROP TABLE IF EXISTS hands;
CREATE TABLE hands(handDate TIMESTAMP);
INSERT INTO hands(handDate)
VALUES ('2021-10-29 10:30:00')
, ('2021-10-29 11:35:00')
, ('2021-10-29 11:36:00')
, ('2021-10-29 11:37:00')
, ('2021-10-29 12:38:00')
, ('2021-10-29 12:39:00')
, ('2021-10-29 12:39:10')
;
SELECT start_period, end_period
FROM (
SELECT is_start, handDate AS start_period
, CASE WHEN is_start AND is_end THEN handDate
ELSE LEAD(handDate) OVER (ORDER BY handDate)
END AS END_period
FROM (
SELECT *
FROM (
SELECT *
,CASE WHEN (event-prev_event) * 1440.0 > 60 OR prev_event IS NULL THEN true ELSE FALSE END AS is_start
,CASE WHEN (next_event-event) * 1440.0 > 60 OR next_event IS NULL THEN true ELSE FALSE END AS is_end
FROM (
SELECT handDate
, julianday(handDate) AS event
, julianday(LAG(handDate) OVER (ORDER BY handDate)) AS prev_event
, julianday(LEAD(handDate) OVER (ORDER BY handDate)) AS next_event
FROM hands
) t
) t
WHERE is_start OR is_end
)t
)t
WHERE is_start
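Against the test data above, if I've traced the 60-minute threshold correctly, this should return something like (my expected output, not from the original post):

start_period          end_period
2021-10-29 10:30:00   2021-10-29 10:30:00
2021-10-29 11:35:00   2021-10-29 11:37:00
2021-10-29 12:38:00   2021-10-29 12:39:10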

convert rows into columns - Bigquery

I have a table like the one shown below.
As shown, I have two rows for the same subject, each row indicating a day.
However, I wish to convert them into a single row, as shown below.
Can you help? I did check this post but was unable to translate it.
Let's first transform your original data into a form that we can then pivot.
Below does this:
#standardSQL
SELECT subject_id, hm_id, icu_id, balance,
DATE_DIFF(day, MIN(day) OVER(PARTITION BY subject_id, hm_id, icu_id), DAY) + 1 delta
FROM `project.dataset.table`
-- ORDER BY subject_id, hm_id, icu_id, delta
Applied to the sample data from your question, the result is:
Row subject_id hm_id icu_id balance delta
1 124 ab cd 2 1
2 124 ab cd 5 2
3 321 xy pq -6 1
4 321 xy pq 1 2
So, now we need to pivot this based on the delta column - balance for delta = 1 goes to day_1_balance, balance for delta = 2 goes to day_2_balance, and so on.
Let's assume for now that there are just two deltas (as in your sample data). In this simplified case, the below will do the trick:
#standardSQL
SELECT subject_id, hm_id, icu_id,
MAX(IF(delta = 1, balance, NULL)) day_1_balance,
MAX(IF(delta = 2, balance, NULL)) day_2_balance
FROM (
SELECT subject_id, hm_id, icu_id, balance,
DATE_DIFF(day, MIN(day) OVER(PARTITION BY subject_id, hm_id, icu_id), DAY) + 1 delta
FROM `project.dataset.table`
)
GROUP BY subject_id, hm_id, icu_id
-- ORDER BY subject_id, hm_id, icu_id
with the result:
Row subject_id hm_id icu_id day_1_balance day_2_balance
1 124 ab cd 2 5
2 321 xy pq -6 1
Obviously, in a real case you don't know how many delta columns you will have, so you need to build the above query dynamically - and that is exactly where the post you referenced will help you.
You can try it yourself first - or see below for the final solution.
Step 1 - generating query
#standardSQL
WITH temp AS (
SELECT subject_id, hm_id, icu_id, balance,
DATE_DIFF(day, MIN(day) OVER(PARTITION BY subject_id, hm_id, icu_id), DAY) + 1 delta
FROM `project.dataset.table`
)
SELECT CONCAT('SELECT subject_id, hm_id, icu_id,',
STRING_AGG(
CONCAT(' MAX(IF(delta = ',CAST(delta AS STRING),', balance, NULL)) as day_',CAST(delta AS STRING),'_balance')
)
,' FROM temp GROUP BY subject_id, hm_id, icu_id ORDER BY subject_id, hm_id, icu_id')
FROM (
SELECT delta
FROM temp
GROUP BY delta
ORDER BY delta
)
The result of step 1 is the text of the final query, which you then run as step 2.
Step 2 - run the generated query
#standardSQL
WITH temp AS (
SELECT subject_id, hm_id, icu_id, balance,
DATE_DIFF(day, MIN(day) OVER(PARTITION BY subject_id, hm_id, icu_id), DAY) + 1 delta
FROM `project.dataset.table`
)
SELECT subject_id, hm_id, icu_id,
MAX(IF(delta = 1, balance, NULL)) AS day_1_balance,
MAX(IF(delta = 2, balance, NULL)) AS day_2_balance
FROM temp
GROUP BY subject_id, hm_id, icu_id
-- ORDER BY subject_id, hm_id, icu_id
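As a closing note from me (not in the original answer): BigQuery scripting can run the generated text from step 1 directly with EXECUTE IMMEDIATE, avoiding the copy/paste between the two steps. A sketch:

DECLARE pivot_sql STRING;

-- Build the step-2 query as a string; the generated SQL must be standalone,
-- so the temp logic is inlined as a subquery.
SET pivot_sql = (
  SELECT CONCAT(
    'SELECT subject_id, hm_id, icu_id,',
    STRING_AGG(CONCAT(' MAX(IF(delta = ', CAST(delta AS STRING),
                      ', balance, NULL)) AS day_', CAST(delta AS STRING), '_balance')
               ORDER BY delta),
    ' FROM (SELECT subject_id, hm_id, icu_id, balance,',
    ' DATE_DIFF(day, MIN(day) OVER(PARTITION BY subject_id, hm_id, icu_id), DAY) + 1 AS delta',
    ' FROM `project.dataset.table`)',
    ' GROUP BY subject_id, hm_id, icu_id'
  )
  FROM (
    SELECT delta
    FROM (
      SELECT DATE_DIFF(day, MIN(day) OVER(PARTITION BY subject_id, hm_id, icu_id), DAY) + 1 AS delta
      FROM `project.dataset.table`
    )
    GROUP BY delta
  )
);

EXECUTE IMMEDIATE pivot_sql;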

Running Average of 3 Days Trading and then Populate that Value for Trading Holidays

I have a table with data for different companies and the closing price for trading days. I need to calculate a 3-day running average for every company. I then need to join with a calendar table to populate ClosePrice and Avg3DayClosePrice for all dates, including trading holidays. For trading holidays the values should be those of the previous trading day.
Part of this is already answered in the post SQL for Dates with no ClosePrice for all companies.
3-day average, before including trading holidays:
select d.date as tdate, d.datekey, t.ticker, fsdc.ClosePrice as cp,
coalesce(fsdc.ClosePrice,
lag(fsdc.ClosePrice, 1) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 2) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 3) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 4) over (partition by t.ticker order by d.date)
) as ClosePrice
-- This is the new addition
,AVG(fsdc.ClosePrice) OVER (partition by t.ticker order by d.date
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS avrg
-- new addition ends
from dimdates d join
(select ticker, min(datekey) as min_datekey
from factStockDividendCommodity fsdc
--where ticker <> 'BP'
group by ticker
) t
on d.datekey >= t.min_datekey left join
factStockDividendCommodity fsdc
on fsdc.ticker = t.ticker and
fsdc.datekey = d.datekey
where d.Date <= GETDATE()
order by ticker, d.Date;
[Screenshot illustrating the issue omitted]
Updated Script:
select d.date as tdate, d.datekey, t.ticker, fsdc.ClosePrice as cp,
coalesce(fsdc.ClosePrice,
lag(fsdc.ClosePrice, 1) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 2) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 3) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 4) over (partition by t.ticker order by d.date)
) as ClosePrice,
coalesce( lag(fsdc.ClosePrice, 1) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 2) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 3) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 4) over (partition by t.ticker order by d.date),
lag(fsdc.ClosePrice, 5) over (partition by t.ticker order by d.date)
) as OpenPrice, av3,
coalesce(fsdc.av3,
lag(fsdc.av3, 1) over (partition by t.ticker order by d.date),
lag(fsdc.av3, 2) over (partition by t.ticker order by d.date),
lag(fsdc.av3, 3) over (partition by t.ticker order by d.date),
lag(fsdc.av3, 4) over (partition by t.ticker order by d.date)
) as Avg3
from
(select * from dimdates where datekey <=20181231) d join
(select ticker, min(datekey) as min_datekey
from factStockDividendCommodity
where ticker <> '5X10TR'
group by ticker
) t
on d.datekey >= t.min_datekey left join
(
select ticker, datekey, ClosePrice, AVG(ClosePrice) OVER (partition by ticker order by datekey
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) Av3
from factStockDividendCommodity
where ticker <> '5X10TR'
) fsdc
on fsdc.ticker = t.ticker and
fsdc.datekey = d.datekey
where d.Date <= GETDATE()
order by ticker, d.Date;
You can use the AVG analytic function with a PRECEDING frame clause.
AVG(fsdc.ClosePrice ignore nulls) OVER (partition by t.ticker order by d.date
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS avrg
-- update --
You should use LAG with IGNORE NULLS as follows to fetch the values individually:
Lag(fsdc.ClosePrice,1) ignore nulls OVER (partition by t.ticker order by d.date) as prev1,
Lag(fsdc.ClosePrice,2) ignore nulls OVER (partition by t.ticker order by d.date) as prev2,
Lag(fsdc.ClosePrice,3) ignore nulls OVER (partition by t.ticker order by d.date) as prev3
Cheers!!
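A caveat from me, not part of the original answer: GETDATE() suggests SQL Server, where IGNORE NULLS on LAG is, to my knowledge, only available natively from SQL Server 2022 (as written above it is Oracle-style syntax; note also that AVG already skips NULLs by default). On older SQL Server versions, an alternative to stacking LAG calls is an OUTER APPLY that carries the last trading day's values forward; table and column names follow the question:

-- Sketch: fill each calendar date with the most recent trading-day close.
SELECT d.date AS tdate, t.ticker, filled.ClosePrice, filled.Av3
FROM dimdates d
JOIN (SELECT ticker, MIN(datekey) AS min_datekey
      FROM factStockDividendCommodity
      GROUP BY ticker) t
  ON d.datekey >= t.min_datekey
OUTER APPLY (
    -- latest row at or before this calendar date for this ticker
    SELECT TOP (1) fsdc.ClosePrice,
           (SELECT AVG(prev.ClosePrice)   -- average of the 3 prior trading days
            FROM (SELECT TOP (3) p.ClosePrice
                  FROM factStockDividendCommodity p
                  WHERE p.ticker = t.ticker AND p.datekey < fsdc.datekey
                  ORDER BY p.datekey DESC) prev) AS Av3
    FROM factStockDividendCommodity fsdc
    WHERE fsdc.ticker = t.ticker
      AND fsdc.datekey <= d.datekey
    ORDER BY fsdc.datekey DESC
) AS filled
WHERE d.Date <= GETDATE()
ORDER BY t.ticker, d.Date;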