Snowflake SQL: RANGE BETWEEN with an interval

I have an input table with the following structure:
ID, Date, Value.
I am trying to calculate the minimum value over the last 10 months for every record in the dataset. For that I am using RANGE BETWEEN with an interval.
The code below works fine in Spark SQL, but for some reason I can't use the same code in Snowflake SQL. I'd appreciate it if someone could guide me on how to modify it to run in Snowflake.
select *,
min(avg_Value) OVER (
PARTITION BY ID
ORDER BY CAST(Date AS timestamp)
RANGE BETWEEN INTERVAL 10 MONTHS PRECEDING AND CURRENT ROW) as min_value_in_last_10_months
from
(
select ID,
Date,
avg(Value) as avg_Value
from table
group by ID,Date
)

Snowflake supports lateral joins, so one method is:
select . . .
from t cross join lateral
(select avg(t2.value) as avg_value
from t t2
where t2.id = t.id and
t2.date >= t.date - interval '10 months' and
t2.date <= t.date
) a
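
Applied to the question's query, the rewrite might look like this (a sketch only; it assumes the same column names as in the question, uses a hypothetical table name my_table, and uses DATEADD, which Snowflake supports, instead of interval arithmetic):

```
-- Sketch: per-row minimum of the daily average over the last 10 months.
with daily as (
    select ID, Date, avg(Value) as avg_Value
    from my_table
    group by ID, Date
)
select d.*,
       a.min_value_in_last_10_months
from daily d
cross join lateral (
    select min(d2.avg_Value) as min_value_in_last_10_months
    from daily d2
    where d2.ID = d.ID
      and d2.Date >= dateadd(month, -10, d.Date)
      and d2.Date <= d.Date
) a;
```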

Related

BigQuery: 'join lateral' alternative for referencing value in subquery

I have a BigQuery table that holds append-only data - each time an entity is updated a new version of it is inserted. Each entity has its unique ID and each entry has a timestamp of when it was inserted.
When querying for the latest version of the entity, I order by rank, partition by id, and select the most recent version.
I want to take advantage of this and chart the progression of these entities over time. For example, I would like to generate a row for each day since Jan. 1st, with a summary of the entities as they were on that day. In Postgres, I would do:
select
...
from generate_series('2022-01-01'::timestamp, '2022-09-01'::timestamp, '1 day'::interval) query_date
left join lateral (
select *
from (
with snapshot as (
select distinct on (id) *
from table
where "createdOn" <= query_date
order by id, "createdOn" desc
)
This basically behaves like a for-each: each subquery runs once per query_date (a day, in this instance), which I can reference in the where clause. Each subquery then filters the data so that it only uses data up to a certain time.
I know that I can create a saved query for the "subquery" logic and then schedule a prefill to run once for each day over the timeline, but I would like to understand how to write an exploratory query.
EDIT 1
Using a correlated subquery is a step in the right direction, but does not work when the subquery needs to join with another table (another append-only table holding a related entity).
So this works:
select
day
, (
select count(*)
from `table` t
where date(createdOn) < day
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
But if I need the subquery to join with another table, like in:
select
day
, (
select as struct *
from (
select
id
, status
, rank() over (partition by id order by createdOn desc) as rank
from `table1`
where date(createdOn) < day
qualify rank = 1
) t1
left join (
select
id
, other
, rank() over (partition by id order by createdOn desc) as rank
from `table2`
where date(createdOn) < day
qualify rank = 1
) t2 on t2.other = t1.id
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
I get an error saying Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. Another SO question about that error (Avoid correlated subqueries error in BigQuery) solves it by moving the correlated query into a join in the top-level query, which misses what I am trying to achieve.
Took me a while, but I figured out a way to do this using the answer in Bigquery: WHERE clause using column from outside the subquery.
Basically, it requires flipping the order of the queries. Here's how it's done:
select *
from (
select *
from `table1` t1
JOIN (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
ON (t1.createdOn) < day.day
QUALIFY ROW_NUMBER() OVER (PARTITION BY day, t1.id ORDER BY t1.createdOn desc) = 1
)
left join (
select
* -- aggregate here
from (
SELECT
id, other, createdOn
FROM `table2` t2
JOIN (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
ON (t2.createdOn) < day.day
QUALIFY ROW_NUMBER() OVER (PARTITION BY day, t2.id ORDER BY t2.createdOn desc) = 1
) snapshot
group by snapshot.other, day
) t2 on t2.other = t1.id and t2.day = t1.day
group by t1.day
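
Distilled, the pattern is: instead of running a correlated snapshot subquery once per day, inequality-join each table against the generated day array, keep the latest row per (day, id) with QUALIFY, and only then join the two snapshots. A sketch (table and column names are illustrative, matching the question's hypothetical table1/table2):

```
with days as (
  select day
  from unnest(generate_date_array(date '2022-01-01', current_date(), interval 1 day)) as day
),
snap1 as (
  select days.day, t1.*
  from `table1` t1
  join days on date(t1.createdOn) < days.day
  qualify row_number() over (partition by days.day, t1.id order by t1.createdOn desc) = 1
),
snap2 as (
  select days.day, t2.*
  from `table2` t2
  join days on date(t2.createdOn) < days.day
  qualify row_number() over (partition by days.day, t2.id order by t2.createdOn desc) = 1
)
select *
from snap1
left join snap2 on snap2.other = snap1.id and snap2.day = snap1.day;
```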

SQL: Select from another table (t2) without joining but referencing a column from t1

I have a table with columns date and net_sales. For each day, I want to get the sum of the net_sales for the last 30 days.
This is my query:
thirty_days_net_sales AS (
SELECT
t1.date,
t1.net_sales AS net_sales_on_date,
(SELECT SUM(t2.net_sales) FROM total_net_sales_per_day t2 WHERE t2.date >= DATE_SUB(t1.date, INTERVAL 30 DAY) AND t2.date <= t1.date)
FROM
total_net_sales_per_day t1)
When I run this query I get the error: LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
I am using Google BigQuery. Thanks in advance for your help!
Consider the approach below instead:
select *, sum(net_sales) over win as last_30_days
from total_net_sales_per_day
window win as (order by unix_date(date) range between 29 preceding and current row)
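
The unix_date() call converts each date to a day number, so the RANGE frame can use a plain numeric offset; 29 preceding plus the current row covers exactly 30 days. The same window can also be written inline, without the named WINDOW clause (a sketch):

```
select
  *,
  sum(net_sales) over (
    order by unix_date(date)
    range between 29 preceding and current row
  ) as last_30_days
from total_net_sales_per_day;
```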
You would use window functions. If you have data for every day (as the name of the table implies), then:
SELECT tnspd.*,
       sum(net_sales) over (order by date
                            rows between 29 preceding and current row
                           ) as last_30_days
FROM total_net_sales_per_day tnspd;
As the error says, you have to add an equality condition. When using an explicit join, the equality goes in the ON clause.
Your query has no explicit join, so the correlated subquery itself must contain an equality condition, something like the one I added below:
thirty_days_net_sales AS (
SELECT
t1.date,
t1.net_sales AS net_sales_on_date,
(SELECT SUM(t2.net_sales) FROM total_net_sales_per_day t2 WHERE t2.date >= DATE_SUB(t1.date, INTERVAL 30 DAY) AND t2.date <= t1.date
AND t1.id = t2.id)
FROM
total_net_sales_per_day t1)
This link might help for more information:
https://sql.info/d/solved-bigquery-left-outer-join-cannot-be-used-without-a-condition-that-is-an-equality-of-fields-from-both-sides-of-the-join

Self-referencing a table for previous record matching user ID

I'm trying to find the easiest way to calculate cycle times from SQL data. In the data source I have unique station ID's, user ID's, and a date/time stamp, along with other data they are performing.
What I want to do is join the table to itself so that for each date/time stamp I get:
- the date/time stamp of the most recent previous instance of that user ID within 3 minutes or null
- the difference between those two stamps (the cycle time = amount of time between records)
This should be simple but I can't wrap my brain around it. Any help?
Unfortunately SQL Server does not support date range specifications in window functions. I would recommend a lateral join here:
select
t.*,
t1.timestamp last_timestamp,
datediff(second, t1.timestamp, t.timestamp) diff_seconds
from mytable t
outer apply (
select top(1) t1.*
from mytable t1
where
t1.user_id = t.user_id
and t1.timestamp >= dateadd(minute, -3, t.timestamp)
and t1.timestamp < t.timestamp
order by t1.timestamp desc
) t1
The subquery brings the most recent row within 3 minutes for the same user_id (or an empty resultset, if there is no row within that timeframe). You can then use that information in the outer query to display the corresponding timestamp, and compute the difference with the current one.
Simply calculate the difference between the current and the LAG timestamp; if it's more than three minutes, return NULL instead:
with cte as
(
select
t.*
,datediff(second, lag(timestamp) over (partition by user_id order by timestamp), timestamp) as diff_seconds
from mytable as t
)
select cte.*
,case when diff_seconds <= 180 then diff_seconds end as diff_seconds
from cte

Window functions ordered by date when some dates don't exist

Suppose this example query:
select
id
, date
, sum(var) over (partition by id order by date rows 30 preceding) as roll_sum
from tab
When some dates are not present in the date column, the window will not consider the nonexistent dates. How could I make this window aggregation include these missing dates?
Many thanks!
You can join a sequence containing all dates from a desired interval.
select
*
from (
select
d.date,
q.id,
q.roll_sum
from unnest(sequence(date '2000-01-01', date '2030-12-31')) as d(date)
left join ( your_query ) q on q.date = d.date
) v
where v.date > (select min(my_date) from tab2)
and v.date < (select max(my_date) from tab2)
In standard SQL, you would typically use a window range specification, like:
select
id,
date,
sum(var) over (
partition by id
order by date
range interval '30' day preceding
) as roll_sum
from tab
However, I am unsure whether Presto supports this syntax. You can resort to a correlated subquery instead:
select
id,
date,
(
select sum(var)
from tab t1
where
t1.id = t.id
and t1.date >= t.date - interval '30' day
and t1.date <= t.date
) roll_sum
from tab t
I don't think Presto supports window functions with interval ranges. Alas. There is an old-fashioned way of doing this, by counting "ins" and "outs" of values:
with t as (
      select id, date, var, 1 as is_orig
      from tab
      union all
      select id, date + interval '30' day, -var, 0
      from tab
     )
select t.*
from (select id, date, sum(sum(var)) over (partition by id order by date) as running_30,
             sum(is_orig) as is_orig
      from t
      group by id, date
     ) t
where is_orig > 0

Postgres windowing (determine contiguous days)

Using Postgres 9.3, I'm trying to count the number of contiguous days of a certain weather type. If we assume we have a regular time series and weather report:
date|weather
"2016-02-01";"Sunny"
"2016-02-02";"Cloudy"
"2016-02-03";"Snow"
"2016-02-04";"Snow"
"2016-02-05";"Cloudy"
"2016-02-06";"Sunny"
"2016-02-07";"Sunny"
"2016-02-08";"Sunny"
"2016-02-09";"Snow"
"2016-02-10";"Snow"
I want something to count the contiguous days of the same weather. The results should look something like this:
date|weather|contiguous_days
"2016-02-01";"Sunny";1
"2016-02-02";"Cloudy";1
"2016-02-03";"Snow";1
"2016-02-04";"Snow";2
"2016-02-05";"Cloudy";1
"2016-02-06";"Sunny";1
"2016-02-07";"Sunny";2
"2016-02-08";"Sunny";3
"2016-02-09";"Snow";1
"2016-02-10";"Snow";2
I've been banging my head on this for a while trying to use window functions. At first it seems like it should be a no-brainer, but then I found out it's much harder than expected.
Here is what I've tried...
Select date, weather, Row_Number() Over (partition by weather order by date)
from t_weather
Would it be easier just to compare the current row to the next? How would you do that while maintaining a count? Any thoughts, ideas, or even solutions would be helpful!
-Kip
You need to identify the contiguous periods where the weather is the same. You can do this by adding a grouping identifier. There is a simple method: subtract a sequence of increasing numbers from the dates; the result is constant for contiguous dates.
Once you have the grouping, the rest is row_number():
Select date, weather,
Row_Number() Over (partition by weather, grp order by date)
from (select w.*,
(date - row_number() over (partition by weather order by date) * interval '1 day') as grp
from t_weather w
) w;
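
To see why the subtraction works, take the "Sunny" rows from the sample data. Each contiguous run collapses to the same grp value, which the outer row_number() then partitions on:

```
date        row_number   date - row_number * interval '1 day'  (grp)
2016-02-01  1            2016-01-31
2016-02-06  2            2016-02-04
2016-02-07  3            2016-02-04
2016-02-08  4            2016-02-04
```

The isolated Feb 1 row gets its own grp (2016-01-31), while the Feb 6-8 run shares grp 2016-02-04 and is numbered 1, 2, 3.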
The SQL Fiddle is here.
I'm not sure what the query engine is going to do when scanning multiple times across the same data set (kinda like calculating area under a curve), but this works...
WITH v(date, weather) AS (
VALUES
('2016-02-01'::date,'Sunny'::text),
('2016-02-02','Cloudy'),
('2016-02-03','Snow'),
('2016-02-04','Snow'),
('2016-02-05','Cloudy'),
('2016-02-06','Sunny'),
('2016-02-07','Sunny'),
('2016-02-08','Sunny'),
('2016-02-09','Snow'),
('2016-02-10','Snow') ),
changes AS (
SELECT date,
weather,
CASE WHEN lag(weather) OVER (ORDER BY date) = weather THEN 1 ELSE 0 END change
FROM v)
SELECT date
, weather
,(SELECT count(weather) -- number of times the weather didn't change
FROM changes v2
WHERE v2.date <= v1.date AND v2.weather = v1.weather
AND v2.date >= ( -- bounded between changes of weather
SELECT max(date)
FROM changes v3
WHERE change = 0
AND v3.weather = v1.weather
AND v3.date <= v1.date) --<-- here's the expensive part
) curve
FROM changes v1
Here is another approach based off of this answer.
First we add a change column that is 1 or 0 depending on whether the weather is different or not from the previous day.
Then we introduce a group_nr column by summing the change over an order by date. This produces a unique group number for each sequence of consecutive same-weather days since the sum is only incremented on the first day of each sequence.
Finally we do a row_number() over (partition by group_nr order by date) to produce the running count per group.
select date, weather, row_number() over (partition by group_nr order by date)
from (
select *, sum(change) over (order by date) as group_nr
from (
select *, (weather != lag(weather,1,'') over (order by date))::int as change
from tmp_weather
) t1
) t2;
sqlfiddle (uses equivalent WITH syntax)
You can accomplish this with a recursive CTE as follows:
WITH RECURSIVE CTE_ConsecutiveDays AS
(
SELECT
my_date,
weather,
1 AS consecutive_days
FROM My_Table T
WHERE
NOT EXISTS (SELECT * FROM My_Table T2 WHERE T2.my_date = T.my_date - INTERVAL '1 day' AND T2.weather = T.weather)
UNION ALL
SELECT
T.my_date,
T.weather,
CD.consecutive_days + 1
FROM
CTE_ConsecutiveDays CD
INNER JOIN My_Table T ON
T.my_date = CD.my_date + INTERVAL '1 day' AND
T.weather = CD.weather
)
SELECT *
FROM CTE_ConsecutiveDays
ORDER BY my_date;
Here's the SQL Fiddle to test: http://www.sqlfiddle.com/#!15/383e5/3