SQL lag to row which meets condition - sql

I have a table which contains measures taken on random dates, partitioned by the site at which they were taken.
site    date        measurement
AB1234  2022-12-09  1
AB1234  2022-06-11  2
AB1234  2019-05-22  3
AB1234  2017-01-30  4
CD5678  2022-11-01  5
CD5678  2020-04-10  6
CD5678  2017-04-10  7
CD5678  2017-01-22  8
To calculate year-on-year growth, I want an additional field on each record that contains the previous measurement at that site. The challenging part is that I only want the most recent previous measurement that was taken more than a year earlier.
Like so:
site    date        measurement  previous_measurement
AB1234  2022-12-09  1            3
AB1234  2022-06-11  2            3
AB1234  2019-05-22  3            4
AB1234  2017-01-30  4            NULL
CD5678  2022-11-01  5            6
CD5678  2020-04-10  6            7
CD5678  2017-04-10  7            NULL
CD5678  2017-01-22  8            NULL
It feels like it should be possible with a window function, but I can't work it out.
Please help :(

Amazon Athena engine version 3 is based on Trino. If it includes Trino's full support for the RANGE frame type in window functions, you can use that:
-- sample data
with dataset(site, date, measurement) as (
values ('AB1234', date '2022-12-09', 1),
('AB1234', date '2022-06-11', 2),
('AB1234', date '2019-05-22', 3),
('AB1234', date '2017-01-30', 4),
('CD5678', date '2022-11-01', 5),
('CD5678', date '2020-04-10', 6),
('CD5678', date '2017-04-10', 7),
('CD5678', date '2017-01-22', 8)
)
-- query
select *,
last_value(measurement) over (
partition by site
order by date
RANGE BETWEEN UNBOUNDED PRECEDING AND interval '1' year PRECEDING)
from dataset;
Output:
site    date        measurement  _col3
CD5678  2017-01-22  8            NULL
CD5678  2017-04-10  7            NULL
CD5678  2020-04-10  6            7
CD5678  2022-11-01  5            6
AB1234  2017-01-30  4            NULL
AB1234  2019-05-22  3            4
AB1234  2022-06-11  2            3
AB1234  2022-12-09  1            3
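If your engine turns out not to support RANGE frames with an interval offset, a correlated subquery is one possible fallback. The sketch below is not Athena code: it runs the idea through SQLite via Python's standard sqlite3 module so it can be tried anywhere, and the names ROWS, QUERY and previous_measurements are made up for the illustration. The cutoff uses <=, so a measurement taken exactly one year earlier is included, the same as the frame above; change it to < if you need strictly more than a year.

```python
import sqlite3

# Sample data from the question (dates kept as ISO strings so they sort correctly).
ROWS = [
    ("AB1234", "2022-12-09", 1),
    ("AB1234", "2022-06-11", 2),
    ("AB1234", "2019-05-22", 3),
    ("AB1234", "2017-01-30", 4),
    ("CD5678", "2022-11-01", 5),
    ("CD5678", "2020-04-10", 6),
    ("CD5678", "2017-04-10", 7),
    ("CD5678", "2017-01-22", 8),
]

# For each row, pick the latest measurement at the same site that is
# at least one year older than the row's own date.
QUERY = """
SELECT d.site, d.date, d.measurement,
       (SELECT p.measurement
          FROM dataset AS p
         WHERE p.site = d.site
           AND p.date <= date(d.date, '-1 year')
         ORDER BY p.date DESC
         LIMIT 1) AS previous_measurement
  FROM dataset AS d
 ORDER BY d.site, d.date DESC
"""


def previous_measurements(rows):
    """Load rows into an in-memory SQLite table and run QUERY (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataset (site TEXT, date TEXT, measurement INTEGER)")
    conn.executemany("INSERT INTO dataset VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in previous_measurements(ROWS):
        print(row)
```

On a large table the subquery runs once per row, so an index on (site, date) would likely matter.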

Related

Get data from exactly 30 days ago only in SQL (Big Query)

In BigQuery I am trying to extract data for one exact date, 30 days ago, so that every time I pull or refresh the data it is always for 30 days ago, no more, no less. However, the following pulls in two dates:
SELECT FORMAT_DATE("%Y-%m-%d",createddatetime1) as dated, brand, orderid
FROM TABLE
WHERE createddatetime1 between TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AND TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 29 DAY)
I have tried different tactics, such as convert and cast, but I can't seem to pull data for one day only. createddatetime1 is formatted as "2022-08-02 23:53:57 UTC".
Example current output, you'll see two dates in there:
Row createddatetime1 brand orderid
1 2022-08-02 23:53:57 UTC ABC 1
2 2022-08-02 14:11:05 UTC ABC 2
3 2022-08-02 13:31:52 UTC ABC 3
4 2022-08-02 20:14:16 UTC ABC 4
5 2022-08-02 23:18:28 UTC ABC 5
6 2022-08-02 17:27:06 UTC ABC 6
7 2022-08-03 01:44:12 UTC ABC 7
8 2022-08-03 09:57:19 UTC ABC 8
9 2022-08-02 12:32:23 UTC ABC 9
10 2022-08-02 18:52:33 UTC ABC 10
Expected output:
Row createddatetime1 brand orderid
1 02/08/2022 ABC 1
2 02/08/2022 ABC 2
3 02/08/2022 ABC 3
4 02/08/2022 ABC 4
5 02/08/2022 ABC 5
6 02/08/2022 ABC 6
7 02/08/2022 ABC 7
8 02/08/2022 ABC 8
9 02/08/2022 ABC 9
10 02/08/2022 ABC 10
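You're getting data for two dates because the filter covers a rolling 24-hour window that starts 30 days before the current timestamp, so unless the query runs exactly at midnight it spans parts of two calendar days. BETWEEN also includes both the start and end values. To get a single day, extract the date from the timestamp column and filter on equality.
This should work: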
where DATE(createddatetime1) = DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
This should work:
SELECT
Date(createddatetime1) as date, brand, orderid
FROM TABLE
where DATE(createddatetime1) = current_date()-30
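Another way to get exactly one calendar day is a half-open range (>= start of the day, < start of the next day), which leaves the timestamp column unwrapped. The sketch below is not BigQuery code: it uses SQLite through Python's sqlite3 module to show the idea on the rows from the question, and ORDERS, orders_n_days_ago and the explicit today argument are assumptions made so the example is repeatable.

```python
import sqlite3
from datetime import date, datetime, time, timedelta

# Rows from the question, with the " UTC" suffix dropped for SQLite.
ORDERS = [
    ("2022-08-02 23:53:57", "ABC", 1),
    ("2022-08-02 14:11:05", "ABC", 2),
    ("2022-08-02 13:31:52", "ABC", 3),
    ("2022-08-02 20:14:16", "ABC", 4),
    ("2022-08-02 23:18:28", "ABC", 5),
    ("2022-08-02 17:27:06", "ABC", 6),
    ("2022-08-03 01:44:12", "ABC", 7),
    ("2022-08-03 09:57:19", "ABC", 8),
    ("2022-08-02 12:32:23", "ABC", 9),
    ("2022-08-02 18:52:33", "ABC", 10),
]


def orders_n_days_ago(rows, today, days=30):
    """Return orders created on the calendar day `days` before `today` (hypothetical helper)."""
    target = today - timedelta(days=days)
    start = datetime.combine(target, time.min)   # midnight at the start of the target day
    end = start + timedelta(days=1)              # midnight at the start of the next day
    fmt = "%Y-%m-%d %H:%M:%S"
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (createddatetime1 TEXT, brand TEXT, orderid INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    # Half-open range: the start bound is included, the end bound is excluded.
    result = conn.execute(
        "SELECT date(createddatetime1) AS dated, brand, orderid "
        "FROM orders "
        "WHERE createddatetime1 >= ? AND createddatetime1 < ? "
        "ORDER BY orderid",
        (start.strftime(fmt), end.strftime(fmt)),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in orders_n_days_ago(ORDERS, today=date(2022, 9, 1)):
        print(row)
```

With today set to 2022-09-01 the target day is 2022-08-02, so orders 7 and 8 (created on 2022-08-03) drop out.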

How to Create table with Dates in range defined by table with start date inputs

I am trying to create a dates table in SQL based on a set of inputs, but I haven't been able to figure it out.
I am receiving in SQL inputs as below:
This table:
Date        Value
2022-01-01  5
2022-07-12  10
2022-11-15  3
A Start Date = 2022-01-01
A stop Date = 2022-12-01
I need to get a table like the one below, running from the Start Date to the Stop Date and assigning to each date in that period the corresponding value from the initial table:
Date        Value
2022-01-01  5
2022-01-02  5
2022-01-03  5
2022-01-04  5
...         5
2022-07-09  5
2022-07-10  5
2022-07-11  5
2022-07-12  10
2022-07-13  10
2022-07-14  10
...         10
2022-11-13  10
2022-11-14  10
2022-11-15  3
2022-11-16  3
2022-11-17  3
2022-11-18  3
How can I do that?
Thanks.
You can use the window function lead() over() in concert with an ad-hoc tally table (SQL Server syntax).
Example
Select Date = dateadd(DAY,N,A.Date)
,A.Value
From (
Select *
,nDays = datediff(DAY,Date,lead(Date,1,dateadd(day,1,'2022-12-01')) over (order by date))
From YourTable
) A
Join ( Select Top 1000 N=-1+Row_Number() Over (Order By (Select NULL)) From master..spt_values n1, master..spt_values n2 ) B
on N<NDays
Order by Date
Results
Date Value
2022-01-01 5
2022-01-02 5
2022-01-03 5
2022-01-04 5
2022-01-05 5
...
2022-07-10 5
2022-07-11 5
2022-07-12 10
2022-07-13 10
2022-07-14 10
...
2022-11-12 10
2022-11-13 10
2022-11-14 10
2022-11-15 3
2022-11-16 3
2022-11-17 3
...
2022-11-30 3
2022-12-01 3
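If you are not on SQL Server and have no master..spt_values to build a tally from, a recursive CTE can generate the calendar instead. This is only a sketch of that swapped-in technique, run against SQLite through Python's sqlite3 module; the table name inputs and the helper expand_dates are invented for the example. Each generated day takes the value of the latest input row on or before it.

```python
import sqlite3

# Input table from the question.
INPUTS = [("2022-01-01", 5), ("2022-07-12", 10), ("2022-11-15", 3)]

QUERY = """
WITH RECURSIVE days(d) AS (
    SELECT :start                       -- first day of the range
    UNION ALL
    SELECT date(d, '+1 day')            -- add one day per step
      FROM days
     WHERE d < :stop                    -- stop once the Stop Date is reached
)
SELECT d AS Date,
       (SELECT i.Value
          FROM inputs AS i
         WHERE i.Date <= d
         ORDER BY i.Date DESC
         LIMIT 1) AS Value
  FROM days
 ORDER BY d
"""


def expand_dates(inputs, start, stop):
    """Return one (date, value) row per day from start to stop inclusive (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inputs (Date TEXT, Value INTEGER)")
    conn.executemany("INSERT INTO inputs VALUES (?, ?)", inputs)
    result = conn.execute(QUERY, {"start": start, "stop": stop}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    rows = expand_dates(INPUTS, "2022-01-01", "2022-12-01")
    print(len(rows), rows[0], rows[-1])
```

Days before the first input row would come back with a NULL value, which may or may not be what you want.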

How can I join two tables on an ID and a DATE RANGE in SQL

I have 2 query result tables containing records for different assessments. There are RAssessments and NAssessments which make up a complete review.
The aim is to eventually determine which reviews were completed. I would like to join the two tables on the ID and on the date. However, the two assessments may not be completed on the same date and may be several days apart, and some IDs may have more RAssessments than NAssessments.
Therefore, I would like to join T1 on to T2 on ID & on T1Date(+ or - 7 days). There is no other way to match the two tables and to align the records other than using the date range, as this is a poorly designed database. I hope for some help with this as I am stumped.
Here is some sample data:
Table #1:
ID  RAssessmentDate
1   2020-01-03
1   2020-03-03
1   2020-05-03
2   2020-01-09
2   2020-04-09
3   2022-07-21
4   2020-06-30
4   2020-12-30
4   2021-06-30
4   2021-12-30
Table #2:
ID  NAssessmentDate
1   2020-01-07
1   2020-03-02
1   2020-05-03
2   2020-01-09
2   2020-07-06
2   2020-04-10
3   2022-07-21
4   2021-01-03
4   2021-06-28
4   2022-01-02
4   2022-06-26
I would like my end result table to look like this:
ID  RAssessmentDate  NAssessmentDate
1   2020-01-03       2020-01-07
1   2020-03-03       2020-03-02
1   2020-05-03       2020-05-03
2   2020-01-09       2020-01-09
2   2020-04-09       2020-04-10
2   NULL             2020-07-06
3   2022-07-21       2022-07-21
4   2020-06-30       NULL
4   2020-12-30       2021-01-03
4   2021-06-30       2021-06-28
4   2021-12-30       2022-01-02
4   NULL             2022-01-02
Try this:
SELECT
COALESCE(a.ID, b.ID) ID,
a.RAssessmentDate,
b.NAssessmentDate
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RAssessmentDate) RowId, *
FROM table1
) a
FULL OUTER JOIN (
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY NAssessmentDate) RowId, *
FROM table2
) b ON a.ID = b.ID AND a.RowId = b.RowId
WHERE (a.RAssessmentDate BETWEEN '2020-01-01' AND '2022-01-02')
OR (b.NAssessmentDate BETWEEN '2020-01-01' AND '2022-01-02')
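The query above pairs rows by position rather than by how close the dates are. If you want the match the question describes (same ID, dates within 7 days of each other, unmatched rows kept from both sides), something along these lines might work. It is a sketch only, run on SQLite through Python's sqlite3 module; the full outer join is written as a LEFT JOIN plus an anti-join so it does not depend on FULL OUTER JOIN support, and T1, T2 and match_assessments are names made up for the example.

```python
import sqlite3

# Sample data from the question.
T1 = [(1, "2020-01-03"), (1, "2020-03-03"), (1, "2020-05-03"),
      (2, "2020-01-09"), (2, "2020-04-09"), (3, "2022-07-21"),
      (4, "2020-06-30"), (4, "2020-12-30"), (4, "2021-06-30"), (4, "2021-12-30")]
T2 = [(1, "2020-01-07"), (1, "2020-03-02"), (1, "2020-05-03"),
      (2, "2020-01-09"), (2, "2020-07-06"), (2, "2020-04-10"),
      (3, "2022-07-21"), (4, "2021-01-03"), (4, "2021-06-28"),
      (4, "2022-01-02"), (4, "2022-06-26")]

QUERY = """
SELECT *
  FROM (
        -- every RAssessment, with its NAssessment when one is within 7 days
        SELECT r.ID, r.RAssessmentDate, n.NAssessmentDate
          FROM table1 AS r
          LEFT JOIN table2 AS n
            ON n.ID = r.ID
           AND abs(julianday(r.RAssessmentDate) - julianday(n.NAssessmentDate)) <= 7
        UNION ALL
        -- NAssessments that matched no RAssessment
        SELECT n.ID, NULL, n.NAssessmentDate
          FROM table2 AS n
         WHERE NOT EXISTS (
               SELECT 1
                 FROM table1 AS r
                WHERE r.ID = n.ID
                  AND abs(julianday(r.RAssessmentDate) - julianday(n.NAssessmentDate)) <= 7)
       )
 ORDER BY ID, coalesce(RAssessmentDate, NAssessmentDate)
"""


def match_assessments(t1, t2):
    """Join the two tables on ID and a +/- 7 day window (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (ID INTEGER, RAssessmentDate TEXT)")
    conn.execute("CREATE TABLE table2 (ID INTEGER, NAssessmentDate TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", t1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", t2)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in match_assessments(T1, T2):
        print(row)
```

If two assessments for the same ID ever fall within 7 days of the same counterpart, this produces one row per pair, so you may need an extra rule to pick the closest one.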

link a value from one table to another and slice one table based on columns from another table in sql

Suppose I have a first table like this:
tbl1:
eventid date1 date2
A 2020-06-21 2020-06-28
B 2020-05-13 2020-05-24
C 2020-07-20 2020-06-28
I also have a second table with a quantity and a date:
tbl2:
quantity date
5 2020-06-24
13 2020-07-24
8 2020-07-28
8 2020-06-20
12 2020-06-27
9 2020-06-29
10 2020-05-24
11 2020-05-12
18 2020-05-18
9 2020-05-14
7 2020-07-18
12 2020-07-21
Now I want to select only the rows from table 2 whose dates fall between the dates of table 1, and to add a column to table 2 with each row containing A, B or C (the eventid from table 1), so that we can see which date in table 2 belongs to which eventid.
So my end result would look like:
quantity date eventid
5 2020-06-24 1
13 2020-07-24 3
8 2020-07-28 3
12 2020-06-27 1
10 2020-05-24 2
18 2020-05-18 2
9 2020-05-14 2
12 2020-07-21 3
I've been staring at it for ages now because I need an efficient way to do it.
Is there an efficient way of obtaining the desired result?
This looks like a join:
select t2.*, t1.eventid
from tbl2 t2 join
tbl1 t1
on t2.date >= t1.date1 and t2.date <= t1.date2;

Sum column values over a window based on variable date range (impala)

Given a table as follows :
client_id date connections
---------------------------------------
121438297 2018-01-03 0
121438297 2018-01-08 1
121438297 2018-01-10 3
121438297 2018-01-12 1
121438297 2018-01-19 7
363863811 2018-01-18 0
363863811 2018-01-30 5
363863811 2018-02-01 4
363863811 2018-02-10 0
I am looking for an efficient way to sum the number of connections that occur within x number of days following the current row (the current row being included in the sum), partitioned by client_id.
If x=6 then the output table would result in :
client_id date connections connections_within_6_days
---------------------------------------------------------------------
121438297 2018-01-03 0 1
121438297 2018-01-08 1 5
121438297 2018-01-10 3 4
121438297 2018-01-12 1 1
121438297 2018-01-19 7 7
363863811 2018-01-18 0 0
363863811 2018-01-30 5 9
363863811 2018-02-01 4 4
363863811 2018-02-10 0 0
Concerns :
I do not want to add all missing dates and then perform a sliding window counting the x following rows because my table is already extremely large.
I am using Impala and the range between interval 'x' days following and current row is not supported.
The generic solution is a bit troublesome for multiple periods, but you can use multiple CTEs to support that. The idea is to "unpivot" the counts based on when they go in and out and then use a cumulative sum.
So:
with conn as (
      select client_id, date, connections
      from t
      union all
      -- each count drops out of the forward-looking window 7 days before its own date
      select client_id, date - interval 7 day, -connections
      from t
     ),
     conn1 as (
      select client_id, date,
             -- accumulate from the latest date backwards so each row sums itself and the 6 days after it
             sum(sum(connections)) over (partition by client_id order by date desc) as connections_within_6_days
      from conn
      group by client_id, date
     )
select t.*, conn1.connections_within_6_days
from t join
     conn1
     on conn1.client_id = t.client_id and
        conn1.date = t.date;
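To check the in/out idea against the expected output in the question, here is a small sketch for a window that looks forward x days. It is not Impala code: it runs on SQLite through Python's sqlite3 module (which needs SQLite 3.25 or later for window functions), and the intermediate names events, daily and running, the output column connections_within_x_days and the helper connections_within are all made up for the example. Each count enters on its own date and is cancelled x + 1 days earlier, and the running sum is taken from the latest date backwards.

```python
import sqlite3

# Sample data from the question.
ROWS = [
    (121438297, "2018-01-03", 0),
    (121438297, "2018-01-08", 1),
    (121438297, "2018-01-10", 3),
    (121438297, "2018-01-12", 1),
    (121438297, "2018-01-19", 7),
    (363863811, "2018-01-18", 0),
    (363863811, "2018-01-30", 5),
    (363863811, "2018-02-01", 4),
    (363863811, "2018-02-10", 0),
]

QUERY = """
WITH events AS (
    SELECT client_id, date, connections FROM t
    UNION ALL
    -- cancel each count for rows more than x days before it
    SELECT client_id, date(date, :shift), -connections FROM t
),
daily AS (
    SELECT client_id, date, SUM(connections) AS net
      FROM events
     GROUP BY client_id, date
),
running AS (
    SELECT client_id, date,
           SUM(net) OVER (PARTITION BY client_id ORDER BY date DESC)
               AS connections_within_x_days
      FROM daily
)
SELECT t.client_id, t.date, t.connections, running.connections_within_x_days
  FROM t
  JOIN running
    ON running.client_id = t.client_id
   AND running.date = t.date
 ORDER BY t.client_id, t.date
"""


def connections_within(rows, x=6):
    """Sum connections from each row's date through x days later (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (client_id INTEGER, date TEXT, connections INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY, {"shift": f"-{x + 1} days"}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in connections_within(ROWS, x=6):
        print(row)
```

The work stays proportional to the number of rows in t (two events per row), so no missing dates have to be filled in.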