Match nearest timestamp in Redshift SQL - sql

I have two tables, t1 and t2. For each id in t1 I have multiple records in t2. I want to match each record of t1 to the closest timestamp in t2. t1 has a flag: if it is 1, I want the closest t2 timestamp that is smaller than the t1 timestamp; if it is 0, I want the closest t2 timestamp that is larger.
So altogether I have the following tables:
T1
id, flag, timestamp
T2
id, timestamp
Is there an efficient way to do that?
Edit: here is an example:
T1
customer_id  timestamp_t1    flag
1            01.01.21 12:00  1
2            01.01.21 13:00  0
T2
customer_id  timestamp_t2    additional attributes
1            01.01.21 11:00  attribute1
1            01.01.21 10:00  attribute2
1            01.01.21 13:00  attribute3
2            01.01.21 11:00  attribute4
2            01.01.21 12:00  attribute5
2            01.01.21 14:00  attribute6
2            01.01.21 15:00  attribute7
Result:
customer_id  timestamp_t1    timestamp_t2    flag  additional attributes
1            01.01.21 12:00  01.01.21 11:00  1     attribute1
2            01.01.21 13:00  01.01.21 14:00  0     attribute6
I hope this helps. As you can see in the result, we matched 11:00 of T2 with 12:00 of T1: because the flag was 1, we chose the closest timestamp that was smaller than 12:00. We also matched 14:00 with 13:00 because the flag was 0 (so we matched the closest timestamp for id 2 that is larger than 13:00).

You could use correlated sub-queries to find the rows before/after the timestamp, and then use a CASE expression to pick which to join on...
SELECT
  *
FROM
  t1
INNER JOIN
  t2
    ON t2.id = CASE WHEN t1.flag = 1 THEN
                 (
                   -- latest t2 row at or before the t1 timestamp
                   SELECT t2.id
                   FROM t2
                   WHERE t2.customer_id = t1.customer_id
                     AND t2.timestamp_t2 <= t1.timestamp_t1
                   ORDER BY t2.timestamp_t2 DESC
                   LIMIT 1
                 )
               ELSE
                 (
                   -- earliest t2 row at or after the t1 timestamp
                   SELECT t2.id
                   FROM t2
                   WHERE t2.customer_id = t1.customer_id
                     AND t2.timestamp_t2 >= t1.timestamp_t1
                   ORDER BY t2.timestamp_t2 ASC
                   LIMIT 1
                 )
               END
Oh, you haven't included an id column in your example, this works similarly...
SELECT
  *
FROM
  t1
INNER JOIN
  t2
    ON  t2.customer_id = t1.customer_id
    AND t2.timestamp_t2 = CASE WHEN t1.flag = 1 THEN
                            (
                              SELECT MAX(t2.timestamp_t2)
                              FROM t2
                              WHERE t2.customer_id = t1.customer_id
                                AND t2.timestamp_t2 <= t1.timestamp_t1
                            )
                          ELSE
                            (
                              SELECT MIN(t2.timestamp_t2)
                              FROM t2
                              WHERE t2.customer_id = t1.customer_id
                                AND t2.timestamp_t2 >= t1.timestamp_t1
                            )
                          END
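To try the MAX/MIN variant without a Redshift cluster, here is a minimal sketch that runs it against the sample data in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Redshift here, the column name attr is made up for the "additional attributes" column, and the timestamps are rewritten as ISO strings on the assumption that 01.01.21 means 1 January 2021.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Redshift (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (customer_id INTEGER, timestamp_t1 TEXT, flag INTEGER);
CREATE TABLE t2 (customer_id INTEGER, timestamp_t2 TEXT, attr TEXT);
INSERT INTO t1 VALUES
  (1, '2021-01-01 12:00', 1),
  (2, '2021-01-01 13:00', 0);
INSERT INTO t2 VALUES
  (1, '2021-01-01 11:00', 'attribute1'),
  (1, '2021-01-01 10:00', 'attribute2'),
  (1, '2021-01-01 13:00', 'attribute3'),
  (2, '2021-01-01 11:00', 'attribute4'),
  (2, '2021-01-01 12:00', 'attribute5'),
  (2, '2021-01-01 14:00', 'attribute6'),
  (2, '2021-01-01 15:00', 'attribute7');
""")

# flag = 1: latest t2 timestamp at or before timestamp_t1
# flag = 0: earliest t2 timestamp at or after timestamp_t1
query = """
SELECT t1.customer_id, t1.timestamp_t1, t2.timestamp_t2, t1.flag, t2.attr
FROM t1
JOIN t2
  ON  t2.customer_id = t1.customer_id
  AND t2.timestamp_t2 = CASE WHEN t1.flag = 1 THEN
        (SELECT MAX(x.timestamp_t2) FROM t2 AS x
          WHERE x.customer_id = t1.customer_id
            AND x.timestamp_t2 <= t1.timestamp_t1)
      ELSE
        (SELECT MIN(x.timestamp_t2) FROM t2 AS x
          WHERE x.customer_id = t1.customer_id
            AND x.timestamp_t2 >= t1.timestamp_t1)
      END
ORDER BY t1.customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that <= and >= also accept an exactly equal timestamp; if the match must be strictly smaller or larger, as the question words it, use < and > instead.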

Related

Update column sequentially in Snowflake on historic data

I have a table t1 that includes a date column, say t1_date, and a numeric column mo_sequence that is set to zero for all historic dates.
Another table, t2, includes two date columns, start_date and end_date.
The sample data in t1 can be represented as:
t1_date mo_sequence
2019-01-01 0
2019-01-02 0
2019-01-03 0
Sample data in t2 can be represented as:
start_date end_date
2019-01-01 2019-01-11
2019-02-01 2019-02-11
2019-03-01 2019-03-11
I have to sequentially increment the mo_sequence column as per the query below:
select t1.t1_date,
       t2.start_date,
       t2.end_date,
       t1.mo_sequence + ROW_NUMBER() OVER(ORDER BY SEQ8())
from t1
inner join t2 on t2.start_date = t1.t1_date
where t1.t1_date = t2.start_date
  and t1.t1_date between t2.start_date and t2.end_date
This returns:
t1_date mo_sequence
2019-01-01 1
2019-02-01 2
2019-03-01 3
But when I run the update, the values are not assigned sequentially:
update t1 set t1.mo_sequence = t3.sequen from
(select t1.t1_date,
t2.start_date,
t2.end_date,t1.mo_sequence + ROW_NUMBER() OVER(ORDER BY SEQ8()) as sequen
from t1
inner join t2 on t2.start_date = t1.t1_date
where t1.t1_date = t2.start_date and t1.t1_date between t2.start_date and
t2.end_date) t3

Bigquery select rows where the logtime is below min(value) of other table logtime

Let's say I have the following two tables:
Table 1:
ID log_time
1 2013-10-12
1 2014-11-15
2 2013-12-21
2 2016-12-21
3 2015-09-21
3 2018-03-21
Table 2:
ID log_time
1 2011-10-12
1 2012-11-15
2 2012-12-21
2 2017-12-21
3 2014-09-21
3 2019-03-21
I want to get rows of Table 2 which are below min(log_time) of Table1 for each ID.
The result should be like this:
ID log_time
1 2011-10-12
1 2012-11-15
2 2012-12-21
3 2014-09-21
This is a join and aggregation:
select t2.*
from table2 t2 join
     (select t1.id, min(t1.log_time) as min_log_time
      from table1 t1
      group by t1.id
     ) t1
     on t2.id = t1.id and t2.log_time < t1.min_log_time;
You can also express this as a correlated subquery:
select t2.*
from table2 t2
where t2.log_time < (select min(t1.log_time) from table1 t1 where t1.id = t2.id);
Note that both of these formulations will return no rows for ids missing from table1 (which is quite consistent with your question).
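As a quick check that the two formulations agree, here is a small sketch that runs both against the sample data in an in-memory SQLite database via Python's sqlite3. SQLite stands in for BigQuery here (an assumption), and the derived-table alias m is illustrative.

```python
import sqlite3

# In-memory SQLite database as a stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, log_time TEXT);
CREATE TABLE table2 (id INTEGER, log_time TEXT);
INSERT INTO table1 VALUES
  (1, '2013-10-12'), (1, '2014-11-15'),
  (2, '2013-12-21'), (2, '2016-12-21'),
  (3, '2015-09-21'), (3, '2018-03-21');
INSERT INTO table2 VALUES
  (1, '2011-10-12'), (1, '2012-11-15'),
  (2, '2012-12-21'), (2, '2017-12-21'),
  (3, '2014-09-21'), (3, '2019-03-21');
""")

# Formulation 1: join against the per-id minimum of table1.
join_rows = conn.execute("""
SELECT t2.id, t2.log_time
FROM table2 AS t2
JOIN (SELECT id, MIN(log_time) AS min_log_time
      FROM table1
      GROUP BY id) AS m
  ON t2.id = m.id AND t2.log_time < m.min_log_time
ORDER BY t2.id, t2.log_time
""").fetchall()

# Formulation 2: correlated subquery.
subquery_rows = conn.execute("""
SELECT t2.id, t2.log_time
FROM table2 AS t2
WHERE t2.log_time < (SELECT MIN(t1.log_time)
                     FROM table1 AS t1
                     WHERE t1.id = t2.id)
ORDER BY t2.id, t2.log_time
""").fetchall()

print(join_rows)
print(join_rows == subquery_rows)
```

ISO-formatted date strings sort in chronological order, which is why plain text comparison is enough in this sketch.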

Reusing value for multiple dates in SQL

I have a table that looks like this
ID Type Change_Date
1 t1 2015-10-08
1 t2 2016-01-03
1 t3 2016-03-07
2 t1 2017-12-13
2 t2 2018-02-01
It shows whether a customer has changed account type and when. However, I'd like a query that gives me the following output:
ID Type Change_Date
1 t1 2015-10
1 t1 2015-11
1 t1 2015-12
1 t2 2016-01
1 t2 2016-02
1 t3 2016-03
1 t3 2016-04
... ... ...
1 t3 2018-10
for each ID. The output shows what account type the customer had for each month until the current month. My problem is filling in the "empty" months. In some cases the interval between account changes can be more than a year.
I hope this makes sense.
Thanks in advance.
Based on Presto SQL (because your original question is about Presto/SQL).
Update on 2018-11-01: use lead() to simplify the SQL.
Prepare data
Table mytable, same as yours:
id type update_date
1 t1 2015-10-08
1 t2 2016-01-03
1 t3 2016-03-07
2 t1 2017-12-13
2 t2 2018-02-01
Table t_month is a dictionary table that holds every month from 2015-01 to 2019-12. Dictionary tables like this are useful.
ym
2015-01
2015-02
2015-03
2015-04
2015-05
2015-06
2015-07
2015-08
2015-09
...
2019-12
Add lifespan for mytable
Normally, you should manage your data with a lifespan for each row, so mytable would look like this:
id type start_date end_date
1 t1 2015-10-08 2016-01-03
1 t2 2016-01-03 2016-03-07
1 t3 2016-03-07 null
2 t1 2017-12-13 2018-02-01
2 t2 2018-02-01 null
But in this case you don't, so the next step is to create one using the lead() window function.
select
  id,
  type,
  date_format(update_date, '%Y-%m') as start_month,
  lead(
    date_format(update_date, '%Y-%m'),
    1, -- next one
    date_format(current_date+interval '1' month, '%Y-%m') -- if null return next month
  ) over(partition by id order by update_date) as end_month
from mytable
Output
id type start_month end_month
1 t1 2015-10 2016-01
1 t2 2016-01 2016-03
1 t3 2016-03 2018-11
2 t1 2017-12 2018-02
2 t2 2018-02 2018-11
Cross join id and month
It's simple
with id_month as (
  select * from t_month
  cross join (select distinct id from mytable)
)
select * from id_month
Output
ym id
2015-01 1
2015-02 1
2015-03 1
...
2019-12 1
2015-01 2
2015-02 2
2015-03 2
...
2019-12 2
Finally
Now you can use a subquery in the select clause:
select
  id,
  type,
  ym
from (
  select
    t1.id,
    t1.ym,
    (select type from mytable2 where t1.id = id and t1.ym >= start_month and t1.ym < end_month) as type
  from id_month t1
)
where type is not null
-- order by id, ym
Full SQL
with mytable2 as (
  select
    id,
    type,
    date_format(update_date, '%Y-%m') as start_month,
    lead(
      date_format(update_date, '%Y-%m'),
      1, -- next one
      date_format(current_date+interval '1' month, '%Y-%m') -- if null return next month
    ) over(partition by id order by update_date) as end_month
  from mytable
)
, id_month as (
  select * from t_month
  cross join (select distinct id from mytable)
)
select
  id,
  type,
  ym
from (
  select
    t1.id,
    t1.ym,
    (select type from mytable2 where t1.id = id and t1.ym >= start_month and t1.ym < end_month) as type
  from id_month t1
)
where type is not null
--order by id, ym
Output
id type ym
1 t1 2015-10
1 t1 2015-11
1 t1 2015-12
1 t2 2016-01
1 t2 2016-02
1 t3 2016-03
1 t3 2016-04
...
1 t3 2018-10
2 t1 2017-12
2 t1 2018-01
2 t2 2018-02
...
2 t2 2018-10
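The same lifespan idea can be tried locally with the sketch below, which uses SQLite (3.25 or later, for window functions) through Python's sqlite3 as a stand-in for Presto. It is a variation rather than a copy of the query above: strftime replaces date_format, a range join replaces the scalar subquery and the id cross join, and the open-ended end month is a fixed, hypothetical value (NEXT_MONTH = '2018-11') instead of current_date so the output is reproducible.

```python
import sqlite3

# Hypothetical fixed stand-in for "next month" so the result is reproducible.
NEXT_MONTH = "2018-11"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER, type TEXT, update_date TEXT);
INSERT INTO mytable VALUES
  (1, 't1', '2015-10-08'), (1, 't2', '2016-01-03'), (1, 't3', '2016-03-07'),
  (2, 't1', '2017-12-13'), (2, 't2', '2018-02-01');
CREATE TABLE t_month (ym TEXT);
""")

# Dictionary table with every month from 2015-01 to 2019-12.
conn.executemany(
    "INSERT INTO t_month VALUES (?)",
    [(f"{year}-{month:02d}",) for year in range(2015, 2020) for month in range(1, 13)],
)

query = """
WITH mytable2 AS (
  SELECT id,
         type,
         strftime('%Y-%m', update_date) AS start_month,
         -- the next change for the same id closes the lifespan;
         -- the last row of each id stays open until NEXT_MONTH
         COALESCE(
           lead(strftime('%Y-%m', update_date))
             OVER (PARTITION BY id ORDER BY update_date),
           :next_month
         ) AS end_month
  FROM mytable
)
SELECT s.id, s.type, m.ym
FROM mytable2 AS s
JOIN t_month AS m
  ON m.ym >= s.start_month AND m.ym < s.end_month
ORDER BY s.id, m.ym
"""
rows = conn.execute(query, {"next_month": NEXT_MONTH}).fetchall()
for row in rows[:6]:
    print(row)
```

Joining on the month range directly keeps only the months that fall inside a lifespan, so the "where type is not null" filter from the subquery version is not needed in this variant.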

MSSQL get rows which only differ at 2 columns

I have a task and no idea how to approach it.
I have to find records that have a time difference of X and where a boolean is ON/OFF. I tried to use a LEFT OUTER JOIN with the conditions in the ON clause, but it gave me the wrong result.
So my question is: how can I select rows that have the same value in 2 columns, but different values in 2 other columns?
Edit:
My problem is that, for some reason, my actual query returns the same entry multiple times. I checked whether the entry exists multiple times, but it doesn't.
Data for reference:
ID1 ID2 Boolean Time
1 1 0 2018-03-06 11:31:39
1 1 1 2018-03-06 11:33:39
2 1 0 2018-03-06 11:31:39
2 2 1 2018-03-06 11:40:39
The desired output from the query would be
ID1 ID2 Boolean Time
1 1 0 2018-03-06 11:31:39
1 1 1 2018-03-06 11:33:39
because ID1 and ID2 are the same, the Boolean is different and the time difference is in the specified range (let's say 5 minutes). The other 2 entries are not valid, because ID2 differs and the time difference is too big.
My current query:
select
t1.id1,
t1.id2,
t1.boolean,
t1.time
from t1 t1
left outer join t1 t2
on t1.boolean != t2.boolean and datediff(minute, t1.time, t2.time)<=5
where t1.id1 = t2.id1
and t1.id2 = t2.id2
Your query looks fine; I found a few small issues:
1- The table alias used is wrong; instead of t it should be t1
2- The order of the data is wrong
3- Changed the left join to an inner join
4- Modified the ON and WHERE conditions for better readability and performance
Check the following corrected query.
WITH t1 AS
(
  SELECT * FROM (VALUES
    (1 , 1 , 0 , '2018-03-06 11:31:39'),
    (1 , 1 , 1 , '2018-03-06 11:33:39'),
    (2 , 1 , 0 , '2018-03-06 11:31:39'),
    (2 , 2 , 1 , '2018-03-06 11:40:39')
  ) T( ID1, ID2 , Boolean, Time)
)
select
  t1.id1,
  t1.id2,
  t1.boolean,
  t1.time
from t1 t1
inner join t1 t2
  on t1.id1 = t2.id1 and t1.id2 = t2.id2
where
  t1.boolean != t2.boolean
  -- ABS so that the row order does not matter for the 5-minute window
  and abs(datediff(minute, t1.time, t2.time)) <= 5
ORDER BY t1.time
Output
+-----+-----+---------+---------------------+
| id1 | id2 | boolean | time |
+-----+-----+---------+---------------------+
| 1 | 1 | 0 | 2018-03-06 11:31:39 |
+-----+-----+---------+---------------------+
| 1 | 1 | 1 | 2018-03-06 11:33:39 |
+-----+-----+---------+---------------------+
To avoid duplicate rows, use GROUP BY:
SELECT t1.id1
,t1.id2
,t1.boolean
,t1.TIME
FROM t1 t1
INNER JOIN t1 t2 ON t1.boolean != t2.boolean
AND datediff(minute, t1.TIME, t2.TIME) <= 5
WHERE t1.id1 = t2.id1
AND t1.id2 = t2.id2
GROUP BY t1.id1
,t1.id2
,t1.boolean
,t1.TIME
Alternatively, filter with EXISTS, which returns each qualifying row only once:
SELECT
  D1.*
FROM
  Data AS D1
WHERE
  EXISTS (
    SELECT
      1
    FROM
      Data AS D2
    WHERE
      D1.ID1 = D2.ID1 AND
      D1.ID2 = D2.ID2 AND
      ~D1.Boolean = D2.Boolean AND
      ABS(DATEDIFF(MINUTE, D1.Time, D2.Time)) <= 5)
ORDER BY
  D1.ID1,
  D1.Boolean,
  D1.Time
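The EXISTS approach can be tried on the sample rows with the following sketch, which uses SQLite through Python's sqlite3 instead of SQL Server. The names data, flag and event_time are illustrative renames of the question's table and its Boolean and Time columns, and the time window is computed from epoch seconds because SQLite has no DATEDIFF.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (id1 INTEGER, id2 INTEGER, flag INTEGER, event_time TEXT);
INSERT INTO data VALUES
  (1, 1, 0, '2018-03-06 11:31:39'),
  (1, 1, 1, '2018-03-06 11:33:39'),
  (2, 1, 0, '2018-03-06 11:31:39'),
  (2, 2, 1, '2018-03-06 11:40:39');
""")

# Keep a row only if a partner row exists with the same id1 and id2,
# the opposite flag, and a timestamp at most 5 minutes (300 s) away.
query = """
SELECT d1.id1, d1.id2, d1.flag, d1.event_time
FROM data AS d1
WHERE EXISTS (
  SELECT 1
  FROM data AS d2
  WHERE d2.id1 = d1.id1
    AND d2.id2 = d1.id2
    AND d2.flag <> d1.flag
    AND ABS(strftime('%s', d2.event_time) - strftime('%s', d1.event_time)) <= 5 * 60
)
ORDER BY d1.event_time
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

One difference to keep in mind: this sketch compares elapsed seconds, whereas DATEDIFF(MINUTE, ...) in SQL Server counts minute boundaries crossed, so the two can disagree for pairs that are close to the 5-minute limit.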

Waterfall join conditions

I have two tables similar to:
Table 1 --unique ID's
ID Date
1 3/8/2017
2 3/8/2017
3 3/8/2017
Table 2
ID Date SourceID
1 3/8/2017 1
1 3/8/2017 2
1 3/8/2017 3
2 3/8/2017 2
3 3/8/2017 1
3 3/8/2017 3
And I want to write a query that has a result like:
Result
ID SourceID
1 2
2 2
3 1
Where the source ID ordering should be 2, 1, 3
I have:
select Table1.ID
, COALESCE(Join1.SourceID, Join2.SourceID, Join3.SourceID) as SourceID
from Table1
left outer join Table2 Join1
on Table1.date = Join1.date
and Table1.ID = Join1.ID
and Join1.SourceID = 2
left outer join Table2 Join2
on Table1.date = Join2.date
and Table1.ID = Join2.ID
and Join2.SourceID = 1
and Join1.SourceID is null
left outer join Table2 Join3
on Table1.date = Join3.date
and Table1.ID = Join3.ID
and Join3.SourceID = 3
and Join1.SourceID is null
and Join2.SourceID is null
But this currently just keeps the records where sourceid = 2 and does not add in the other sourceid's.
Thanks in advance for any help. Let me know if you need any clarification. Using SQL-Server. I only need a few and fixed amount of sources so I am avoiding using a cursor.
This is a prioritization query. I would do it using outer apply:
select t1.*, t2.sourceId
from table1 t1 outer apply
     (select top 1 t2.*
      from table2 t2
      where t2.id = t1.id and t2.date = t1.date
      order by (case t2.sourceid when 2 then 1 when 1 then 2 when 3 then 3 end)
     ) t2;
Note: For readability, you can simplify the order by to:
order by charindex(cast(t2.sourceId as varchar(255)), '2,1,3')
If you are uncomfortable with outer apply, you can do the same thing with a single join:
select t1.*, t2.sourceId
from table1 t1 join
     (select t2.*,
             row_number() over (partition by id, date
                                order by (case t2.sourceid when 2 then 1 when 1 then 2 when 3 then 3 end)
                               ) as seqnum
      from table2 t2
     ) t2
     on t2.id = t1.id and t2.date = t1.date and t2.seqnum = 1;
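The row_number() version is portable enough to try outside SQL Server. The sketch below runs it on the sample data in SQLite (3.25 or later) through Python's sqlite3; the column name source_id, the alias ranked, and the ISO date format are choices made for this example only.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, date TEXT);
CREATE TABLE table2 (id INTEGER, date TEXT, source_id INTEGER);
INSERT INTO table1 VALUES (1, '2017-03-08'), (2, '2017-03-08'), (3, '2017-03-08');
INSERT INTO table2 VALUES
  (1, '2017-03-08', 1), (1, '2017-03-08', 2), (1, '2017-03-08', 3),
  (2, '2017-03-08', 2),
  (3, '2017-03-08', 1), (3, '2017-03-08', 3);
""")

# Rank the sources per (id, date) in the priority order 2, 1, 3
# and keep only the best-ranked row.
query = """
SELECT t1.id, ranked.source_id
FROM table1 AS t1
JOIN (
  SELECT t2.*,
         ROW_NUMBER() OVER (
           PARTITION BY id, date
           ORDER BY CASE source_id WHEN 2 THEN 1 WHEN 1 THEN 2 WHEN 3 THEN 3 END
         ) AS seqnum
  FROM table2 AS t2
) AS ranked
  ON ranked.id = t1.id AND ranked.date = t1.date AND ranked.seqnum = 1
ORDER BY t1.id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because this is an inner join, an id in table1 with no rows at all in table2 would drop out of the result; a left join would keep it with a null source, which is what outer apply does.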