LEFT JOIN ON most recent date in Google BigQuery - sql

I've got two tables, both with timestamps and some more data:
Table A
| name | timestamp | a_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:10:00 | a |
| 2 | 2018-01-01 12:20:00 | b |
| 3 | 2018-01-01 13:30:00 | c |
Table B
| name | timestamp | b_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:00:00 | w |
| 2 | 2018-01-01 12:00:00 | x |
| 3 | 2018-01-01 13:00:00 | y |
| 3 | 2018-01-01 13:10:00 | y |
| 3 | 2018-01-01 13:10:00 | z |
What I want to do is
For each row in Table A LEFT JOIN the most recent record in Table B that predates it.
When there is more than one possibility take the last one
Target Result
| name | timestamp | a_data | b_data |
| ---- | ------------------- | ------ | ------ |
| 1 | 2018-01-01 11:10:00 | a | w |
| 2 | 2018-01-01 12:20:00 | b | x |
| 3 | 2018-01-01 13:30:00 | c | z | <-- note z, not y
I think this involves a subquery, but I cannot get this to work in Big Query. What I have so far:
SELECT a.a_data, b.b_data
FROM `table_a` AS a
LEFT JOIN `table_b` AS b
ON a.name = b.name
WHERE a.timestamp = (
SELECT max(timestamp) from `table_b` as sub
WHERE sub.name = b.name
AND sub.timestamp < a.timestamp
)
On my actual dataset, which is a very small test set (under 2 MB), the query runs but never completes. Any pointers much appreciated 👍🏻

You can try to use a select subquery.
SELECT a.*,(
SELECT MAX(b.b_data)
FROM `table_b` AS b
WHERE
a.name = b.name
and
b.timestamp < a.timestamp
) b_data
FROM `table_a` AS a
EDIT
Or you can use the ROW_NUMBER window function in a subquery.
SELECT name,timestamp,a_data , b_data
FROM (
SELECT a.*,b.b_data,ROW_NUMBER() OVER(PARTITION BY a.name, a.timestamp ORDER BY b.timestamp desc,b.b_data desc) rn
FROM `table_a` AS a
LEFT JOIN `table_b` AS b ON a.name = b.name AND b.timestamp < a.timestamp
) t1
WHERE rn = 1
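
To check the ROW_NUMBER idea against the sample data from the question, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for BigQuery. This is an assumption for illustration only: SQLite's dialect differs from BigQuery's, window functions need SQLite 3.25 or newer, and breaking timestamp ties on b_data is just one way to make the query pick z rather than y.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (name INTEGER, timestamp TEXT, a_data TEXT);
CREATE TABLE table_b (name INTEGER, timestamp TEXT, b_data TEXT);
INSERT INTO table_a VALUES
  (1, '2018-01-01 11:10:00', 'a'),
  (2, '2018-01-01 12:20:00', 'b'),
  (3, '2018-01-01 13:30:00', 'c');
INSERT INTO table_b VALUES
  (1, '2018-01-01 11:00:00', 'w'),
  (2, '2018-01-01 12:00:00', 'x'),
  (3, '2018-01-01 13:00:00', 'y'),
  (3, '2018-01-01 13:10:00', 'y'),
  (3, '2018-01-01 13:10:00', 'z');
""")

# Rank the earlier B rows for each A row, newest first; ties on the
# timestamp are broken by b_data so the result is deterministic.
query = """
SELECT name, timestamp, a_data, b_data
FROM (
  SELECT a.name, a.timestamp, a.a_data, b.b_data,
         ROW_NUMBER() OVER (
           PARTITION BY a.name, a.timestamp
           ORDER BY b.timestamp DESC, b.b_data DESC
         ) AS rn
  FROM table_a AS a
  LEFT JOIN table_b AS b
    ON a.name = b.name AND b.timestamp < a.timestamp
) AS t1
WHERE rn = 1
ORDER BY name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the time condition sits in the ON clause rather than in WHERE, rows of table_a without any earlier match in table_b would still be returned, with b_data as NULL.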

The following is for BigQuery Standard SQL and does not require listing all columns on both sides, only name and timestamp. It therefore works for any number of columns in both tables (assuming no column names clash other than the two mentioned above).
#standardSQL
SELECT a.*, b.* EXCEPT (name, timestamp)
FROM (
SELECT
ANY_VALUE(a) a,
ARRAY_AGG(b ORDER BY b.timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] b
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
USING (name)
WHERE a.timestamp > b.timestamp
GROUP BY TO_JSON_STRING(a)
)

In BigQuery, arrays are often an efficient way to solve such problems:
SELECT a.a_data, b.b_data
FROM `table_a` a LEFT JOIN
(SELECT b.name,
ARRAY_AGG(b.b_data ORDER BY b.timestamp DESC LIMIT 1)[OFFSET(0)] as b_data
FROM `table_b` b
GROUP BY b.name
) b
ON a.name = b.name;

This is a common case where you can't just GROUP BY and get the minimum. I suggest the following:
SELECT *
FROM table_a as a inner join (SELECT name, min(timestamp) as timestamp
FROM table_b group by 1) as b
on (a.timestamp = b.timestamp and a.name = b.name)
This way you limit it only to the minimum present in Table b, as you specified.
You can also achieve that in a more readable way using the WITH statement:
WITH min_b as (
SELECT name,
min(timestamp) as timestamp
FROM table_b group by 1
)
SELECT *
FROM table_a as a inner join min_b
on (a.timestamp = min_b.timestamp and a.name = min_b.name)
Let me know if it worked!

Related

How to select only 1 row from ordered table for each ID?

This is my SQL code:
SELECT a.ID
, a.Date
, a.Value
, b.Alias
FROM NAV a
LEFT JOIN Portfolio b ON a.ID = b.ID
ORDER BY a.ID, a.Date DESC, b.Alias, a.Value
It gives me a table that looks something like this:
| ID | Date | Value | Alias |
|----|------|-------|-------|
| 1 | 2021 | 300 | A |
| 1 | 2020 | 200 | A |
| 1 | 2019 | 400 | A |
| 2 | 2021 | 800 | B |
| 2 | 2020 | 700 | B |
| 3 | 2021 | 600 | C |
| 3 | 2019 | 300 | C |
| 3 | 2018 | 500 | C |
I want to choose only the first row for each ID. How would I go about doing that? Apologies for the basic question, I am new to SQL.
You can use row_number():
SELECT n.ID, n.Date, n.Value, p.Alias
FROM (SELECT n.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY Date DESC) as seqnum
FROM NAV n
) n LEFT JOIN
Portfolio p
ON p.ID = n.ID
WHERE seqnum = 1
ORDER BY n.ID;
Note:
Use meaningful table aliases instead of arbitrary letters.
I doubt a LEFT JOIN is needed. Are there really values of ID in NAV that are not in Portfolio?
ANSI standard SQL (assuming this returns the correct results):
Query the NAV table twice.
First, get the row of interest for each ID (the derived table aliased as c).
Then use that row to get the relevant a.Value by joining c.ID and c.Date to a.ID and a.Date.
SELECT c.ID, c.Date, a.Value, b.Alias
FROM (select Id, max(date) as Date from NAV group by Id) c
inner join Nav a on a.ID = c.Id and a.date = c.date
left join Portfolio b on b.id = a.id
ORDER BY c.ID, c.Date DESC, b.Alias, a.Value;
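
The GROUP BY plus self-join approach can be tried on the sample rows with a small Python sketch, again using the standard-library sqlite3 module as a stand-in database (an assumption; the question does not say which engine is used). The alias m for the per-ID maximum is hypothetical, and the Portfolio rows are reconstructed from the Alias column of the output shown in the question.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NAV (ID INTEGER, Date INTEGER, Value INTEGER);
CREATE TABLE Portfolio (ID INTEGER, Alias TEXT);
INSERT INTO NAV VALUES
  (1, 2021, 300), (1, 2020, 200), (1, 2019, 400),
  (2, 2021, 800), (2, 2020, 700),
  (3, 2021, 600), (3, 2019, 300), (3, 2018, 500);
INSERT INTO Portfolio VALUES (1, 'A'), (2, 'B'), (3, 'C');
""")

# Find the latest Date per ID, then join back to NAV to fetch that row's Value.
latest = conn.execute("""
SELECT n.ID, n.Date, n.Value, p.Alias
FROM NAV AS n
JOIN (SELECT ID, MAX(Date) AS Date FROM NAV GROUP BY ID) AS m
  ON m.ID = n.ID AND m.Date = n.Date
LEFT JOIN Portfolio AS p ON p.ID = n.ID
ORDER BY n.ID
""").fetchall()
for row in latest:
    print(row)
```

If two NAV rows shared the same ID and the same latest Date, this form would return both, whereas the ROW_NUMBER form returns exactly one row per ID.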

Split a date range in SQL Server

I'm struggling with a solution for a problem but I couldn't find anything similar here.
I have a table "A" like:
+---------+------------+------------+-----------+
| user_id | from | to | attribute |
+---------+------------+------------+-----------+
| 1 | 2020-01-01 | 2020-12-31 | abc |
+---------+------------+------------+-----------+
and I get a table "B" like:
+---------+------------+------------+-----------+
| user_id | from | to | attribute |
+---------+------------+------------+-----------+
| 1 | 2020-03-01 | 2020-04-15 | def |
+---------+------------+------------+-----------+
And what I need is:
+---------+------------+------------+-----------+
| user_id | from | to | attribute |
+---------+------------+------------+-----------+
| 1 | 2020-01-01 | 2020-02-29 | abc |
| 1 | 2020-03-01 | 2020-04-15 | def |
| 1 | 2020-04-16 | 2020-12-31 | abc |
+---------+------------+------------+-----------+
I tried just using insert and update but I couldn't figure out how to simultaneously do both. Is there a much simpler way? I read about CTE, could this be an approach?
I'd be very thankful for your help!
Edit: more examples
TABLE A
| user_id | from | to | attribute |
+=========+============+============+===========+
| 1 | 2020-01-01 | 2020-12-31 | atr1 |
| 1 | 2021-01-01 | 2021-12-31 | atr2 |
| 2 | 2020-01-01 | 2021-06-15 | atr1 |
| 3 | 2020-01-01 | 2021-06-15 | atr3 |
TABLE B
| user_id | from | to | attribute |
+=========+============+============+===========+
| 1 | 2020-09-01 | 2021-02-15 | atr3 |
| 2 | 2020-04-15 | 2020-05-31 | atr2 |
| 3 | 2021-04-01 | 2022-01-01 | atr1 |
OUTPUT:
| user_id | from | to | attribute |
+=========+============+============+===========+
| 1 | 2020-01-01 | 2020-08-31 | atr1 |
| 1 | 2020-09-01 | 2021-02-15 | atr3 |
| 1 | 2021-02-16 | 2021-12-31 | atr2 |
| 2 | 2020-01-01 | 2020-04-14 | atr1 |
| 2 | 2020-04-15 | 2020-05-31 | atr2 |
| 2 | 2020-06-01 | 2021-06-15 | atr1 |
| 3 | 2020-01-01 | 2021-03-31 | atr3 |
| 3 | 2021-04-01 | 2022-01-01 | atr1 |
Initially I just asked to split the date range and make a new row because the new attribute of table B is between the one in table A. But it's only a part of the problem. Maybe it's more clear with the new dataset(?)
Sample data,
create table #TableA( userid int, fromdt date
,todt date, attribute varchar(10))
insert into #TableA (userid , fromdt , todt , attribute)
values
( 1 ,'2020-01-01','2020-12-31' , 'atr1' ),
( 1 ,'2021-01-01','2021-12-31' , 'atr2' ),
( 2 ,'2020-01-01','2021-06-15' , 'atr1' ),
( 3 ,'2020-01-01','2021-06-15' , 'atr3' )
create table #TableB( userid int,fromdt date
,todt date, attribute varchar(10))
insert into #TableB (userid,fromdt, todt, attribute)
values
( 1 ,'2020-09-01','2021-02-15' , 'atr3' ),
( 2 ,'2020-04-15','2020-05-31' , 'atr2' ),
( 3 ,'2021-04-01','2022-01-01' , 'atr1' )
;
The Script,
;WITH CTE
AS (
SELECT *
FROM #TableA
UNION ALL
SELECT *
FROM #TableB
)
,CTE2
AS (
SELECT userid
,min(fromdt) minfromdt
,max(todt) maxtodt
FROM CTE
GROUP BY userid
)
,CTE3
AS (
SELECT c.userid
,c.fromdt
,c.todt
,c.attribute
,LEAD(c.fromdt, 1) OVER (
PARTITION BY c.userid ORDER BY c.fromdt
) LeadFromdt
FROM CTE c
)
,CTE4
AS (
SELECT c3.userid
,c3.fromdt
,CASE
WHEN c3.todt > c3.LeadFromdt
THEN dateadd(day, - 1, c3.leadfromdt)
--when c3.todt<c3.LeadFromdt then dateadd(day,-1,c3.leadfromdt)
ELSE c3.todt
END AS Todt
,
--c3.todt as todt1,
c3.attribute
FROM CTE3 c3
)
,CTE5
AS (
SELECT userid
,fromdt
,todt
,attribute
FROM CTE4
UNION ALL
SELECT c2.userid
,dateadd(day, 1, c4.Todt) AS Fromdt
,maxtodt AS Todt
,c4.attribute
FROM CTE2 c2
CROSS APPLY (
SELECT TOP 1 c4.todt
,c4.attribute
FROM cte4 c4
WHERE c2.userid = c4.userid
ORDER BY c4.Todt DESC
) c4
WHERE c2.maxtodt > c4.Todt
)
SELECT *
FROM CTE5
ORDER BY userid
,fromdt
drop table #TableA, #TableB
Your expected output looks wrong.
Also, please add to the same example any other sample data
for which my script does not work.
The easiest way is to work with a calendar table. You can create one and reuse it later.
When you have one (here I called it "AllDates"), you can do something like this:
WITH cte
as
(
select ad.theDate,u.userid,isnull(b.attrib,a.attrib) as attrib,
ROW_NUMBER() over (PARTITION BY u.userid, isnull(b.attrib,a.attrib)ORDER BY ad.theDate)
- ROW_NUMBER() over (PARTITION BY u.userid ORDER BY ad.theDate) as grp
from AllDates ad
cross join (select userid from tableA union select userid from tableB) u
left join tableB b on ad.theDate between b.frm and b.toD and u.userid = b.userid
left join tableA a on ad.theDate between a.frm and a.toD and u.userid = a.userid
where b.frm is not null
or a.frm is not null
)
SELECT userid,attrib,min(theDate) as frmD, max(theDate) as toD
FROM cte
GROUP BY userid,attrib,grp
ORDER BY 1,3;
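
The calendar idea (decide the attribute day by day, letting table B win over table A, then collapse consecutive days back into ranges) can be sketched outside the database as well. The Python below is only an illustration under stated assumptions: the function name overlay is hypothetical, the rows are the extended sample data from the question, and a per-day dictionary stands in for the calendar table.

```python
from datetime import date, timedelta

# Sample rows from the question: (user_id, from, to, attribute)
table_a = [
    (1, date(2020, 1, 1), date(2020, 12, 31), "atr1"),
    (1, date(2021, 1, 1), date(2021, 12, 31), "atr2"),
    (2, date(2020, 1, 1), date(2021, 6, 15), "atr1"),
    (3, date(2020, 1, 1), date(2021, 6, 15), "atr3"),
]
table_b = [
    (1, date(2020, 9, 1), date(2021, 2, 15), "atr3"),
    (2, date(2020, 4, 15), date(2020, 5, 31), "atr2"),
    (3, date(2021, 4, 1), date(2022, 1, 1), "atr1"),
]


def overlay(base, override):
    """Merge two sets of ranges; rows in `override` win where they overlap."""
    per_day = {}  # (user_id, day) -> attribute
    for rows in (base, override):  # override is written last, so it wins
        for user_id, start, end, attribute in rows:
            for offset in range((end - start).days + 1):
                per_day[(user_id, start + timedelta(days=offset))] = attribute

    merged = []
    for (user_id, day), attribute in sorted(per_day.items()):
        last = merged[-1] if merged else None
        if (last is not None and last[0] == user_id and last[3] == attribute
                and last[2] + timedelta(days=1) == day):
            last[2] = day  # same user and attribute on the next day: extend
        else:
            merged.append([user_id, day, day, attribute])  # start a new range
    return [tuple(row) for row in merged]


result = overlay(table_a, table_b)
for row in result:
    print(row)
```

Expanding every range into individual days is simple but not cheap, so for large tables the set-based SQL version is likely the better fit.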
If I understand the request correctly, the data from table A should be merged into table B to fill the gaps based on four scenarios. Here is how I achieved it:
/*
Scenario 1 - Use dates from B as base to be filled in from A
- Start and end dates from B
*/
SELECT
B.UserId,
B.StartDate,
B.EndDate,
B.Attr
FROM #tmpB AS B
UNION
/*
Scenario 2 - Start date between start and end date of another record
- End date from B plus one day as start date
- End date from A as end date
*/
SELECT
B.UserId,
DATEADD(DD, 1, B.EndDate) AS StartDate,
A.EndDate,
A.Attr
FROM #tmpB AS B
JOIN #tmpA AS A ON
B.UserId = A.UserId
AND B.StartDate < A.StartDate
AND B.EndDate > A.StartDate
UNION
/*
Scenario 3 - End date between start and end date of another record or both dates between start and end date of another record
- Start date from A as start date
- Start date from B minus one as end date
*/
SELECT
B.UserId,
A.StartDate,
DATEADD(DD, -1, B.StartDate) AS EndDate,
A.Attr
FROM #tmpB AS B
JOIN #tmpA AS A ON
B.UserId = A.UserId
AND (B.StartDate < A.EndDate AND B.EndDate > A.EndDate
OR B.StartDate BETWEEN A.StartDate AND A.EndDate AND B.EndDate BETWEEN A.StartDate AND A.EndDate)
UNION
/*
Scenario 4 - Both dates between start and end date of another record
- End date from B plus one as start date
- End date from A as end date
*/
SELECT
B.UserId,
DATEADD(DD, 1, B.EndDate) AS StartDate,
A.EndDate,
A.Attr
FROM #tmpB AS B
JOIN #tmpA AS A ON
B.UserId = A.UserId
AND B.StartDate BETWEEN A.StartDate AND A.EndDate
AND B.EndDate BETWEEN A.StartDate AND A.EndDate

generating value rows in between dates

I have a data table that lists id changes on a given date. Structure is the following (Table A):
+----------------------------------------------------------+
| person current_id previous_id action date |
+----------------------------------------------------------+
| A 1 0 'id assignment' 2019-01-01 |
| B 2 1 'id change' 2019-01-03 |
| A 2 1 'id change' 2019-01-02 |
| C 4 2 'id change' 2019-01-03 |
| ... ... ... ... ... |
+----------------------------------------------------------+
However Table A provides a date only if there was a change on that date.
For a traceability study, I am trying to create a data table (Table B below) using Table A. Each day should contain the corresponding id for the existing people in that table (using hive).
Something like this (Table B):
+---------------------------+
| date person id |
+---------------------------+
| 2019-01-01 A 1 |
| 2019-01-01 B 1 |
| 2019-01-01 C 2 |
| 2019-01-02 A 2 |
| 2019-01-02 B 1 |
| 2019-01-02 C 2 |
| 2019-01-03 A 2 |
| 2019-01-03 B 2 |
| 2019-01-03 C 4 |
| ... ... ... |
+---------------------------+
All I can do is get the time-independent current ids for the people mentioned. I have no idea where to start on generating the output table; I cannot work out the logic.
Thanks in advance for your help!
First, you need to generate the rows. Assuming that you have at least one change on each day, you can use a cross join.
Then you need to impute the value for each day. The simplest method would use lag() with the ignore nulls option, but I don't think Hive supports this.
Instead, two levels of window functions can work:
select person, date,
coalesce(current_id,
max(current_id) over (partition by person, id_date)
) as id
from (select p.person, d.date, a.current_id,
max(case when a.current_id is not null then d.date end) over (partition by p.person order by d.date) as id_date
from (select distinct person from tablea a) p cross join
(select distinct date from tablea a) d left join
tablea a
on p.person = a.person and d.date = a.date
) pd;
If you cannot use cross join, perhaps this will work:
from (select distinct person, 1 as joinkey from tablea a) p join
(select distinct date, 1 as joinkey from tablea a) d
on p.joinkey = d.joinkey left join
tablea a
on p.person = a.person and d.date = a.date
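
To see what the two-level window query produces, here is a rough Python sketch that runs it against the four sample rows, using the standard-library sqlite3 module in place of Hive. That substitution is an assumption made only so the query can be executed locally; window functions need SQLite 3.25 or newer.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (person TEXT, current_id INTEGER, previous_id INTEGER,
                     action TEXT, date TEXT);
INSERT INTO tablea VALUES
  ('A', 1, 0, 'id assignment', '2019-01-01'),
  ('B', 2, 1, 'id change',     '2019-01-03'),
  ('A', 2, 1, 'id change',     '2019-01-02'),
  ('C', 4, 2, 'id change',     '2019-01-03');
""")

# Inner query: every person x date pair, plus the most recent date on or
# before it that has a change (id_date). Outer query: copy that change's id
# to every row sharing the same person and id_date.
filled = conn.execute("""
SELECT person, date,
       COALESCE(current_id,
                MAX(current_id) OVER (PARTITION BY person, id_date)) AS id
FROM (
  SELECT p.person, d.date, a.current_id,
         MAX(CASE WHEN a.current_id IS NOT NULL THEN d.date END)
           OVER (PARTITION BY p.person ORDER BY d.date) AS id_date
  FROM (SELECT DISTINCT person FROM tablea) AS p
  CROSS JOIN (SELECT DISTINCT date FROM tablea) AS d
  LEFT JOIN tablea AS a
    ON p.person = a.person AND d.date = a.date
) AS pd
ORDER BY person, date
""").fetchall()
for row in filled:
    print(row)
```

On this sample, days before a person's first recorded change (B and C on 2019-01-01 and 2019-01-02) come back with a NULL id, so reproducing Table B exactly would presumably also need previous_id for those days.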

SQL query For Latest date/time Stamp record for each ID

Please help me query the table below:
ID NAME DATE TIME STATUS
ID is unique; Name, Date, Time and Status keep changing in the database.
I need an output list with the latest STATUS, DATE and TIME stamps for each user ID.
I would use window functions for this:
select t.*
from (select t.*,
row_number() over (partition by id order by date desc, time desc) as seqnum
from t
) t
where seqnum = 1;
Alternatively, if you have a table with one row per customer, then apply might be best:
select t.*
from customers c cross apply
(select top (1) t.*
from t
where t.id = c.id
order by date desc, time desc
) t;
How about
SELECT T1.*
FROM T T1 INNER JOIN
(
SELECT ID,
CName,
MAX(CDate) CDate,
MAX(CTime) CTime
FROM T
GROUP BY ID,
CName
) T2
ON T1.CDate = T2.CDate
AND
T1.CTime = T2.CTime
AND T1.CName = T2.CName;
Which will return
+---------------------+----------+--------+----+-------+
| CDate | CTime | Status | ID | CName |
+---------------------+----------+--------+----+-------+
| 22/12/2018 00:00:00 | 16:27:57 | 1 | 1 | A |
| 21/12/2018 00:00:00 | 15:41:13 | 4 | 2 | B |
| 20/12/2018 00:00:00 | 12:35:27 | 3 | 2 | C |
| 21/12/2018 00:00:00 | 15:29:46 | 4 | 3 | D |
+---------------------+----------+--------+----+-------+
OR
SELECT T1.*
FROM T T1 INNER JOIN
(
SELECT ID,
MAX(CDate) CDate,
MAX(CTime) CTime
FROM T
GROUP BY ID
) T2
ON T1.CDate = T2.CDate
AND
T1.CTime = T2.CTime;
Which will return
+---------------------+----------+--------+----+-------+
| CDate | CTime | Status | ID | CName |
+---------------------+----------+--------+----+-------+
| 22/12/2018 00:00:00 | 16:27:57 | 1 | 1 | A |
| 21/12/2018 00:00:00 | 15:41:13 | 4 | 2 | B |
| 21/12/2018 00:00:00 | 15:29:46 | 4 | 3 | D |
+---------------------+----------+--------+----+-------+
SELECT * FROM table t0
WHERE C_Date =
(SELECT max(C_Date) FROM table t2 WHERE t2.ID = t0.ID)
AND C_Time =
(SELECT max(C_Time) FROM table t1
WHERE t1.ID = t0.ID AND t1.C_Date = t0.C_Date);
This gives you the entries for the highest C_Date and C_Time values for every ID

SQL Server get ids in one table on Date criteria in table one and table two

I am having problems getting the ids in TABLE A that satisfies the following criteria (I have tried a lot of different things and looked at various SO answers but cannot make it work. I looked into using OVER(PARTITION BY TABLE_B.calendar)):
1) Open (TABLE_B) should be equal to 1 on the first calendarDay (TABLE_B) on or after 10 days after startDate (TABLE_A).
2) endDate (TABLE_A) should be equal to the day found in 1) (i.e. the calendarDay for the respective id that satisfies the criteria).
Sample data:
TABLE_A:
+----+------------+------------+
| id | startDate | endDate |
+----+------------+------------+
| 1 | 2011-02-14 | 2011-03-14 |
| 2 | 2012-12-19 | 2013-01-20 |
| 3 | 2014-12-19 | 2015-01-21 |
+----+------------+------------+
TABLE_B:
+-------------+------+
| calendarDay | open |
+-------------+------+
| 2011-03-14 | 1 |
| 2011-03-16 | 0 |
| 2013-01-20 | 1 |
| 2013-01-21 | 1 |
| 2015-01-21 | 0 |
| 2015-01-22 | 1 |
+-------------+------+
Desired result:
+----+------------+------------+
| id | startDate | endDate |
+----+------------+------------+
| 1 | 2011-02-14 | 2011-03-14 |
| 2 | 2012-12-19 | 2013-01-20 |
+----+------------+------------+
I think you want:
select a.*
from TABLE_A a cross apply
(select top (1) b.*
from TABLE_B b
where b.[open] = 1 and b.calendarDay >= dateadd(day, 10, a.startDate)
order by b.calendarDay asc
) b
where b.calendarDay = a.endDate
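
As a quick check of this logic on the sample data, here is a hedged Python sketch using the standard-library sqlite3 module. SQLite has no CROSS APPLY, so a correlated subquery with LIMIT 1 stands in for it, and date(x, '+10 days') replaces DATEADD; both substitutions are assumptions made only so the example runs locally.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_A (id INTEGER, startDate TEXT, endDate TEXT);
CREATE TABLE TABLE_B (calendarDay TEXT, "open" INTEGER);
INSERT INTO TABLE_A VALUES
  (1, '2011-02-14', '2011-03-14'),
  (2, '2012-12-19', '2013-01-20'),
  (3, '2014-12-19', '2015-01-21');
INSERT INTO TABLE_B VALUES
  ('2011-03-14', 1), ('2011-03-16', 0),
  ('2013-01-20', 1), ('2013-01-21', 1),
  ('2015-01-21', 0), ('2015-01-22', 1);
""")

# Keep a row of TABLE_A only if its endDate equals the first open calendar
# day that falls on or after startDate + 10 days.
matches = conn.execute("""
SELECT a.id, a.startDate, a.endDate
FROM TABLE_A AS a
WHERE a.endDate = (
  SELECT b.calendarDay
  FROM TABLE_B AS b
  WHERE b."open" = 1
    AND b.calendarDay >= date(a.startDate, '+10 days')
  ORDER BY b.calendarDay
  LIMIT 1
)
ORDER BY a.id
""").fetchall()
for row in matches:
    print(row)
```

Row 3 drops out because its first open day on or after 2014-12-29 is 2015-01-22, which differs from its endDate of 2015-01-21.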
You could use a CTE to first get the first calendar day:
with cteId(n, id, [open])
as (
select ROW_NUMBER() over (partition by a.id order by b.calendarDay) n, a.id, b.[open]
from #TABLE_A a
inner join #TABLE_B b on b.calendarDay >= DATEADD(day, 10, a.startDate)
)
... then just join it with TABLE_A
select a.*
from #TABLE_A a
inner join cteId c on a.id = c.id
where c.n = 1 and c.[open] = 1
You can try this query, which uses EXISTS:
select a.*
from TABLE_A as a
where exists(
SELECT 1
FROM TABLE_B b
where
a.startDate <= DateAdd(day, 10, b.calendarDay) and b.[open] = 1
)
and exists(
SELECT 1
FROM TABLE_B b
where
a.endDate = b.calendarDay and b.[open] = 1
)
sqlfiddle:http://sqlfiddle.com/#!18/320111/15
Another way is to use a join:
select a.*
from TABLE_A as a
INNER JOIN
(
SELECT b.*,DateAdd(day, 10, b.calendarDay) addDay
FROM TABLE_B b
where b.[open] = 1
) b on a.startDate <= addDay and a.endDate = b.calendarDay
sqlfiddle:http://sqlfiddle.com/#!18/320111/19
You could probably use EXISTS here to look for the matching value:
Select *
From Table_A a
Where Exists (
Select 1
From Table_B b
Where b.[open] = 1
And b.calendarDay >= DateAdd(dd, 10, a.startDate)
And b.calendarDay = a.endDate
)