BigQuery BETWEEN JOIN - sql

I've got 2 tables in BigQuery that I'd like to join. Table 1 has integers, and table 2 has non-overlapping integer ranges (start, end). I'd like to join table 1 and 2 to give me something like this:
-- table 1
value
1
4
9
10
-- table 2
start, end
0,5
6,9
10,15
-- joined
value,start,end
1,0,5
4,0,5
9,6,9
10,10,15
I thought this query would work:
SELECT *
FROM
[table1] a
INNER JOIN [table2] b
ON a.value BETWEEN b.start AND b.end
But that gives me this error
ON clause must be AND of = comparisons of one field name from each
table, with all field names prefixed with table name
I can get the correct result with this CROSS JOIN query:
SELECT *
FROM
[table1] a
CROSS JOIN [table2] b
WHERE a.value BETWEEN b.start AND b.end
But the docs say this should be avoided if possible:
CROSS JOIN operations do not allow ON clauses. CROSS JOIN can return a
large amount of data and might result in a slow and inefficient query
or in a query that exceeds the maximum allowed per-query resources.
Such queries will fail with an error. When possible, prefer queries
that do not use CROSS JOIN
So, is it possible to do an INNER JOIN with a between, or improve the CROSS JOIN some other way?

This is a limitation of BigQuery Legacy SQL.
You should use BigQuery Standard SQL instead:
#standardSQL
SELECT *
FROM
`table1` a
INNER JOIN `table2` b
ON a.value BETWEEN b.start AND b.`end`
In standard SQL, use backticks instead of brackets around table names.
Also keep in mind that end is a reserved keyword, so it has to be enclosed in backticks as well, as in the ON clause above.
See below (along with dummy data from your question):
#standardSQL
WITH table1 AS (
SELECT value
FROM UNNEST([1, 4, 9, 10]) AS value
),
table2 AS (
SELECT chunk.start, chunk.`end`
FROM UNNEST([STRUCT<start INT64, `end` INT64>(0,5),(6,9),(10,15)]) AS chunk
)
SELECT *
FROM `table1` a
INNER JOIN `table2` b
ON a.value BETWEEN b.start AND b.`end`
-- ORDER BY value
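The range join can also be tried locally without a BigQuery project. The following is only a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in engine (not BigQuery), and it quotes the columns as "start" and "end" with double quotes because that is SQLite's quoting style. The table contents are the dummy data from the question.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery here; only the join logic is the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (value INTEGER);
    CREATE TABLE table2 ("start" INTEGER, "end" INTEGER);
    INSERT INTO table1 VALUES (1), (4), (9), (10);
    INSERT INTO table2 VALUES (0, 5), (6, 9), (10, 15);
""")

# INNER JOIN with a BETWEEN predicate instead of an equality predicate.
range_join_rows = conn.execute("""
    SELECT a.value, b."start", b."end"
    FROM table1 AS a
    INNER JOIN table2 AS b
      ON a.value BETWEEN b."start" AND b."end"
    ORDER BY a.value
""").fetchall()

for row in range_join_rows:
    print(row)
```

Because the ranges do not overlap, each value matches at most one range, so the join returns one row per matching value.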

Better way to do a correlated query with a count in the condition in AWS Athena SQL

There are two tables, A and B. Table A has a one-to-many relationship with B.
I want to fetch records from A and the one corresponding record from B (if B has a single record).
If there are multiple records in table B, then pick the one having status = 'ACTIVE' first.
Below is the query, which runs in Oracle, but we want the same functionality in AWS Athena; however, correlated queries are not supported in AWS Athena SQL. Athena supports ANSI SQL.
SELECT b.*
FROM A a ,B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)=1
)
UNION
SELECT * FROM
(SELECT b.*
FROM A a , B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id
FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
AND b2.status ='ACTIVE'
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)> 1
)
)
With GROUP BY, every selected field has to appear either in the GROUP BY clause or inside an aggregate function, so GROUP BY is not preferable.
Any help would be much appreciated.
Joining the best row can be achieved with a lateral join.
select *
from a
outer apply
(
select *
from b
where b.txn_report_instruction_id = a.instruction_id
order by case when b.status = 'ACTIVE' then 1 else 2 end
fetch first row only
) bb;
Another option is a window function:
select *
from a
left join
(
select
b.*,
row_number() over (partition by txn_report_instruction_id
order by case when status = 'ACTIVE' then 1 else 2 end) as rn
from b
) bb on bb.txn_report_instruction_id = a.instruction_id and bb.rn = 1;
I don't know about Amazon Athena's SQL coverage. This is all standard SQL, however, except for OUTER APPLY, I think. If I am not mistaken, the SQL standard requires LEFT OUTER JOIN LATERAL (...) ON ... instead, for which you need a dummy ON clause, such as ON 1 = 1. So if the above queries fail, there is another option for you :-)
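To check the window-function variant on a small data set, here is a hedged sketch. It assumes Python's built-in sqlite3 module as a stand-in for Athena or Oracle, and an SQLite build of version 3.25 or later (the first with window functions). The sample rows are made up for illustration: instruction 1 has a single inactive row, instruction 2 has two rows of which one is active, and instruction 3 has no row in b.

```python
import sqlite3

# Assumption: SQLite (>= 3.25 for window functions) stands in for Athena.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (instruction_id INTEGER);
    CREATE TABLE b (txn_report_instruction_id INTEGER, status TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 'INACTIVE'), (2, 'INACTIVE'), (2, 'ACTIVE');
""")

# Rank the rows of b per instruction, ACTIVE first, and keep only rank 1.
best_rows = conn.execute("""
    SELECT a.instruction_id, bb.status
    FROM a
    LEFT JOIN (
        SELECT b.*,
               ROW_NUMBER() OVER (
                   PARTITION BY txn_report_instruction_id
                   ORDER BY CASE WHEN status = 'ACTIVE' THEN 1 ELSE 2 END
               ) AS rn
        FROM b
    ) AS bb
      ON bb.txn_report_instruction_id = a.instruction_id AND bb.rn = 1
    ORDER BY a.instruction_id
""").fetchall()

for row in best_rows:
    print(row)
```

The LEFT JOIN keeps instruction 3 with a NULL status; switching to an inner join would drop rows of a that have no match in b.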

Does wrapping my Coalesce in a subquery make my query more efficient or does it do nothing?

Let's say I have a query where one field can appear in either Table A or Table B but not both. So to retrieve it I use Coalesce.
Something like
Select
...
Coalesce(A.Number,B.Number) Number
...
From Table A
Left Join Table B on A.C= B.C
Now let's say I want to join another table on that Number field.
Should I just do
Join Table Z on Z.Z = Coalesce(A.Number,B.Number)
Or is it better to wrap my original query in a subquery and join on the definite result? Something like
Select * from (
Select
...
Coalesce(A.Number,B.Number) Number
...
From Table A
Left Join Table B on A.C= B.C
) T
left join Table Z on Z.Number= T.Number
Does this make a difference?
If I were joining another table to the result of the first query, then instead of a subquery I would place the first part in a CTE whenever possible. I believe the performance would be the same as with a subquery, but CTEs are more readable in my opinion.
with cte1 as
(
Select
...
Coalesce(A.Number,B.Number) Number
...
From Table A
Left Join Table B
on A.C= B.C
)
select *
from cte1 a
Join Table Z
on Z.Z = a.number
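Whether the two forms perform differently depends on the engine's optimiser, so it is worth looking at the query plan on your own system. What can be checked easily is that they return the same rows. The sketch below assumes Python's built-in sqlite3 module as a stand-in database, and the tables a, b, z and their contents are hypothetical.

```python
import sqlite3

# Assumption: hypothetical tables a, b and z in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER, number INTEGER);
    CREATE TABLE b (c INTEGER, number INTEGER);
    CREATE TABLE z (z INTEGER, label TEXT);
    INSERT INTO a VALUES (1, 100), (2, NULL);
    INSERT INTO b VALUES (2, 200);
    INSERT INTO z VALUES (100, 'from a'), (200, 'from b');
""")

# Form 1: join directly on the COALESCE expression.
direct_rows = conn.execute("""
    SELECT COALESCE(a.number, b.number) AS number, z.label
    FROM a
    LEFT JOIN b ON a.c = b.c
    JOIN z ON z.z = COALESCE(a.number, b.number)
    ORDER BY 1
""").fetchall()

# Form 2: compute the COALESCE once in a CTE, then join on its result.
cte_rows = conn.execute("""
    WITH cte1 AS (
        SELECT COALESCE(a.number, b.number) AS number
        FROM a
        LEFT JOIN b ON a.c = b.c
    )
    SELECT t.number, z.label
    FROM cte1 AS t
    JOIN z ON z.z = t.number
    ORDER BY 1
""").fetchall()

print(direct_rows == cte_rows)
```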

Query Optimization : using (Union instead of OR) and (exists instead of null)

I have a query optimisation issue.
For context, this query has always run instantly,
but today it took far longer (3h+),
so I tried to fix it.
The query looks like this:
Select someCols from A
inner join B left join C
Where A.date = Today
And (A.col In ( Select Z.colseekedinA from tab Z) -- A.col is the same column
-- as the one below
OR
A.col In ( Select X.colseekedinA from tab X)
)
-- PART 1 ---
Select someCols from A
inner join B left join C -- takes 1 second 150 lines
Where A.date = Today
-- Part 2 ---
Select Z.colseekedinA from tab Z
OR -- Union -- takes 1 seconds 180 lines
Select X.colseekedinA from tab X
When I now combine the two parts with IN, the query becomes incredibly slow.
So I optimized it using UNION instead of OR and EXISTS instead of IN,
but it still takes 3 minutes.
I want to get it back down to 5 seconds.
Do you see an issue with the query?
Thank you.
Using Union and Exists
Select someCols
from A
inner join B on a.col = b.col
left join C on b.col = c.col
Where A.date = Today
and exists(
Select Z.colseekedinA from tab Z where Z.colseekedinA = A.col
Union
Select X.colseekedinA from tab X where x.colseekedinA = A.col )
Also, if possible change below join to Left join.
inner join B on a.col = b.col
An uncorrelated EXISTS may give spurious results, because every outer row is returned as soon as a single row matches anywhere. Correlating the subquery with the outer row avoids this, but it isn't something I have experimented with enough to recommend.
For speed I'd go for a cross apply and reference the parent table within the cross apply expression (a correlated subquery that creates a derived table). That way the join condition is applied before the data is returned. If the columns in question have indexes on them (e.g. they are primary keys), then the optimiser can work out an efficient plan for this.
Union all is used within the cross apply expression because it avoids the distinct sort within the derived table, which is generally more costly than bringing the data itself back (union has to identify all rows anyway, including duplicates).
Finally, if this is still slow, you might want to add an index on the date column in table A. The optimiser can then seek on the index rather than scan all of the rows and test whether or not the date equals today.
Select someCols from A
inner join B left join C
cross apply (Select Z.colseekedinA from tab Z where a.col=z.colseekedinA
union all
Select X.colseekedinA from tab X where a.col=x.colseekedina) d
Where A.date = Today
Your code is confusing, but for the first part
you could try using a UNION for the inner subqueries (the ones combined with OR)
and avoid the IN clause by using an INNER JOIN
Select someCols from A
inner join B
left join C
INNER JOIN (
Select Z.colseekedinA from tab Z
UNION
Select X.colseekedinA from tab X
) t on A.col = t.colseekedinA
Where A.date = Today
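Before comparing timings, it can help to confirm that the rewrites return the same rows as the original IN ... OR ... filter. The sketch below is an illustration under assumptions: it uses Python's built-in sqlite3 module rather than the asker's database, leaves out the joins to B and C and the date filter, and uses hypothetical tables z and x in place of "tab Z" and "tab X".

```python
import sqlite3

# Assumption: simplified stand-ins for A, "tab Z" and "tab X" in SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE z (colseekedinA INTEGER);
    CREATE TABLE x (colseekedinA INTEGER);
    INSERT INTO a VALUES (1), (2), (3), (4);
    INSERT INTO z VALUES (1), (2);
    INSERT INTO x VALUES (2), (3);
""")

# Original shape: two IN subqueries combined with OR.
in_or_rows = conn.execute("""
    SELECT a.col FROM a
    WHERE a.col IN (SELECT colseekedinA FROM z)
       OR a.col IN (SELECT colseekedinA FROM x)
    ORDER BY a.col
""").fetchall()

# Rewrite 1: one correlated EXISTS over a UNION ALL of both lookups.
exists_rows = conn.execute("""
    SELECT a.col FROM a
    WHERE EXISTS (
        SELECT 1 FROM z WHERE z.colseekedinA = a.col
        UNION ALL
        SELECT 1 FROM x WHERE x.colseekedinA = a.col
    )
    ORDER BY a.col
""").fetchall()

# Rewrite 2: INNER JOIN against the de-duplicated UNION of both lookups.
union_join_rows = conn.execute("""
    SELECT a.col FROM a
    INNER JOIN (
        SELECT colseekedinA FROM z
        UNION
        SELECT colseekedinA FROM x
    ) AS t ON a.col = t.colseekedinA
    ORDER BY a.col
""").fetchall()

print(in_or_rows, exists_rows, union_join_rows)
```

The value 2 appears in both lookup tables here. In the join form, UNION removes that duplicate; with UNION ALL the joined result would contain the row for 2 twice, which is a difference to keep in mind when swapping one for the other.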

Standard SQL: LEFT JOIN by two conditions using BETWEEN

I have the following query in BigQuery:
#Standard SQL
SELECT *
FROM `Table_1`
LEFT JOIN `Table_2` ON (timestamp BETWEEN TimeStampStart AND TimeStampEnd)
But I get the following Error:
Error: LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
If I use JOIN instead of LEFT JOIN, it works, but I want to keep all the rows from Table_1 (so also the ones which aren't matched to Table_2)
How to achieve this?
This is absolutely stupid... but the same query will work if you add a condition that matches a column from table1 with a column from table2:
WITH Table_1 AS (
SELECT CAST('2018-08-15' AS DATE) AS Timestamp, 'Foo' AS Foo
UNION ALL
SELECT CAST('2018-09-15' AS DATE), 'Foo'
), Table_2 AS (
SELECT CAST('2018-08-14' AS DATE) AS TimeStampStart, CAST('2018-08-16' AS DATE) AS TimeStampEnd, 'Foo' AS Bar
)
SELECT *
FROM Table_1
LEFT JOIN Table_2 ON Table_1.Foo = Table_2.Bar AND Table_1.Timestamp BETWEEN Table_2.TimeStampStart AND Table_2.TimeStampEnd
See if you have additional matching criteria that you can use (like another column that links table1 and table2 on equality).
A LEFT JOIN is always equivalent to the UNION of:
the INNER JOIN between the same two arguments on the same join predicate, and
the set of rows from the first argument for which no matching row is found (and properly extended with null values for all columns retained from the second argument).
That latter portion can be written as
SELECT T1.*, null as T2_C1, null as T2_C2, ...
FROM T1
WHERE NOT EXISTS (SELECT * FROM T2 WHERE <join predicate>)
So if you spell out the UNION you should be able to get there.
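The decomposition can be spelled out on a small example. This is a sketch under assumptions: it runs on Python's built-in sqlite3 module instead of BigQuery, and the tables t1 and t2 with integer timestamps are hypothetical. It uses UNION ALL because the two branches cannot produce the same row.

```python
import sqlite3

# Assumption: hypothetical tables t1 and t2 with integer timestamps in SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (ts INTEGER);
    CREATE TABLE t2 (ts_start INTEGER, ts_end INTEGER, bar TEXT);
    INSERT INTO t1 VALUES (15), (45);
    INSERT INTO t2 VALUES (14, 16, 'Foo');
""")

# Reference result: a plain LEFT JOIN on the range predicate.
left_join_rows = conn.execute("""
    SELECT t1.ts, t2.ts_start, t2.ts_end, t2.bar
    FROM t1
    LEFT JOIN t2 ON t1.ts BETWEEN t2.ts_start AND t2.ts_end
    ORDER BY 1
""").fetchall()

# Same result as INNER JOIN plus the unmatched t1 rows padded with NULLs.
union_rows = conn.execute("""
    SELECT t1.ts, t2.ts_start, t2.ts_end, t2.bar
    FROM t1
    INNER JOIN t2 ON t1.ts BETWEEN t2.ts_start AND t2.ts_end
    UNION ALL
    SELECT t1.ts, NULL, NULL, NULL
    FROM t1
    WHERE NOT EXISTS (
        SELECT 1 FROM t2 WHERE t1.ts BETWEEN t2.ts_start AND t2.ts_end
    )
    ORDER BY 1
""").fetchall()

print(left_join_rows == union_rows)
```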
Interesting. This works for me in standard SQL:
select *
from (select 1 as x) a left join
(select 2 as a, 3 as b) b
on a.x between b.a and b.b
I suspect you are using legacy SQL. So switch to standard SQL. (And drop the parentheses around the between condition.)
The problem is:
#(Standard SQL)#
This doesn't do anything. Use:
#StandardSQL
Hi, as per the documentation, "(" has a special meaning, so please try without the parentheses.
SELECT * FROM Table_1
LEFT JOIN Table_2 ON Table_1.timestamp >= Table_2.TimeStampStart AND Table_1.timestamp <= Table_2.TimeStampEnd
Documentation here

BigQuery Full outer join producing "left join" results

I have 2 tables, both of which contain distinct id values. Some of the id values might occur in both tables and some are unique to each table. Table1 has 10,910 rows and Table2 has 11,304 rows
When running an inner join query:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
I get a total of 10,896 rows or 10,896 ids shared across both tables.
However, when I run a FULL OUTER JOIN on the 2 tables like this:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
I get total of 10,896 rows, but I was expecting all 10,910 rows from table1.
I am wondering if there is an issue with my query syntax.
As you are using EACH, it looks like you are running your queries in Legacy SQL mode.
In BigQuery Legacy SQL, the COUNT(DISTINCT) function is probabilistic: it gives a statistical approximation and is not guaranteed to be exact.
You can use the EXACT_COUNT_DISTINCT() function instead. It gives you the exact number but is a little more expensive on the back end.
An even better option is to just use Standard SQL.
For your specific queries you only need to remove the EACH keyword and it should work like a charm:
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
and
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN table2 b on a.id = b.id
I wrapped the original query in a subquery and counted the ids, which produced the expected results. Still a little strange, but it works.
SELECT EXACT_COUNT_DISTINCT(a_id)
FROM
(SELECT a.id AS a_id,
b.id AS b_id
FROM table1 a FULL OUTER JOIN EACH table2 b on a.id = b.id)
That is because in both cases count(distinct a.id) counts only the non-null ids of table a.
Use count(*) and it should work.
You will have to add coalesce: rows that exist only in table2 have a NULL a.id, and COUNT(DISTINCT a.id) ignores NULLs.
SELECT COUNT(DISTINCT coalesce(a.id,b.id))
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
This query will now take full effect of full outer join :)
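To see what exact counts to expect from each query, the joins can be emulated on a tiny example. The sketch below is plain Python with made-up id sets (no BigQuery involved), and None plays the role of NULL. It assumes, as in the question, that ids are distinct within each table.

```python
# Assumption: small made-up id sets standing in for table1 and table2.
table1_ids = {1, 2, 3, 4}
table2_ids = {3, 4, 5}

# Emulate FULL OUTER JOIN ... ON a.id = b.id as (a_id, b_id) pairs.
full_outer = [
    (i if i in table1_ids else None, i if i in table2_ids else None)
    for i in sorted(table1_ids | table2_ids)
]

# Inner join keeps only pairs where both sides are present.
inner_count = len({a for a, b in full_outer if a is not None and b is not None})

# COUNT(DISTINCT a.id) over the full outer join ignores NULL a.id values.
count_distinct_a = len({a for a, _ in full_outer if a is not None})

# COUNT(DISTINCT COALESCE(a.id, b.id)) counts every id from either table.
count_distinct_coalesce = len({a if a is not None else b for a, b in full_outer})

print(inner_count, count_distinct_a, count_distinct_coalesce)
```

With exact counting, COUNT(DISTINCT a.id) over the full outer join equals the number of ids in table1, so the asker's expectation of 10,910 was reasonable. The COALESCE form instead counts the ids found in either table.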