How can I join two tables using intervals in Google Big Query? - google-bigquery

I have identified a solution for finding an area within a bounding box/circle using a cross join, as below:
SELECT A.ID, C.Car
FROM Cars C
CROSS JOIN Areas A
WHERE C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax
at:
How to cross join in Big Query using intervals?
However, using a cross join for large data sets is blocked by the BigQuery ops team due to constraints on the infrastructure.
Hence, my question: how could I find the set of lat/longs in a large data table (table A) that fall within a set of bounding boxes in another, small table (table B)?
My query as below has been blocked:
select a.a1, a.a2, a.mdl, b.name, count(1) count
from TableMaster a
CROSS JOIN places_locations b
where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
group by ....
TableMaster is 538 GB with 6,658,716,712 rows (cleaned/absolute minimum)
places_locations varies per query, around 5 to 100 KB.
I have tried to adapt a fake join based on a template:
How to improve performance of GeoIP query in BigQuery?
However, the query runs for an hour and produces neither results nor errors.
Could you identify a possible path to solve this puzzle at all?

The problem you're seeing is that the cross join generates too many intermediate values (6 billion x 1k = 6 trillion).
The way to work around this is to generate fewer outputs. If you have additional filters you can apply, you should try applying them before you do the join. If you could do the group by (or part of it) before the join, that would also help.
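For example, if the overall extent of all your boxes is known up front, you could filter and pre-aggregate TableMaster before the cross join; a rough sketch in the question's legacy-SQL style (the extent values are placeholders, and pre-grouping on exact lat/long pairs only helps if many rows share the same coordinates):
select a.a1, a.a2, a.mdl, b.name, sum(a.cnt) count
from (
  select a1, a2, mdl, lat, long, count(1) cnt
  from TableMaster
  -- keep only points inside the overall extent of all boxes (placeholder values)
  where lat BETWEEN 49.0 AND 61.0
    and long BETWEEN -8.0 AND 2.0
  group each by a1, a2, mdl, lat, long
) a
CROSS JOIN places_locations b
where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  and (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
group each by a.a1, a.a2, a.mdl, b.name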
Moreover, for doing the lookup, you could do a more coarse-grained lookup first. That is, if you could do an initial cross join with a smaller table that has coarse-grained regions, then you could join against the larger table on region id rather than doing a cross join.
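A rough sketch of that two-step idea, again in the question's legacy-SQL style: assign every point and every box to a coarse grid cell (here a hypothetical 1-degree grid via FLOOR), join on the cell id with an ordinary equi-join, and only then apply the exact BETWEEN filter. Note that a box spanning more than one grid cell would need one row per cell it touches:
select a.a1, a.a2, a.mdl, b.name, count(1) count
from (
  select a1, a2, mdl, lat, long,
         INTEGER(FLOOR(lat)) lat_cell, INTEGER(FLOOR(long)) long_cell
  from TableMaster
) a
JOIN (
  select name, top_left_lat, top_left_long, bottom_right_lat, bottom_right_long,
         INTEGER(FLOOR(bottom_right_lat)) lat_cell, INTEGER(FLOOR(top_left_long)) long_cell
  from places_locations
) b
ON a.lat_cell = b.lat_cell AND a.long_cell = b.long_cell
where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  and (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
group each by a.a1, a.a2, a.mdl, b.name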

Okay, so the fake join does work in the end. Solution:
select a.B, a.C, count(1) count
from (
  SELECT B, C, A, lat, long
  FROM [GB_Data.PlacesMasterA]
  WHERE not B is null
) a
JOIN (
  SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A
  FROM [Places.placeABOXA]
) b ON a.A = b.A
where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
group each by B, C

Related

Oracle: Use only a few tables in the WHERE clause but mention more tables in the FROM clause of a join SQL

What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) has just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
Any table lacking join definition will result in a Cartesian product - every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and it joins without any join predicate to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
If you don't want to use a table, exclude it entirely from your SQL. If you can't, for some reason due to a constraint we don't know about, then include the proper join predicates for every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join and you don't need anything from it, and again for some very strange reason can't leave the table out of your SQL completely (this does occasionally happen in reusable code), then you can disable the joins by making the predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
data b,
data c,
data d
WHERE a.col = b.col
AND DECODE('Y','Y',NULL,a.col) = c.col(+)
AND DECODE('Y','Y',NULL,a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = c.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first Y that you set to Y or N (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate, this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables is the right approach 95% of the time.
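For example, with a bind variable (here the hypothetical :var_disable_join, set to 'Y' or 'N'), the ANSI version becomes:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT a.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE(:var_disable_join, 'Y', NULL, a.col) = c.col
LEFT OUTER JOIN data d ON DECODE(:var_disable_join, 'Y', NULL, a.col) = d.col;
-- With 'Y' the DECODE yields NULL, the predicates can never be true, and the
-- outer joins to c and d are effectively skipped; with 'N' they join on a.col.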

which is faster, left join the whole table, or left join a subselection of said table?

I have two tables (A & B). My objective is to have some columns from A left joined with a few columns of B (both tables have a LOT of columns).
Is it faster to:
A) Select A -> left join -> subselect B:
(selecting only the desired columns BEFORE the join)
SELECT * FROM (
  SELECT A.col_1, A.col_2, A.col3, A.col_b
  FROM A
  LEFT JOIN (
    SELECT B.col_1, B.col_2, B.col_a FROM B
  ) B_temp
    ON A.col_b = B_temp.col_a
) joined
B) Select A -> left join -> B:
(selecting only the desired columns AFTER the join)
SELECT A.col_1,A.col_2,A.col3,B.col_1,B.col_2 FROM A
LEFT JOIN B
ON A.col_b = B.col_a
My gut tells me that even though the second option is way more readable, it might be worse, since it first agglutinates everything, moving a lot of data around. My consideration for this is:
If the left join returns many results, the simple/trivial approach (option B) might have to carry all these extra unnecessary columns.
Am I going the right way towards optimizing this SQL query?
Unless your SQL software is old and moldy, its query planner will handle your two example queries the same way.
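If you want to verify that on your own system, compare the execution plans of the two forms; a minimal sketch, assuming a DBMS with an EXPLAIN statement (e.g. PostgreSQL or MySQL) and the column names from the question:
-- Run EXPLAIN on both variants; the plans should come out the same (or equivalent).
EXPLAIN
SELECT A.col_1, A.col_2, A.col3, B_temp.col_1, B_temp.col_2
FROM A
LEFT JOIN (SELECT B.col_1, B.col_2, B.col_a FROM B) B_temp
  ON A.col_b = B_temp.col_a;

EXPLAIN
SELECT A.col_1, A.col_2, A.col3, B.col_1, B.col_2
FROM A
LEFT JOIN B ON A.col_b = B.col_a;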

In PostgreSQL, return rows with unique values of one column based on the minimum value of another

Background
I've got this PostgreSQL join that works pretty well for me:
select m.id,
m.zodiac_sign,
m.favorite_color,
m.state,
c.combined_id
from people."People" m
LEFT JOIN people.person_to_person_composite_crosstable c on m.id = c.id
As you can see, I'm joining two tables to bring in a combined_id, which I need for later analysis elsewhere.
The Goal
I'd like to write a query that does so by picking the combined_id that's got the lowest value of m.id next to it (along with the other variables too). This ought to result in a new table with unique/distinct values of combined_id.
The Problem
The issue is that the current query returns ~300 records, but I need it to return ~100. Why? Each combined_id has, on average, 3 different m.id's. I don't actually care about the m.id's; I care about getting a unique combined_id. Because of this, I decided that a good "selection criterion" would be to select rows based on the lowest value m.id for rows with the same combined_id.
What I've tried
I've consulted several posts on this and I feel like I'm fairly close. See for instance this one or this one. This other one does exactly what I need (with MAX instead of MIN) but he's asking for it in Unix Bash 😞
Here's an example of something I've tried:
select m.id,
m.zodiac_sign,
m.favorite_color,
m.state,
c.combined_id
from people."People" m
LEFT JOIN people.person_to_person_composite_crosstable c on m.id = c.id
WHERE m.id IN (select min(m.id))
This returns the error ERROR: aggregate functions are not allowed in WHERE.
Any ideas?
Postgres's DISTINCT ON is probably the best approach here:
SELECT DISTINCT ON (c.combined_id)
m.id,
m.zodiac_sign,
m.favorite_color,
m.state,
c.combined_id
FROM people."People" m
LEFT JOIN people.person_to_person_composite_crosstable c
ON m.id = c.id
ORDER BY
c.combined_id,
m.id;
As for performance, the following index on the crosstable might speed up the query:
CREATE INDEX idx ON people.person_to_person_composite_crosstable (id, combined_id);
If used, the above index should let the join happen faster. Note that I cover the combined_id column, which is required by the select.

Inner join and cross apply. How does it get evaluated?

If an inner join is: for each row in the left table, find the rows in the right table where the condition is met.
What is cross apply? I have read that it's just an inner join which gets evaluated row by row, but isn't an inner join also evaluated row by row?
How do you explain cross apply in plain English? Is it just an inner join that allows more complicated joins?
APPLY is different from JOIN in that it allows for correlated subqueries. For instance:
SELECT ...
FROM outer
APPLY (
SELECT ..
FROM inner WHERE outer.column = inner.column
)
At first this does not seem like much of a difference, until you consider relational functions. Since APPLY accepts correlations from the other side, it means you can pass it values as arguments to the function:
SELECT ...
FROM outer
APPLY function(outer.column)
This is something not possible with JOIN.
CROSS vs. OUTER works the same as with JOINs.
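A concrete illustration in SQL Server syntax (the Orders table and its comma-separated Tags column are invented; STRING_SPLIT is a built-in table-valued function since SQL Server 2016):
-- The function receives a value from the current outer row, which a JOIN cannot do.
SELECT o.OrderID, s.value AS Tag
FROM Orders AS o
CROSS APPLY STRING_SPLIT(o.Tags, ',') AS s;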
Inner Join (or simply join):
Given 2 tables, A, and B, and a condition C that correlates A and B (most commonly, an equality relation between 2 fields, one from A, and one from B), joining table A with B based on C means that, for each row in A, check for rows in B where C is met - and return them.
Translating it into an example:
SELECT * FROM A inner join B on A.field1 = B.field5
Here, for each row in A, check for rows in B where A's field1 equals B's field5.
Return all such rows.
Cross Join:
A join that is not based on an explicit condition - rather - it combines every row from A with every row from B, and returns such rows.
Assuming A has 10 rows and B 20, you would get a result set of 200 rows.
Cross Apply: (which I have just learned about thanks to you :)
A cross apply is indeed related to a cross join, hence it bears "cross" in its name as well. What happens in a cross apply, as far as I understand, is:
Given a table A, and a function F, for each row selected by a given select statement from A, cross join it with the results of F.
Let's say A has 10 rows, and F is simply a function that returns 3 constant rows, like
1
2
3
For each one of the 10 rows from A, you will cross join the 3 resulting rows from F, resulting in a result set of 30 rows.
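A tiny sketch of that 10 x 3 = 30 example in SQL Server syntax, using a VALUES row constructor to stand in for the function F (table and column names are invented):
-- Every row of A is paired with each of the 3 constant rows, giving 30 rows in total.
SELECT a.id, f.n
FROM A AS a
CROSS APPLY (VALUES (1), (2), (3)) AS f(n);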
Now, as for the purpose this statement was created for, I don't think I can help much.
What I can think of, after reading some SO threads, is that it provides performance gains in such cross join operations (you could achieve the same results without using a function such as F and the "Cross-Apply").
This post provides an example of a scenario where such performance gain is achieved.

Google BigQuery; use subselect result in outer select; cross join

I have a query that results in a one-row table, and I need to use this result in a subsequent computation. Here is a non-working, simplified example (just to depict what I'm trying to achieve):
SELECT amount / (SELECT SUM(amount) FROM [...]) FROM [...]
I tried some nested sub-selects and joins (cross join of the one row table with the other table) but didn't find any working solution. Is there a way to get this working in BigQuery?
Thanks, Radek
EDIT:
OK, I found a solution:
select
t1.x / t2.y as z
from
(select 1 as k, amount as x from [...] limit 10) as t1
join
(select 1 as k, sum(amount) as y from [...]) as t2
on
t1.k = t2.k;
but I'm not sure if this is the best way to do it...
With the recently announced ratio_to_report() window function:
SELECT RATIO_TO_REPORT(amount) OVER() AS z
FROM [...]
ratio_to_report takes the amount and divides it by the sum of the amounts over all result rows.
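If RATIO_TO_REPORT isn't available (for instance in BigQuery standard SQL), a plain window SUM gives the same ratio; a minimal sketch with a placeholder table name:
SELECT amount / SUM(amount) OVER () AS z
FROM `your_dataset.your_table`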
The way you've found (essentially a cross join using a dummy key) is the best way I know of to do this query. We've thought about adding an explicit cross join operator to make it easier to see how to do this, but a cross join can get expensive if not done correctly (e.g. joining two large tables can create n^2 results).