Why is there no join pruning in Google BigQuery?

I'm new to BigQuery.
I tried a query (standard SQL) in BigQuery that is basically:
select T1.C1,T1.C2
from T1 left outer join T2
on T1.C1 = T2.C3
In other RDBMSs that I know, join pruning happens: the left outer join is not executed at all, since no result field/column comes from T2 (the right side of the left outer join).
However, in BigQuery, based on the execution details in the query history, the left outer join always happens, whether or not the result set contains fields from the right side.
Could anyone confirm whether BigQuery is designed like this?
Is there any way to avoid the join when no field is selected from the right side?
Thanks!
Matt
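One general reason an engine cannot always skip such a join: a left outer join can multiply rows of T1 when T2.C3 is not unique, so dropping the join is only safe when the engine knows that C3 is unique. A minimal sketch of that effect, assuming SQLite (via Python's sqlite3 module) as a stand-in rather than BigQuery, with made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (C1 INTEGER, C2 TEXT);
CREATE TABLE T2 (C3 INTEGER);
INSERT INTO T1 VALUES (1, 'a'), (2, 'b');
-- Hypothetical data: C3 is NOT unique, the value 1 appears twice.
INSERT INTO T2 VALUES (1), (1);
""")

# The query from the question: no column of T2 is selected.
joined = conn.execute("""
    SELECT T1.C1, T1.C2
    FROM T1 LEFT OUTER JOIN T2 ON T1.C1 = T2.C3
    ORDER BY T1.C1
""").fetchall()

# What a "pruned" plan would compute: T1 alone.
pruned = conn.execute("SELECT C1, C2 FROM T1 ORDER BY C1").fetchall()

print(joined)  # the row for C1 = 1 appears twice
print(pruned)  # each T1 row appears once
```

With duplicate keys on the right side the two results differ, so the join cannot be removed without changing the answer.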

Related

joining three tables between each other

Is it possible to join three tables in this way?
select T1.[...],T2.[...],T3.[...]
from T1
full outer join T2 on T1.[key]=T2.[key]
full outer join T3 on T1.[key]=T3.[key]
full outer join T2 on T2.[key]=T3.[key]
My question is: is this a valid form?
And if not, is there a way to do such an operation?
It is "valid" but the full joins are not correct. The on conditions will change them to some other type of join.
Your query has other errors. But I speculate that you want:
select T1.[...], T2.[...], T3.[...]
from T1 full join
T2
on T2.[key] = T1.[key] full join
T3
on T3.[key] = coalesce(T2.[key], T1.[key]);
It is possible to join three tables, and your example could run with some changes, but you have syntax and scoping errors in the FROM clause.
Even those aside, I don't think it will do what you intend it to do. You'll probably want to use GROUP BY.
See the examples / discussion here:
Multiple FULL OUTER JOIN on multiple tables
I also used this site as a source, as it's been a while since I've touched SQL; it may be helpful to you also:
https://learnsql.com/blog/how-to-join-3-tables-or-more-in-sql/

Cross join behaviour (SQLServer 2008)

I have been trying to track down a problem with a query I have. The query is actually generated by Hibernate from HQL, but the resulting SQL doesn't do what I expect. Modifying the SQL slightly produces the correct result, but I'm not sure why the modification should make any difference.
Original query (returns no rows)
select sched.id, max(txn.dttm), acc.id
from PaymentSchedulePeriod sched
cross join PaymentSchedulePayment pay
right outer join AccountTransaction txn on pay.accountTransactionFk=txn.id
right outer join Account acc on txn.accountFk=acc.id
where sched.accountFk=acc.id
group by sched.id, acc.id
Modified query - cross join replaced by a comma (implicit cross join)
Returns one row
select sched.id, max(txn.dttm), acc.id
from PaymentSchedulePeriod sched
,PaymentSchedulePayment pay
right outer join AccountTransaction txn on pay.accountTransactionFk=txn.id
right outer join Account acc on txn.accountFk=acc.id
where sched.accountFk=acc.id
group by sched.id, acc.id
My understanding, which may be incorrect, is that writing from Table1 a, Table2 b is the same as writing from Table1 a cross join Table2 b. So I don't understand why the queries return different results.
Is it something to do with the interaction between the cross join and the outer joins in the first query that causes this? I've looked at the query plans and the second query plan looks reasonable. The first one has no outer joins at all, which is strange.
This is on SQLServer 2008.
JOIN has a higher precedence than a COMMA, so your second statement is interpreted as (note the parens I added):
select sched.id, max(txn.dttm), acc.id
from PaymentSchedulePeriod sched
,(PaymentSchedulePayment pay
right outer join AccountTransaction txn on pay.accountTransactionFk=txn.id
right outer join Account acc on txn.accountFk=acc.id)
where sched.accountFk=acc.id
group by sched.id, acc.id
See also: JOIN precedence rules per SQL-99
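The effect of the two groupings can be reproduced on a small data set. The sketch below is an emulation under stated assumptions, not the original schema: the table names are shortened to sched, pay, txn and acc, the data is made up (one account, one schedule period, no payments or transactions), and it runs on SQLite through Python's sqlite3 module. Each RIGHT JOIN is rewritten as a LEFT JOIN with the operands swapped, and subqueries make the grouping explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sched (id INTEGER, accountFk INTEGER);
CREATE TABLE pay   (id INTEGER, accountTransactionFk INTEGER);
CREATE TABLE txn   (id INTEGER, accountFk INTEGER, dttm TEXT);
CREATE TABLE acc   (id INTEGER);
INSERT INTO acc   VALUES (1);
INSERT INTO sched VALUES (10, 1);
""")

# Grouping 1: ((sched CROSS JOIN pay) RIGHT JOIN txn) RIGHT JOIN acc
explicit_cross = conn.execute("""
SELECT x.sched_id, MAX(x.dttm), acc.id
FROM acc
LEFT JOIN (
    SELECT sp.sched_id AS sched_id, sp.sched_acc AS sched_acc,
           txn.accountFk AS txn_acc, txn.dttm AS dttm
    FROM txn
    LEFT JOIN (
        SELECT sched.id AS sched_id, sched.accountFk AS sched_acc,
               pay.accountTransactionFk AS pay_txn
        FROM sched CROSS JOIN pay
    ) AS sp ON sp.pay_txn = txn.id
) AS x ON x.txn_acc = acc.id
WHERE x.sched_acc = acc.id
GROUP BY x.sched_id, acc.id
""").fetchall()

# Grouping 2: sched , ((pay RIGHT JOIN txn) RIGHT JOIN acc)
comma = conn.execute("""
SELECT sched.id, MAX(y.dttm), y.acc_id
FROM sched
CROSS JOIN (
    SELECT acc.id AS acc_id, t.dttm AS dttm
    FROM acc
    LEFT JOIN (
        SELECT txn.accountFk AS txn_acc, txn.dttm AS dttm
        FROM txn
        LEFT JOIN pay ON pay.accountTransactionFk = txn.id
    ) AS t ON t.txn_acc = acc.id
) AS y
WHERE sched.accountFk = y.acc_id
GROUP BY sched.id, y.acc_id
""").fetchall()

print(explicit_cross)  # no rows
print(comma)           # one row, with NULL for the max date
```

In the first grouping the empty pay table empties the cross join, so the sched columns are NULL after the outer joins and the WHERE clause rejects every row. In the second grouping sched is combined with the already outer-joined result, so the account row survives.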
Without looking at the actual data and query plans, I'd say (ok, guess) it has to do with the way the optimizer builds the query plans.
In the first, it is more or less explicitly told to "take the first table, cross join it with the second, then right join in the third, then right join in the fourth"
In the second, that cross join is (at least to my way of thinking) implicit. This is "old" SQL syntax from the days when all joins were performed in the WHERE clause, which--again, to my way of thinking--means that the database engine was free to work out on its own the order in which to process tables. Or, in other words, SQL is not being given a specific order in which to join tables. (With inner joins and cross joins, it makes no difference, but with outer joins, it can make a huge difference.)
...I prefer @Joe's answer (upvoted), as it's technically accurate. I'm tossing in my own anyway just for detail's sake.

Join clause joining 3 tables in same criteria

I've seen a join just like this:
Select <blablabla>
from
TableA TA
Inner join TableB TB on Ta.Id = Tb.Id
Inner join TableC TC on Tc.Id = Tb.Id and Ta.OtheriD = Tc.OtherColumn
But what's the point (end effect) of that second join clause?
What are the implications when an outer join clause is used?
And, more important, what is the best way to rewrite it so that it is easy
to understand what it's trying to join?
And, more important, what is the best way to rewrite it to get rid of the construction
and maintain the correctness of the query?
I don't specify the RDBMS, because it's a more generic question, but for those
curious (since people always ask): it's SQL Server 2005.
EDIT: It's just a made-up example (since I would have to dig up the original source, which I no longer have access to). I found the original join clause in a 10-join SELECT command.
It simply means you have an extra restriction on the intersection between tablea and tablec.
Because we know Ta.Id = Tb.Id, Tc.Id = Tb.Id is the same as Tc.Id = Ta.Id. Inner joins are associative. So it makes more sense written like this, so that each join is between 2 tables only:
Select <blablabla>
from
TableB TB
Inner join
TableA TA on Tb.Id = Ta.Id --a and b intersection
Inner join
TableC TC on Ta.Id = Tc.Id and Ta.OtheriD = Tc.OtherColumn --a and c intersection
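A quick way to check that the rewrite is equivalent for inner joins is to run both forms on sample data. This is a minimal sketch assuming SQLite (via Python's sqlite3 module) and made-up rows; the column types are assumptions, since the question does not give a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (Id INTEGER, OtheriD INTEGER);
CREATE TABLE TableB (Id INTEGER);
CREATE TABLE TableC (Id INTEGER, OtherColumn INTEGER);
INSERT INTO TableA VALUES (1, 100), (2, 200);
INSERT INTO TableB VALUES (1), (2);
-- Row 2 matches on Id but not on OtherColumn.
INSERT INTO TableC VALUES (1, 100), (2, 999);
""")

# Form from the question: the second ON clause mentions all three tables.
original = conn.execute("""
    SELECT TA.Id
    FROM TableA TA
    INNER JOIN TableB TB ON TA.Id = TB.Id
    INNER JOIN TableC TC ON TC.Id = TB.Id AND TA.OtheriD = TC.OtherColumn
    ORDER BY TA.Id
""").fetchall()

# Rewritten form: each ON clause relates exactly two tables.
rewritten = conn.execute("""
    SELECT TA.Id
    FROM TableB TB
    INNER JOIN TableA TA ON TB.Id = TA.Id
    INNER JOIN TableC TC ON TA.Id = TC.Id AND TA.OtheriD = TC.OtherColumn
    ORDER BY TA.Id
""").fetchall()

print(original, rewritten)  # both contain only the row with Id = 1
```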
Your Q : But what's the point (end effect) of that second join clause?
It effectively filters rows. You could move the second half of the ON clause into the WHERE clause if you really want; it only really affects readability. gbn's answer looks good for this 3-table example, but to expand on it: sometimes a rewrite like this isn't possible. I have seen an occasion where 2 different systems (one Oracle 8i and one SQL Server 2000) had their databases joined together. A 3-part key was identified as being required to make the records unique in both systems, but each component of the 3-part key was held in a different table, so the final result had a few joins like that.
Functionally...I'm not sure if there's a difference really. Unless I'm completely off, readability seems to be the biggest difference.
Your second Q: What are the implications when an outer join clause is used?
You'll potentially get a bunch of nulls (depending on how you set up the outer join) where the inner join would have dropped the rows. Be careful though: inner joins are associative, but as gbn put it, an OUTER JOIN is different and order does matter.
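The difference shows up as soon as the extra condition is moved between ON and WHERE in an outer join. The sketch below assumes SQLite (via Python's sqlite3 module), a LEFT JOIN in place of the question's INNER JOIN, and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (Id INTEGER, OtheriD INTEGER);
CREATE TABLE TableC (Id INTEGER, OtherColumn INTEGER);
INSERT INTO TableA VALUES (1, 100), (2, 200);
-- Row 2 matches on Id but not on OtherColumn.
INSERT INTO TableC VALUES (1, 100), (2, 999);
""")

# Both conditions in ON: unmatched TableA rows are kept, padded with NULLs.
in_on = conn.execute("""
    SELECT TA.Id, TC.Id
    FROM TableA TA
    LEFT JOIN TableC TC ON TC.Id = TA.Id AND TA.OtheriD = TC.OtherColumn
    ORDER BY TA.Id
""").fetchall()

# Second condition in WHERE: it runs after the join and drops the row again.
in_where = conn.execute("""
    SELECT TA.Id, TC.Id
    FROM TableA TA
    LEFT JOIN TableC TC ON TC.Id = TA.Id
    WHERE TA.OtheriD = TC.OtherColumn
    ORDER BY TA.Id
""").fetchall()

print(in_on)     # [(1, 1), (2, None)]
print(in_where)  # [(1, 1)]
```

So for inner joins the placement is a readability choice, while for outer joins it changes the result.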
The user may want to further filter the set of rows which are included in the join set...
The point of the second join is to further limit your result set based on the contents of TableC. The first join gives you ONLY records that exist in TA and TB. The second join gives you ONLY results from the first join that also exist in TC.

TABLE1 T1, TABLE2 T2 WHERE T1.Blah = T2.Blah - VS - INNER JOIN

Provided that the tables could essentially be inner joined, since the where clause excludes all records that don't match, just exactly how bad is it to use the first of the following 2 query statement syntax styles:
SELECT {COLUMN LIST}
FROM TABLE1 t1, TABLE2 t2, TABLE3 t3, TABLE4 t4 (etc)
WHERE t1.uid = t2.foreignid
AND t2.uid = t3.foreignid
AND t3.uid = t4.foreignid
etc
instead of
SELECT {COLUMN LIST}
FROM TABLE1 t1
INNER JOIN TABLE2 t2 ON t1.uid = t2.foreignid
INNER JOIN TABLE3 t3 ON t2.uid = t3.foreignid
INNER JOIN TABLE4 t4 ON t3.uid = t4.foreignid
I'm not sure if this is limited to Microsoft SQL Server, or even a particular version, but my understanding is that the first scenario does a full outer join to make all possible correlations accessible.
I've used the first approach in the past to optimise queries that access two significantly large stores of data that each have peripheral tables joined to them, with the product of those joins coming together late in the query. By allowing each of the "larger" tables to join to their respective lookup tables, and only combining a specific subset of each of the larger tables, I found that there were notable speed improvements over introducing the large tables to each other prior to specific filtering.
Under normal (simple joins) circumstance, would it not be far better to use the second scenario? I find it to be more easily readable and it seems like it'll be much faster.
INNER JOIN ON vs WHERE clause
Maybe the best way to answer this is to take a look at how the database handles the query internally. If you're on SQL Server, use Profiler to see how many reads etc. each query takes and the query plan to see what route is being taken through the data. Statistics, skewing etc. will also most likely play a role.
The first query doesn't produce a full OUTER join (which is the union of both LEFT and RIGHT joins). Essentially, unless there are some [internal] SQL parser-specific optimizations, both queries are equivalent.
Personally I would never use the first syntax. It may be the same performance-wise, but it is harder to maintain and far more subject to accidental cross joins when things get complex. If you miss an ON condition, it will fail the syntax check; if you miss one of the WHERE conditions that is the equivalent of an ON condition, it will happily do a cross join. It is also a syntax that is 17 years out of date, for goodness' sake!
Further, the left and right join operators in the old syntax are broken in SQL Server and do NOT always return the correct results (it can sometimes interpret the query as a cross join instead of an outer join), and they have been deprecated and will not be usable at all in the next version. If you need to change one of the queries to use an outer join, then you can be looking at a major rewrite, as it is especially bad to try to mix the two kinds of syntax.
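Both points above, the equivalence of the two styles and the accidental cross join, can be checked on a small data set. This is a sketch assuming SQLite (via Python's sqlite3 module), three tables instead of four, and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE1 (uid INTEGER);
CREATE TABLE TABLE2 (uid INTEGER, foreignid INTEGER);
CREATE TABLE TABLE3 (uid INTEGER, foreignid INTEGER);
INSERT INTO TABLE1 VALUES (1), (2);
INSERT INTO TABLE2 VALUES (10, 1), (20, 2);
INSERT INTO TABLE3 VALUES (100, 10), (200, 20);
""")

# Old style: comma-separated tables, join conditions in WHERE.
comma_rows = conn.execute("""
    SELECT t1.uid, t2.uid, t3.uid
    FROM TABLE1 t1, TABLE2 t2, TABLE3 t3
    WHERE t1.uid = t2.foreignid
      AND t2.uid = t3.foreignid
    ORDER BY t1.uid
""").fetchall()

# Explicit style: one ON clause per INNER JOIN.
join_rows = conn.execute("""
    SELECT t1.uid, t2.uid, t3.uid
    FROM TABLE1 t1
    INNER JOIN TABLE2 t2 ON t1.uid = t2.foreignid
    INNER JOIN TABLE3 t3 ON t2.uid = t3.foreignid
    ORDER BY t1.uid
""").fetchall()

# Old style with one WHERE condition forgotten: TABLE3 is silently cross joined.
forgot_rows = conn.execute("""
    SELECT t1.uid, t2.uid, t3.uid
    FROM TABLE1 t1, TABLE2 t2, TABLE3 t3
    WHERE t1.uid = t2.foreignid
""").fetchall()

print(comma_rows == join_rows)  # True
print(len(join_rows), len(forgot_rows))  # 2 4
```

The forgotten condition raises no error; the query simply returns every matching TABLE1/TABLE2 pair combined with every TABLE3 row.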

Mixing Left and right Joins? Why?

I'm doing some refactoring of legacy code I've found in a project. This is for MSSQL. The thing is, I can't understand why we're using mixed left and right joins and collating some of the joining conditions together.
My question is this: doesn't this create implicit inner joins in some places and implicit full joins in others?
I'm of the school that just about anything can be written using just left (and inner/full) or just right (and inner/full), but that's because I like to keep things simple where possible.
As an aside, we convert all this stuff to work on Oracle databases as well, so maybe there are some optimization rules that work differently with Oracle?
For instance, here's the FROM part of one of the queries:
FROM Table1
RIGHT OUTER JOIN Table2
ON Table1.T2FK = Table2.T2PK
LEFT OUTER JOIN Table3
RIGHT OUTER JOIN Table4
LEFT OUTER JOIN Table5
ON Table4.T3FK = Table5.T3FK
AND Table4.T2FK = Table5.T2FK
LEFT OUTER JOIN Table6
RIGHT OUTER JOIN Table7
ON Table6.T6PK = Table7.T6FK
LEFT OUTER JOIN Table8
RIGHT OUTER JOIN Table9
ON Table8.T8PK= Table9.T8FK
ON Table7.T9FK= Table9.T9PK
ON Table4.T7FK= Table7.T7PK
ON Table3.T3PK= Table4.T3PK
RIGHT OUTER JOIN ( SELECT *
FROM TableA
WHERE ( TableA.PK = @PK )
AND ( TableA.Date BETWEEN @StartDate
AND @EndDate )
) Table10
ON Table4.T4PK= Table10.T4FK
ON Table2.T2PK = Table4.T2PK
One thing I would do is make sure you know what results you are expecting before messing with this. Wouldn't want to "fix" it and have different results returned. Although honestly, with a query that poorly designed, I'm not sure that you are actually getting correct results right now.
To me this looks like something that someone did over time maybe even originally starting with inner joins, realizing they wouldn't work and changing to outer joins but not wanting to bother changing the order the tables were referenced in the query.
Of particular concern to me for maintenance purposes is to put the ON clauses next to the tables you are joining as well as converting all the joins to left joins rather than mixing right and left joins. Having the ON clause for table 4 and table 3 down next to table 9 makes no sense at all to me and should contribute to confusion as to what the query should actually return. You may also need to change the order of the joins in order to convert to all left joins. Personally I prefer to start with the main table that the others will join to (which appears to be table2) and then work down the food chain from there.
It could probably be converted to use all LEFT joins: I'd be looking at moving the right-hand table in each RIGHT join above all the existing LEFTs, and then you might be able to turn every RIGHT join into a LEFT join. I'm not sure you'll get any FULL joins behind the scenes -- if the query looks like it does, it might be a quirk of this specific query rather than a SQL Server "rule": the query you've provided does seem to be mixing things up in a rather confusing way.
As for Oracle optimisation -- that's certainly possible. I have no experience of Oracle myself, but according to a friend who's knowledgeable in this area, Oracle (no idea what version) is/was fussy about the order of predicates. For example, with SQL Server you can write your WHERE clause with the columns in any order and indexes will get used, but with Oracle you end up having to specify the columns in the order they appear in the index to get the best performance from it. As stated, I have no idea if this is the case with newer Oracle versions, but it was the case with older ones (apparently).
Whether this explains this particular construction, I can't say. It could simply be less-than-optimal code that has changed over the years, and a clean-up is what it's begging for.
LEFT and RIGHT join are pure syntax sugar.
Any LEFT JOIN can be transformed into a RIGHT JOIN merely by switching the sets.
Pre-9i Oracle used this construct:
WHERE table1.col(+) = table2.col
, (+) here denoting the nullable column, and LEFT and RIGHT joins could be emulated by mere switching:
WHERE table1.col = table2.col(+)
In MySQL, there is no FULL OUTER JOIN and it needs to be emulated.
Usually it is done this way:
SELECT *
FROM table1
LEFT JOIN
table2
ON table1.col = table2.col
UNION ALL
SELECT *
FROM table1
RIGHT JOIN
table2
ON table1.col = table2.col
WHERE table1.col IS NULL
, and it's more convenient to copy the JOIN and replace LEFT with RIGHT, than to swap the tables.
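Since a RIGHT JOIN is a LEFT JOIN with the operands swapped, the same emulation can be written with LEFT JOINs only. The sketch below does that on SQLite (via Python's sqlite3 module) rather than MySQL, with made-up rows; it uses an explicit column list instead of SELECT * so that both branches return their columns in the same order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (col INTEGER);
CREATE TABLE table2 (col INTEGER);
INSERT INTO table1 VALUES (1), (2);
INSERT INTO table2 VALUES (2), (3);
""")

rows = conn.execute("""
    -- Every table1 row, matched or not.
    SELECT table1.col, table2.col
    FROM table1
    LEFT JOIN table2 ON table1.col = table2.col
    UNION ALL
    -- table1 RIGHT JOIN table2, written as table2 LEFT JOIN table1;
    -- only the table2 rows without a match are added.
    SELECT table1.col, table2.col
    FROM table2
    LEFT JOIN table1 ON table1.col = table2.col
    WHERE table1.col IS NULL
""").fetchall()

for row in rows:
    print(row)
```

The result holds one row for the unmatched value 1, one for the matched value 2 and one for the unmatched value 3, which is what a FULL OUTER JOIN on col would return for this data.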
Note that in SQL Server plans, Hash Left Semi Join and Hash Right Semi Join are different operators.
For the query like this:
SELECT *
FROM table1
WHERE table1.col IN
(
SELECT col
FROM table2
)
, Hash Match (Left Semi Join) hashes table1 and removes the matched elements from the hash table in runtime (so that they cannot match more than one time).
Hash Match (Right Semi Join) hashes table2 and removes the duplicate elements from the hash table while building it.
I may be missing something here, but the only difference between LEFT and RIGHT joins is the order in which the source tables are written, and so having multiple LEFT joins or multiple RIGHT joins is no different to having a mix. The equivalence to FULL OUTERs could be achieved just as easily with all LEFT/RIGHT as with a mix, n'est-ce pas?
We have some LEFT OUTER JOINs and RIGHT OUTER JOINs in the same query. Typically such queries are large, have been around a long time, probably badly written in the first place and have received infrequent maintenance. I assume the RIGHT OUTER JOINs were introduced as a means of maintaining the query without taking on the inevitable risk when refactoring a query significantly.
I think most SQL coders are most comfortable with using all LEFT OUTER JOINs, probably because a FROM clause is read left-to-right in the English way.
The only time I use a RIGHT OUTER JOIN myself is when writing a new query based on an existing query (no need to reinvent the wheel) and I need to change an INNER JOIN to an OUTER JOIN. Rather than change the order of the JOINs in the FROM clause just to be able to use a LEFT OUTER JOIN, I would instead use a RIGHT OUTER JOIN, and this would not bother me. This is quite rare though. If the original query had LEFT OUTER JOINs then I'd end up with a mix of LEFT and RIGHT OUTER JOINs, which again wouldn't bother me. Hasn't happened to me yet, though.
Note that for SQL products such as the Access database engine that do not support FULL OUTER JOIN, one workaround is to UNION a LEFT OUTER JOIN and a RIGHT OUTER JOIN in the same query.
The bottom line is that this is a very poorly formatted SQL statement and should be re-written. Many of the ON clauses are located far from their JOIN statements, which I am not sure is even valid SQL.
For clarity's sake, I would rewrite the query using all LEFT JOINs (rather than RIGHT), and place the ON clauses directly underneath their corresponding JOIN clauses. Otherwise, this is a bit of a train wreck and is obfuscating the purpose of the query, making errors during future modifications more likely to occur.
doesn't this create implicit inner joins in some places and implicit full joins in others?
Perhaps you are assuming that because you don't see the ON clause right after some joins, e.g., RIGHT OUTER JOIN Table4; but it is there, located further down: ON Table4.T7FK= Table7.T7PK. I don't see any implicit inner joins, which could occur if there were a WHERE clause like WHERE Table3.T3PK is not null.
The fact that you are asking questions like this is a testament to the opaqueness of the query.
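For reference, this is how a WHERE clause of that kind turns an outer join into an implicit inner join. The sketch assumes SQLite (via Python's sqlite3 module), cut-down versions of Table3 and Table4 with only the columns needed here, and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table3 (T3PK INTEGER);
CREATE TABLE Table4 (T4PK INTEGER, T3PK INTEGER);
INSERT INTO Table3 VALUES (1);
-- The second Table4 row has no matching Table3 row.
INSERT INTO Table4 VALUES (1, 1), (2, 2);
""")

outer_rows = conn.execute("""
    SELECT Table4.T4PK, Table3.T3PK
    FROM Table4
    LEFT OUTER JOIN Table3 ON Table3.T3PK = Table4.T3PK
    ORDER BY Table4.T4PK
""").fetchall()

# Same join, plus a WHERE test on the outer-joined table.
filtered_rows = conn.execute("""
    SELECT Table4.T4PK, Table3.T3PK
    FROM Table4
    LEFT OUTER JOIN Table3 ON Table3.T3PK = Table4.T3PK
    WHERE Table3.T3PK IS NOT NULL
    ORDER BY Table4.T4PK
""").fetchall()

inner_rows = conn.execute("""
    SELECT Table4.T4PK, Table3.T3PK
    FROM Table4
    INNER JOIN Table3 ON Table3.T3PK = Table4.T3PK
    ORDER BY Table4.T4PK
""").fetchall()

print(outer_rows)     # [(1, 1), (2, None)]
print(filtered_rows)  # [(1, 1)]
print(inner_rows)     # [(1, 1)]
```

The WHERE test removes exactly the NULL-padded rows that the outer join added, which leaves the inner join result.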
To answer another portion of this question that hasn't been answered yet: the reason this query is formatted so oddly is that it was likely built using the Query Designer inside SQL Server Management Studio. The giveaway is the combined ON clauses that appear many lines after the table is mentioned. Essentially, tables get added in the build query window and the order is kept, even if the way things are connected would favor moving a table up, so to speak, and keeping all the joins in a certain direction.