Why do multiple joins take forever? - google-bigquery

I believe I found a bug in Google BigQuery but I'm not sure.
I am hoping someone could offer a workaround.
I'm running on a table with only 200K rows of data.
On my attempts to do a funnel analysis I stumbled upon the following bizarre behaviour:
This takes ~3 seconds:
SELECT
  COUNT(DISTINCT Q0._user_id) AS step0
FROM
  (SELECT _user_id FROM [5629499534213120.201501]) AS Q0
LEFT OUTER JOIN
  (SELECT _user_id, _time FROM [5629499534213120.201501] WHERE _os=='Windows') AS Q1
  ON (Q0._user_id=Q1._user_id)
This takes ~3 Minutes:
SELECT
  COUNT(DISTINCT Q0._user_id) AS step0
FROM
  (SELECT _user_id FROM [5629499534213120.201501]) AS Q0
LEFT OUTER JOIN
  (SELECT _user_id, _time FROM [5629499534213120.201501] WHERE _os=='Windows') AS Q1
  ON (Q0._user_id=Q1._user_id)
LEFT OUTER JOIN
  (SELECT _user_id, _time FROM [5629499534213120.201501] WHERE _country=='de') AS Q2
  ON (Q0._user_id=Q2._user_id)
In other words, adding one more LEFT JOIN makes the query unbelievably slow (and again, we're talking about only 200K rows of data).
Obviously, I have simplified the SELECT statement so you can focus on the main issue (the real SELECT statement I used is far more complicated).
Does anyone know what's the problem, or a workaround for it?

I responded to this on the BigQuery issue tracker, but I'm reposting my answer here:
I'm a bigquery engineer and I looked up your query in our logs.
What you're seeing is a join explosion.
You did a 3-way self-join with non-unique keys. The field "_user_id" had a single value that matched 3937 rows on the left, 1388 rows in the first join, and 1388 rows in the second join.
That means you're creating 3937*1388*1388, or roughly 7.5 billion, output rows. (You then did a COUNT DISTINCT over them to reduce the output size, but the intermediate rows needed to be created first.)
It is not surprising that creating 7.5 billion intermediate rows would take a couple of minutes, especially since they were all from a single key, and hence had to be produced by a single worker task.
My guess is that it would be possible to restructure your query to avoid the join explosion.
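One possible restructuring (a sketch, not from the original thread, and assuming the funnel only needs to know whether each user matched each step rather than needing the _time values): collapse each side to at most one row per user before joining, so no key can fan out:
SELECT
  COUNT(Q0._user_id) AS step0
FROM
  (SELECT _user_id FROM [5629499534213120.201501] GROUP BY _user_id) AS Q0
LEFT OUTER JOIN
  -- at most one row per user per step, so the join cannot explode
  (SELECT _user_id FROM [5629499534213120.201501] WHERE _os=='Windows' GROUP BY _user_id) AS Q1
  ON (Q0._user_id=Q1._user_id)
LEFT OUTER JOIN
  (SELECT _user_id FROM [5629499534213120.201501] WHERE _country=='de' GROUP BY _user_id) AS Q2
  ON (Q0._user_id=Q2._user_id)
Since every side now contributes at most one row per key, the output is one row per distinct user instead of billions of intermediate rows, and the COUNT(DISTINCT ...) can become a plain COUNT.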

I am not familiar with BigQuery specifically, but I suspect the inner queries (SELECT _user_id, _time FROM [...) are retrieving the entire table.
What about rewording the query as follows:
SELECT
  COUNT(DISTINCT Q0._user_id) AS step0
FROM
  [5629499534213120.201501] AS Q0
LEFT OUTER JOIN [5629499534213120.201501] AS Q1
  ON (Q0._user_id=Q1._user_id)
LEFT OUTER JOIN [5629499534213120.201501] AS Q2
  ON (Q0._user_id=Q2._user_id)
WHERE Q1._os=='Windows'
  AND Q2._country=='de'
As far as I can tell, the result should be the same; wording it like this should hopefully allow the database to use indexes (if the database is properly normalized).

Related

Understanding when to use a subquery over a join

I seem to be missing something. Most articles I read say you should use a join instead of a sub-select, yet running a quick experiment myself shows a big win for the sub-query when it comes down to execution time.
Trying to get all first names of people that have made a bid (I presume the tables speak for themselves) results in the following.
This join takes 10 seconds
select U.firstname
from Bid B
inner join [User] U on U.userName = B.[user]
This query with sub-query takes 3 seconds
select firstname
from [User]
where userName in (select [user] from bid)
Why is my experiment not in line with what I keep reading everywhere or am I missing something?
Experimenting further, I found that execution times are the same after adding DISTINCT to both.
They're not the same thing. In the query with the join you can potentially multiply rows, or have rows removed from the results entirely.
An INNER JOIN removes rows whose keys have no match, and it multiplies rows for any matched key that repeats in one or both of the joined tables. An INNER JOIN therefore goes through the additional step of multiplying and removing rows.
The subquery you used is a plain SELECT. Since it has no WHERE filter of its own it is as fast as a simple SELECT, and since there is no join, you get results as fast as the rows can be selected.
Some may argue that outer joins return NULLs similarly to sub-queries, but they can still multiply rows. Hence, sub-queries and joins are not the same thing.
In the queries you provided, you want the second one (with the subquery), since it doesn't multiply or remove rows.
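If you want join-like syntax without the row multiplication, the usual alternative is a semi-join via EXISTS - a sketch using the question's tables (not from the original answer):
-- Semi-join: each User row is returned at most once,
-- no matter how many bids that user has placed.
select U.firstname
from [User] U
where exists (select 1
              from Bid B
              where B.[user] = U.userName)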
Good Read for Subquery vs Inner Join
https://www.essentialsql.com/subquery-versus-inner-join/

Fast Query with Left Join To Get Record With Latest Date

For this problem, I have 2 main existing postgres tables I am working with. The first table is named client, the second table is named task.
A single client can have multiple tasks, each with its own scheduled_date and scheduled_time.
I'm trying to run a query that will return a list of all clients along with the date/time of their latest task.
Currently, my query works and looks something like this...
SELECT
    c.*,
    (t1.scheduled_date||' '||t1.scheduled_time)::timestamp AS latest_task_datetime
FROM client c
LEFT JOIN task t1
    ON t1.client_id = c.client_id
LEFT JOIN task t2
    ON t2.client_id = t1.client_id
    AND ((t1.scheduled_date||' '||t1.scheduled_time)::timestamp < (t2.scheduled_date||' '||t2.scheduled_time)::timestamp
         OR ((t1.scheduled_date||' '||t1.scheduled_time)::timestamp = (t2.scheduled_date||' '||t2.scheduled_time)::timestamp
             AND t1.task_id < t2.task_id))
-- presumably elided from the simplified question: keep t1 only when no later task exists
WHERE t2.task_id IS NULL;
The problem is that the actual query I'm working with involves a lot more tables (7+), each with a lot of data, so the two left joins shown above slow the query's execution from 4 seconds to almost 45 seconds, which of course is very bad.
Does anyone know a possible faster way to write this query to run more efficiently?
A question you might have after seeing this: why are scheduled_date and scheduled_time separate columns instead of a single timestamp column? The answer is that these are existing tables I can't change, at least not easily, without a lot of work across the entire server to support it.
Edit: Not quite the solution, but I just ended up doing it a different way. (See my comment below)
If you want to get multiple columns of information from different tables -- but one row for each client and his/her latest task, then you can use distinct on:
SELECT DISTINCT ON (c.client_id) c.*, t.*
FROM client c
LEFT JOIN task t
    ON t.client_id = c.client_id
ORDER BY c.client_id, t.scheduled_date DESC, t.scheduled_time DESC;
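Another option in Postgres (a sketch of my own, not part of the original answer; it assumes Postgres 9.3+ and benefits from an index on task(client_id, scheduled_date, scheduled_time)) is a LEFT JOIN LATERAL that picks each client's latest task directly:
SELECT c.*,
       (latest.scheduled_date||' '||latest.scheduled_time)::timestamp AS latest_task_datetime
FROM client c
LEFT JOIN LATERAL (
    -- fetch just this client's most recent task
    SELECT t.scheduled_date, t.scheduled_time
    FROM task t
    WHERE t.client_id = c.client_id
    ORDER BY t.scheduled_date DESC, t.scheduled_time DESC
    LIMIT 1
) latest ON true;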

JOIN versus EXISTS performance

Generally speaking, is there a performance difference between using a JOIN to select rows versus an EXISTS where clause? Searching various Q&A web sites suggests that a join is more efficient, but I recall learning a long time ago that EXISTS was better in Teradata.
I do see other SO answers, like this and this, but my question is specific to Teradata.
For example, consider these two queries, which return identical results:
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
join MY_TARGET_TABLE x
on x.srv_accs_id=svc.srv_accs_id
group by 1
order by 1
-and-
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
where exists(
    select 1
    from MY_TARGET_TABLE x
    where x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1
The primary index (unique) on both tables is 'srv_accs_id'. MY_BASE_TABLE is rather large (200 million rows) and MY_TARGET_TABLE relatively small (200,000 rows).
There is one significant difference in the EXPLAIN plans: The first says the two tables are joined "by way of a RowHash match scan" and the second says "by way of an all-rows scan". Both say it is "an all-AMPs JOIN step" and the total estimated time is identical (0.32 seconds).
Both queries perform the same (I'm using Teradata 13.10).
A similar experiment to find non-matches, comparing a LEFT OUTER JOIN with a corresponding IS NULL where-clause against a NOT EXISTS sub-query, does show a performance difference:
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
left outer join MY_TARGET_TABLE x
on x.srv_accs_id=svc.srv_accs_id
where x.srv_accs_id is null
group by 1
order by 1
-and-
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
where not exists(
    select 1
    from MY_TARGET_TABLE x
    where x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1
The second query's plan is faster (2.14 versus 2.21 seconds, as estimated by EXPLAIN).
My example may be too trivial to see a difference; I'm just looking for coding guidance.
NOT EXISTS is more efficient than using a LEFT OUTER JOIN with an IS NULL condition to exclude records missing from the joined table, because the optimizer will elect to use an EXCLUSION MERGE JOIN for the NOT EXISTS predicate.
While your second test did not yield impressive results for the data sets you were using, the performance gain of NOT EXISTS over a LEFT JOIN becomes very noticeable as your data volume increases. Keep in mind that the tables need to be hash-distributed on the columns participating in the NOT EXISTS join, just as they would be for the LEFT JOIN; data skew can therefore impact the performance of the EXCLUSION MERGE JOIN.
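For illustration, distribution in Teradata is set by the table's PRIMARY INDEX, so both tables would be defined along these lines (a hypothetical DDL sketch, not from the original answer):
CREATE TABLE MY_TARGET_TABLE (
    srv_accs_id INTEGER NOT NULL
    -- ...other columns...
) UNIQUE PRIMARY INDEX (srv_accs_id);
With both tables hashed on srv_accs_id, matching rows land on the same AMP and the exclusion join needs no redistribution.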
EDIT:
Typically, I would reach for EXISTS as a replacement for IN, rather than using it to rewrite a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN; instead of an EXCLUSION JOIN you would end up with an INCLUSION JOIN (an INNER JOIN is, in essence, an inclusion join to begin with). I'm sure there are some nuances I am overlooking, but you can find those in the manuals if you wish to take the time to read them.
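To make the NULL caveat concrete, here is a generic sketch (not from the original answer): when the subquery can return NULLs, NOT IN silently matches nothing, while NOT EXISTS behaves as intended.
-- If any x.srv_accs_id is NULL, this returns zero rows:
select svc.ltv_scr
from MY_BASE_TABLE svc
where svc.srv_accs_id not in (select x.srv_accs_id
                              from MY_TARGET_TABLE x)
-- NOT EXISTS is NULL-safe and returns the expected non-matches:
select svc.ltv_scr
from MY_BASE_TABLE svc
where not exists (select 1
                  from MY_TARGET_TABLE x
                  where x.srv_accs_id = svc.srv_accs_id)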

Question about SQL Server Optimization Sub Query vs. Join

Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?
SELECT DISTINCT S#
FROM shipments
WHERE P# IN (SELECT P#
             FROM parts
             WHERE color = 'Red')
vs.
SELECT DISTINCT S#
FROM shipments, parts
WHERE shipments.P# = parts.P#
AND parts.color = 'Red'
The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the Execution Plan; SQL Profiler is worth a look as well. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.
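For example (a quick sketch of the standard tooling, not from the original answer): in SSMS, enable the actual execution plan with Ctrl+M, and turn on the timing and I/O counters before running each candidate query:
-- Reports logical reads and CPU/elapsed time in the Messages tab
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run each candidate query here and compare the numbers
SELECT DISTINCT S#
FROM shipments s
JOIN parts p ON p.P# = s.P# AND p.color = 'Red';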
Update: from here on, this answer has been revised after some hands-on testing.
The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:
Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)
yielded a table scan on Table1 and a mere index scan on table 2 followed by a right semi join.
A query of the form:
Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID)
Where A.Gender='M'
yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.
These are not equivalent queries, though, since the second one may return multiple identical records (one for each matching record in Table2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you, Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.
Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:
Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B
on (A.UID=B.UID)
Where A.Gender='M'
The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.
When you add a predicate to the search on table 2 using a subquery:
Select F1, F2, F3 From Table1 Where F4='X' And UID in
(Select UID From Table2 Where F1='Y')
then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.
When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.
One last thing: someone suggested (and your question now includes) a query of the form:
Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'
The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out difference but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.
With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.
So: why choose one over the other? It really comes down to personal preference, since there appears to be no execution-time advantage of an inner join over a subquery across the range of examples I tested.
In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.
So, the takeaway is simply that the best solution is the one that makes your development the most efficient.
Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:
SELECT DISTINCT S#
FROM SHIPMENTS s
JOIN PARTS p ON p.p# = s.p#
AND p.color = 'Red'
Check your explain plans to see which is better, because it depends on data and table setup.
If you aren't selecting anything from the table I would use an EXISTS clause.
SELECT DISTINCT S#
FROM shipments a
WHERE EXISTS (SELECT 1
              FROM parts b
              WHERE b.color = 'Red'
              AND a.P# = b.P#)
This will optimize out to be the same as the second one you posted.
SELECT DISTINCT S#
FROM shipments,parts
WHERE shipments.P# = parts.P# and parts.color = 'Red';
Using IN forces SQL Server not to use indexing on that column, and subqueries are usually slower

Understanding how JOIN works when 3 or more tables are involved. [SQL]

I wonder if anyone can help improve my understanding of JOINs in SQL. [If it is significant to the problem, I am thinking MS SQL Server specifically.]
Take 3 tables A, B [A related to B by some A.AId], and C [B related to C by some B.BId]
If I compose a query, e.g.
SELECT *
FROM A JOIN B
ON A.AId = B.AId
All good - I'm sweet with how this works.
What happens when table C (or some other D, E, ...) gets added?
In the situation
SELECT *
FROM A JOIN B
ON A.AId = B.AId
JOIN C ON C.BId = B.BId
What is C joining to - is it the B table (and the values therein)?
Or is it some other temporary result set that is the result of the A+B Join that the C table is joined to?
[The implication being not all values that are in the B table will necessarily be in the temporary result set A+B based on the join condition for A,B]
A specific (and fairly contrived) example of why I am asking is because I am trying to understand behaviour I am seeing in the following:
Tables
Account (AccountId, AccountBalanceDate, OpeningBalanceId, ClosingBalanceId)
Balance (BalanceId)
BalanceToken (BalanceId, TokenAmount)
Where:
Account -> Opening and Closing Balances are NULLABLE
(an account may have an opening balance, a closing balance, or neither)
Balance -> BalanceToken is 1:m - a balance could consist of many tokens
Conceptually, the closing balance of one date would be the next day's opening balance
If I was trying to find a list of all the opening and closing balances for an account
I might do something like
SELECT AccountId
, AccountBalanceDate
, Sum (openingBalanceAmounts.TokenAmount) AS OpeningBalance
, Sum (closingBalanceAmounts.TokenAmount) AS ClosingBalance
FROM Account A
LEFT JOIN BALANCE OpeningBal
ON A.OpeningBalanceId = OpeningBal.BalanceId
LEFT JOIN BALANCE ClosingBal
ON A.ClosingBalanceId = ClosingBal.BalanceId
LEFT JOIN BalanceToken openingBalanceAmounts
ON openingBalanceAmounts.BalanceId = OpeningBal.BalanceId
LEFT JOIN BalanceToken closingBalanceAmounts
ON closingBalanceAmounts.BalanceId = ClosingBal.BalanceId
GROUP BY AccountId, AccountBalanceDate
Things work as I would expect until the last JOIN brings in the closing balance tokens - where I end up with duplicates in the result.
[I can fix with a DISTINCT - but I am trying to understand why what is happening is happening]
I have been told the problem is the 1:M relationship between Balance and BalanceToken: when I bring in the last JOIN I get duplicates, because the 3rd JOIN has already brought BalanceIds into the (I assume) temporary result set multiple times.
I know that the example tables do not conform to good DB design
Apologies for the essay, and thanks for any enlightenment :)
Edit in response to question by Marc
Conceptually, for an account there should not be duplicates in BalanceToken per accounting date. I think the problem comes about because one account/accounting-date's closing balance is that account's opening balance for the next day, so when self-joining to Balance and BalanceToken multiple times to get opening and closing balances, the same BalanceIds are brought into the 'result mix' multiple times. If it helps to clarify the second example, think of it as a daily reconciliation - hence the left joins - since an opening and/or closing balance may not have been calculated for a given account/accounting-date combination.
Conceptually, here is what happens when you join three tables together:
1. The optimizer comes up with a plan, which includes a join order. It could be A, B, C; or C, B, A; or any other combination.
2. The query execution engine applies any predicates (WHERE clause) to the first table that don't involve any of the other tables, and selects out the columns mentioned in the JOIN conditions, the SELECT list, or the ORDER BY list. Call this result A.
3. It joins this result set to the second table, applying any predicates that apply to the second table as it goes. This produces another temporary result set.
4. Then it joins in the final table and applies the ORDER BY.
This is conceptually what happens. In fact there are many possible optimizations along the way. The advantage of the relational model is that its sound mathematical basis makes various transformations of the plan possible without changing correctness.
For example, there is really no need to generate the full result sets along the way; the ORDER BY may instead be satisfied by accessing the data through an index in the first place. There are also many types of join that can be used.
We know that the data from B is going to be filtered by the (inner) join to A (and the data in A is filtered likewise). So if we then (inner) join from B to C, the set C is also filtered by its relationship to A. Note also that any duplicates from the join will be included.
However, the order this happens in is up to the optimizer; it could decide to do the B/C join first and then introduce A, or any other sequence (probably based on the estimated number of rows from each join and the available indexes).
HOWEVER, in your later example you use a LEFT OUTER join, so Account is not filtered at all, and its rows may well be duplicated if any of the other tables have multiple matches.
Are there duplicates (per account) in BalanceToken?
I often find it helps to view the actual execution plan. In Query Analyzer/Management Studio you can turn this on from the Query menu, or with Ctrl+M. After running the query, the plan that was executed is shown in another result tab. From this you'll see that C and B are joined first, and then the result is joined with A. The plan might vary depending on the information the DBMS has; because both joins are inner, the result is the same regardless of which is joined first, but the time it takes might differ greatly, and this is where the optimiser and hints come into play.
Joins can be tricky, and much of the behaviour is dictated by how the data is stored in the actual tables.
Without seeing the tables it's hard to give a clear answer in your particular case, but I think the basic issue is that you are summing over multiple result sets that have been combined into one: if an account's opening balance has 2 tokens and its closing balance has 3, the two independent 1:m joins produce 2*3 = 6 rows, so both sums are inflated.
Perhaps instead of multiple joins you should build two separate derived tables in your query - one with the accountID, date, and sum of opening balances, a second with the accountID, date, and sum of closing balances - and then join those two on accountID and date, as sketched below.
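Something along these lines (a hypothetical sketch based only on the schema given in the question - untested):
SELECT A.AccountId,
       A.AccountBalanceDate,
       ob.Amount AS OpeningBalance,
       cb.Amount AS ClosingBalance
FROM Account A
-- each derived table has exactly one row per BalanceId,
-- so the two LEFT JOINs cannot cross-multiply
LEFT JOIN (SELECT BalanceId, SUM(TokenAmount) AS Amount
           FROM BalanceToken
           GROUP BY BalanceId) ob
    ON ob.BalanceId = A.OpeningBalanceId
LEFT JOIN (SELECT BalanceId, SUM(TokenAmount) AS Amount
           FROM BalanceToken
           GROUP BY BalanceId) cb
    ON cb.BalanceId = A.ClosingBalanceId
Because the token sums are computed before the joins, no GROUP BY or DISTINCT is needed in the outer query.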
In order to find out exactly what is happening with joins, also in your specific case, I would do the following:
Change the initial part
SELECT AccountId, AccountBalanceDate, SUM(...) AS OpeningBalance,
SUM(...) AS ClosingBalance FROM
to simply
SELECT * FROM
Study the resulting table, and you will see exactly what data is being duplicated. Remove the joins one by one and see what happens. This should give you a clue to what it is about your particular data that is causing the dupes.
If you open the query in SQL server management studio (Free version exists) you can edit the query in the designer. The visual view of how the tables are being joined might also help you realize what's going on.