Inner Joining big tables in Big Query - google-bigquery

I am trying to perform an inner join between two big tables, each with almost 30 million records. When I run a simple INNER JOIN between these two tables I get the error below, asking me to use the JOIN EACH syntax, but I couldn't find any proper documentation for JOIN EACH in the Google references. Can somebody share thoughts on this? Here is the error:
Error: Table too large for JOIN. Consider using JOIN EACH. For more details, please see https://developers.google.com/bigquery/docs/query-reference#joins

Looking at your question, it seems like all you need is to read up a bit on the available documentation.
Now, having read Jordan Tigani's book, I can tell you that when you do a regular join, the system broadcasts the smaller table to every shard that handles your query. Since neither of your tables is under 8 MB, that simply cannot happen (the table is too big to send around).
The way "JOIN EACH" works is that it tells the system "hash the join keys on both tables, and send each subset to a specific shard". Because of the hashing, rows that share the same join key end up in the same shard. This has an impact on performance, but it's the only thing that can make a JOIN go through when both tables are bigger than 8 MB.

Related

Do joins on multiple columns store the cartesian product?

I transliterated a script from T-SQL to U-SQL and ran into an issue running the job, namely that it seemed to get "stuck" on one of the stages - after 2.5 hours the job graph showed it had read in 200MB and had written over 3TB but wasn't anywhere near finished. (Didn't take a screenshot, sorry.)
I tracked it down to one of the queries joining a table with 34 million rows twice to a table with 1600 rows:
#ProblemQuery =
    SELECT
        gp.[Group],      // 16 groups
        gp.[Percentile], // 1-100
        my_fn(lt1.[Value], lt2.[Value], gp.[Value]) AS CalculatedNumber
    FROM #LargeTable AS lt1
         INNER JOIN #GroupPercent AS gp
         ON lt1.[Group] == gp.[Group]
            AND lt1.[Row ID] == gp.[Row ID 1]
         INNER JOIN #LargeTable AS lt2
         ON gp.[Group] == lt2.[Group]
            AND gp.[Row ID 2] == lt2.[Row ID];
It seems that the full cartesian product (~2e18 rows) is being stored during the processing rather than just the filtered 1600 rows. My first thought was that it might be from using AND rather than &&, but changing that made no difference.
I've managed to work around this by splitting the one query with two joins into two queries with one join each, and the whole job completed in under 15 minutes without a storage blowout.
But it's not clear to me whether this is fully expected behaviour when multiple columns are used in the join or a bug, and whether there's a better approach to this sort of thing. I've got another similar query to split up (with more joins, and more columns in the join condition) and I can't help but feel there's got to be a less messy way of doing this.
U-SQL applies some join reorder heuristics (although I don't know how it deals with the apparent self-join). I doubt it is related to you using multiple columns in the join predicate. I assume that our heuristic may be off. Can you please either file an incident or send me the job link to [usql] at microsoft dot com? That way we can investigate what causes the optimizer to pick the worse plan.
Until then, splitting the joins into two statements and thus forcing the better join order is the best workaround.
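For anyone hitting the same wall, the split looks roughly like this (a sketch only, reusing the rowset naming convention and == comparisons from the question; the intermediate rowset #Step1 and its column aliases are my own):
// Step 1: join the large table to the small percentile rowset once.
#Step1 =
    SELECT
        gp.[Group],
        gp.[Percentile],
        gp.[Value] AS GroupValue,
        gp.[Row ID 2] AS RowId2,
        lt1.[Value] AS Value1
    FROM #LargeTable AS lt1
         INNER JOIN #GroupPercent AS gp
         ON lt1.[Group] == gp.[Group]
            AND lt1.[Row ID] == gp.[Row ID 1];

// Step 2: join back to the large table and compute the result.
#ProblemQuery =
    SELECT
        s.[Group],
        s.[Percentile],
        my_fn(s.Value1, lt2.[Value], s.GroupValue) AS CalculatedNumber
    FROM #Step1 AS s
         INNER JOIN #LargeTable AS lt2
         ON s.[Group] == lt2.[Group]
            AND s.RowId2 == lt2.[Row ID];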

SQL SELECT JOIN Query over multiple tables

I’m trying to get data about Installations in buildings. The problem is that one building can have multiple installations and I’m unsure how to adjust my sql for that as the initial table I query only holds the relations that own the buildings.
Here’s the situation.
Table 1 (RELRLGRP) holds the id of the group of relations that own the buildings that have the installations that hold the data I need.
This is what I have so far. I'm worried I shouldn't use this many joins in one SQL statement, but I cannot find a shorter link between my starting point at the group of relations and the installation data I seek in the BORGINST table. Please disregard the SELECT portion of the statement (removed it for clarity).
SELECT *
FROM RELRLGRP A
JOIN RELATION R ON A.RELATION_GC_ID = R.GC_ID
JOIN BUILDING G ON R.CODE = G.GC_CODE
JOIN INSTALL I ON G.GC_CODE = I.GC_CODE
JOIN BORGINST B ON I.GC_ID = B.GC_ID
WHERE A.RELGROUP_GC_ID LIKE '100109' -- the group the relations belong to
I’ve done some rudimentary SQL but this linking through tables is new territory for me, in that sense I’d be happy to know if this many join statements are the way to go or if I should head a different route entirely.
JesseJ - Since I don't know all the columns that exist in your tables, I am going to assume that you are joining on primary keys. If this is the case, your solution may be the only one available to link the RELRLGRP to the BORGINST table.
Linking multiple tables like you are doing can be common in a normalized database.
Example:
In the example I posted, in order to find the State where a particular transaction happened, you have to link all the tables together. There is no shortcut.
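As a purely hypothetical illustration (none of these tables or columns come from your schema), finding the State for one transaction in a normalized design means walking the whole chain of keys:
SELECT s.StateName
FROM Transactions t
JOIN Stores st ON t.StoreID = st.StoreID
JOIN Cities c ON st.CityID = c.CityID
JOIN States s ON c.StateID = s.StateID
WHERE t.TransactionID = 12345;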
Don't sweat it: I have views with three times as many joins. Every join does add complexity and consume more processor time, but it really comes down to performance: if this process doesn't finish as quickly as you need it to, you can look into other methods, but otherwise multiple joins like this are perfectly fine.

Does my previous SQL query/ies affect my current query?

I have multiple SQL queries that I run one after the other to get a set of data. In each query, there are a bunch of tables joined that are exactly the same with the other queries. For example:
Query1
SELECT * FROM
Product1TableA A1
INNER JOIN Product1TableB B on A1.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
Query2
SELECT * FROM Product2TableA A2
INNER JOIN Product2TableB B on A2.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
I am playing around with re-ordering the joins (around two dozen tables are joined per query), and I read here that the order should not really affect query execution unless SQL "gives up" during optimization because of how big the query is...
What I am wondering is if bunching up common table joins at the start of all my queries actually helps...
In theory, the order of the joins in the from clause doesn't make a difference on query performance. For a small number of tables, there should be no difference. The optimizer should find the best execution path.
For a larger number of tables, the optimizer may have to short-circuit its search regarding join order. It would then be using heuristics -- and these could be affected by join order.
Earlier queries would have no effect on a particular execution plan.
If you are having problems with performance, I am guessing that join order is not the root cause. The most common problem that I see in SQL Server is inappropriate nested-loop joins -- and these can be handled with an optimizer hint.
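For example, in SQL Server either a join-level hint or a query-level hint will steer the optimizer away from a nested-loop plan (the table and column names here are made up for illustration):
-- Join-level hint: force a hash join for this particular join.
SELECT o.OrderID, c.CustomerName
FROM Orders o
INNER HASH JOIN Customers c ON c.CustomerID = o.CustomerID;

-- Query-level hint: restrict the whole query to hash joins.
SELECT o.OrderID, c.CustomerName
FROM Orders o
INNER JOIN Customers c ON c.CustomerID = o.CustomerID
OPTION (HASH JOIN);
Note that specifying a join hint in the FROM clause also fixes the join order for that query, so it is worth checking the resulting plan.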
I think I understood what he was trying to say/to do:
What I am wondering is if bunching up common table joins at the start
of all my queries actually helps...
Imagine that you have some queries and every query has more than 3 inner joins. The queries are different but always have (for example) 3 tables in common that are joined on the same fields. Now the question is:
what will happen if every query will start with these 3 tables in join, and all the other tables are joined after?
The answer is that it will change nothing, i.e. the optimizer will rearrange the tables in whatever way it thinks will lead to the optimal execution.
Things may change if, for example, you save the result of these 3 joins into a temporary table and then use this saved result to join with the other tables. But this depends on the filters that your queries use. If you have appropriate indexes and your query filters are selective enough (so that your query returns very few rows), there is no need to cache an intermediate, unfiltered result with too many rows, because the optimizer can choose to filter every table first and only then join them.
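To make the temporary-table idea concrete, here is a T-SQL sketch; CommonTable3 and the join keys (GroupId) are invented purely to show the shape, since in the queries above the shared tables actually join through the product tables:
-- Cache the result of joining the shared tables once (join keys are hypothetical).
SELECT C.Id AS CId, D.Id AS DId, E.Id AS EId
INTO #CommonJoined
FROM CommonTable1 C
INNER JOIN CommonTable2 D ON D.GroupId = C.GroupId
INNER JOIN CommonTable3 E ON E.GroupId = C.GroupId;

-- Each product query then joins to the cached rowset instead of repeating the joins.
SELECT *
FROM Product1TableA A1
INNER JOIN Product1TableB B ON A1.BId = B.Id
INNER JOIN #CommonJoined X ON X.CId = B.CId AND X.DId = B.DId;
As noted above, this only pays off when the cached result is small relative to what the filtered joins would otherwise touch.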
Gordon's answer is a good explanation, but this answer explains the JOIN's behavior and also specifies that SQL Server's version is relevant:
Although the join order is changed in optimisation, the optimiser
doesn't try all possible join orders. It stops when it finds what it
considers a workable solution, as the very act of optimisation uses
precious resources.
While the optimizer tries its best in choosing a good order for the JOINs, having many JOINs creates a bigger chance of obtaining a not so good plan.
Personally, I have seen many JOINs in some views within an ERP and they usually ran ok. However, from time to time (based on client's data volume, instance configuration etc.), some selects from these views took much more than expected.
If this data reaches an actual application (.NET, Java etc.), one approach is to cache the information from all the small tables, store it as dictionaries (hashes) and perform O(1) lookups based on the keys.
This provides the advantages of reducing the JOIN count and not performing reads from the database for these tables (except once when caching data). However, this increases the complexity of the application (cache management).
Another solution is to use temporary tables and populate them across multiple queries to avoid many JOINs in a single query. This solution usually performs better and also improves debuggability (if the query does not return the correct data, or no data at all, which of the 10-15 JOINs is the problem?).
So, my answer to your question is: you might get some benefit from reordering the JOIN clauses, but I recommend avoiding lots of JOINs in the first place.

Making a more efficient join

Here's my query, it is fairly straightforward:
SELECT
    INVOICE_ITEMS.II_IVNUM, INVOICE_ITEMS.IIQSHP
FROM
    INVOICE_ITEMS
LEFT JOIN
    INVOICES ON INVOICES.INNUM = INVOICE_ITEMS.II_INNUM
WHERE
    INVOICES.IN_DATE BETWEEN '2010-08-29' AND '2010-08-30';
I have very limited knowledge of SQL, but I'm trying to understand some of the concepts like subqueries and the like. I'm not looking for a redesign of this code, but rather an explanation of why it is so slow (600+ seconds on my test database) and how I can make it faster.
From my understanding, the left join is creating a virtual table and populating it with every result row from the join, meaning that it is processing every row. How would I stop the query from reading the table completely and just finding the WHERE/BETWEEN clause first, then creating a virtual table after that (if it is possible)?
How is my logic? Are there any consistently recommended resources to get me to SQL ninja status?
Edit: Thanks everyone for the quick and polite responses. Currently, I'm connecting over ODBC to a proprietary database that is used in the rapid application development framework called OMNIS. Therefore, I really have no idea what sort of optimization is being run, but I believe it is based loosely on MSSQL.
I would rewrite it like this, and make sure you have indexes on i.INNUM, ii.II_INNUM, and i.IN_DATE. The LEFT JOIN is being turned into an INNER JOIN by your WHERE clause, so I rewrote it as such:
SELECT ii.II_IVNUM, ii.IIQSHP
FROM INVOICE_ITEMS ii
INNER JOIN INVOICES i ON i.INNUM = ii.II_INNUM
WHERE i.IN_DATE BETWEEN '2010-08-29' AND '2010-08-30'
Depending on what database you are using, what may be happening is that all of the records from INVOICE_ITEMS are being joined (due to the LEFT JOIN), regardless of whether there is a match in INVOICES or not, and then the WHERE clause filters down to the ones that matched and had a date within range. By switching to an INNER JOIN, you may make the query more efficient, since the WHERE clause only needs to be applied to INVOICES records that have a matching INVOICE_ITEMS record.
Since that is a very basic query, the optimizer should do fine with it; most likely your problem is incorrect indexing. Do you have indexes on the IN_DATE field and the INVOICE_ITEMS.II_INNUM field? If you have properly set up PK/FK relationships, INVOICES.INNUM should already be indexed, but FKs are not indexed automatically.
Your query is fine, it's the indexes you have to look at.
Are INVOICES.INNUM and INVOICE_ITEMS.II_INNUM indexed?
If not, SQL has to do something called a 'scan' - it searches every single record.
You can think of indexes as like the tabs on the side of a phone book - you know where to start looking for people based on the first letters of their surname. Without an index (say you want to look for names that end in '...son') you have to search the entire book.
There are different types of index - they can be ordered (like the phone book index - all ordered by surname) or not (like the index at the back of a book - there's an overhead in finding the index and then the actual page).
You should also be able to view the query plan - this is how the server executes the SQL statement. That can tell you all sorts of more advanced things - for instance, there are multiple ways to do the job: a merge join is possible if both tables are sorted by the join field, or a nested-loop join will loop through the smaller table for every record in the larger table.
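If the back end really does behave like MSSQL, something along these lines returns the estimated plan instead of running the query; whether this works through the OMNIS/ODBC layer mentioned in the edit is an open question:
SET SHOWPLAN_ALL ON;   -- return the estimated plan instead of executing
GO
SELECT ii.II_IVNUM, ii.IIQSHP
FROM INVOICE_ITEMS ii
INNER JOIN INVOICES i ON i.INNUM = ii.II_INNUM
WHERE i.IN_DATE BETWEEN '2010-08-29' AND '2010-08-30';
GO
SET SHOWPLAN_ALL OFF;
GO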
Well, there is no reason why this query should be slow... the only thing that comes to mind is: do you have indexes on INVOICES.INNUM and INVOICE_ITEMS.II_INNUM? If you add them it could speed up the select, but it would slow down updates/inserts...
A join doesn't create a "virtual table" on anything more than just a conceptual level.
The performance issue with your query most likely lies in poor or insufficient indexing. You should have indexes on:
INVOICE_ITEMS.II_INNUM
INVOICES.IN_DATE
You should also have an index on INVOICES.INNUM, but if that's the primary key of the table then it already has one.
Also, don't use a left join here. If there's a foreign key between INVOICE_ITEMS.II_INNUM and INVOICES.INNUM (and INVOICE_ITEMS.II_INNUM is not nullable), then you'll never encounter a record in INVOICE_ITEMS that won't match up to a record in INVOICES. Even if there were, your WHERE condition is using a value from INVOICES, so you'd eliminate any unmatched rows anyway. Just use a regular JOIN.
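In standard DDL, the indexes suggested in the answers above would look roughly like this (the index names are arbitrary):
-- Supports the join from INVOICE_ITEMS back to INVOICES.
CREATE INDEX IX_INVOICE_ITEMS_II_INNUM ON INVOICE_ITEMS (II_INNUM);

-- Supports the date-range filter in the WHERE clause.
CREATE INDEX IX_INVOICES_IN_DATE ON INVOICES (IN_DATE);

-- Only needed if INNUM is not already the primary key of INVOICES.
CREATE INDEX IX_INVOICES_INNUM ON INVOICES (INNUM);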

TSQL Join efficiency

I'm developing an ASP.NET/C#/SQL application. I've created a query for a specific grid-view that involves a lot of joins to get the data needed. On the hosted server, the query has randomly started taking up to 20 seconds to process. I'm sure it's partly an overloaded host server (because sometimes the query takes <1s), but I don't think the query (which is actually a view referenced via a stored procedure) is at all optimal regardless.
I'm unsure how to improve the efficiency of the below query:
(There are about 1500 matching records to those joins, currently)
SELECT dbo.ca_Connections.ID,
dbo.ca_Connections.Date,
dbo.ca_Connections.ElectricityID,
dbo.ca_Connections.NaturalGasID,
dbo.ca_Connections.LPGID,
dbo.ca_Connections.EndUserID,
dbo.ca_Addrs.LotNumber,
dbo.ca_Addrs.UnitNumber,
dbo.ca_Addrs.StreetNumber,
dbo.ca_Addrs.Street1,
dbo.ca_Addrs.Street2,
dbo.ca_Addrs.Suburb,
dbo.ca_Addrs.Postcode,
dbo.ca_Addrs.LevelNumber,
dbo.ca_CompanyConnectors.ConnectorID,
dbo.ca_CompanyConnectors.CompanyID,
dbo.ca_Connections.HandOverDate,
dbo.ca_Companies.Name,
dbo.ca_States.State,
CONVERT(nchar, dbo.ca_Connections.Date, 103) AS DateView,
CONVERT(nchar, dbo.ca_Connections.HandOverDate, 103) AS HandOverDateView
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
INNER JOIN dbo.ca_Companies ON dbo.ca_CompanyConnectors.CompanyID = dbo.ca_Companies.ID
INNER JOIN dbo.ca_States ON dbo.ca_Addrs.StateID = dbo.ca_States.ID
It may have nothing to do with your query and everything to do with the data transfer.
How fast does the query run in query analyzer?
How does this compare to the web page?
If you are bringing back the entire data set you may want to introduce paging, say 100 records per page.
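If the server is SQL Server 2012 or later, OFFSET/FETCH is the simplest way to page (the view name below is a guess standing in for whatever the stored procedure actually selects from); on older versions a ROW_NUMBER() wrapper achieves the same thing:
-- Fetch one page of 100 rows at a time instead of the full result set.
SELECT *
FROM dbo.vw_ConnectionsGrid      -- hypothetical name for the existing view
ORDER BY ID
OFFSET 0 ROWS                    -- page 1; use 100, 200, ... for later pages
FETCH NEXT 100 ROWS ONLY;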
The first thing I normally suggest is to profile and look for potential indexes to help out. But when the problem is sporadic like this, and the normal case is for the query to run in <1 sec, it's more likely due to lock contention than a missing index. That means the cause is something else in the system making this query take longer. Perhaps an insert or update. Perhaps another select query, one that you would normally expect to take a little longer, so the extra time on its end isn't noticed.
I would normally start with indexing, but my database belongs to a third-party application, so creating my own indexes is not an option. I read an article (sorry, can't find the reference) recommending breaking the query up into table variables or temp tables (depending on the number of records) when you have many tables in your query (not sure what the magic number is).
Start with dbo.ca_CompanyConnections, dbo.ca_CompanyConnectors, and dbo.ca_Connections. Include the fields you need, and then substitute these three joined tables with just the temp table.
I'm not sure what the underlying issue is (I'd like to hear recommendations), but it seems like performance starts to drop once you get over about 5 tables.
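Roughly, that suggestion might look like the sketch below (the temp table name #ConnectionCore and the trimmed column list are my own choices):
-- Materialize the first three joined tables once.
SELECT c.ID,
       c.Date,
       c.AddressID,
       c.HandOverDate,
       c.ElectricityID,
       c.NaturalGasID,
       c.LPGID,
       c.EndUserID,
       cr.ConnectorID,
       cr.CompanyID
INTO #ConnectionCore
FROM dbo.ca_CompanyConnections AS cc
INNER JOIN dbo.ca_CompanyConnectors AS cr ON cc.CompanyID = cr.CompanyID
INNER JOIN dbo.ca_Connections AS c ON cc.ConnectionID = c.ID;

-- Then join the remaining tables to the temp table instead of the original three.
SELECT cc.*, a.Suburb, a.Postcode, co.Name, s.State
FROM #ConnectionCore AS cc
INNER JOIN dbo.ca_Addrs AS a ON cc.AddressID = a.ID
INNER JOIN dbo.ca_Companies AS co ON cc.CompanyID = co.ID
INNER JOIN dbo.ca_States AS s ON a.StateID = s.ID;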