Fundamentally understanding 3 or more table SQL joins - sql

I apologize for the long question in advance. Most online articles don't go over this; they just show a quick result set. For such an important and commonly used idea, I want to fully understand it. I've seen a lot of the posts on here with specific examples, but none got the core idea into my head. My question is: when you do a 3+ table join, how does this work in memory? The statement I'm currently using is:
select a.cust_id, a.[first name],a.[last name],a.[primary zip],c.jerseynum
from contact as a
join notes as b
on a.cust_id = b.cust_id
join jerseytable as c
on a.cust_id = c.cust_id
So after the first join between a and b we get a result set; we'll call it 1.
I then do a join on a and c... this is where it gets fuzzy for me. This result set doesn't just take the place of my previous join, does it? Does it only add the records to 1 that fit the join between a and c?

You're basically asking how a database does its query execution. There's a lot of theory and practice in this area, more than a single answer can give you.
The query engine has a lot of tools at its disposal, depending on the joins, the indexes, and the statistics it keeps. It can construct in-memory tables and, in some cases, reorder joins to better limit the number of rows returned. It might compute the results of the different joins separately and merge them together.
Read up on query plans to get started: http://en.wikipedia.org/wiki/Query_plan and the related section on query optimization.

JOIN is a relational operator: it takes two relations as parameters and the result is another relation.
Relational operators can be strung together. Consider your query written in the relational language Tutorial D:
Assuming x and y are suitably declared relation variables (relvars):
x := a MATCHING b;
y := x JOIN c {jerseynum};
Alternatively:
y := a JOIN c {jerseynum};
x := y MATCHING b;
However, the above forces an order of execution on the optimizer: assigning the intermediate results to relvars is essentially telling the optimizer how to do its job (i.e. not good). The operations can instead be strung together, e.g. as follows:
a MATCHING b JOIN c {jerseynum};
The SQL FROM clause works in a similar way, i.e. there is no need to assign to intermediate (derived) tables. The optimizer is free to evaluate the joins in any order it sees fit. Trust the optimizer :)
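For illustration, here is the original query rewritten with the intermediate result made explicit as a derived table (a sketch: equivalent in result to the flat version, though the optimizer remains free to execute either form however it likes):
select x.cust_id, x.[first name], x.[last name], x.[primary zip], c.jerseynum
from (
    -- the intermediate result of the first join (a joined with b)
    select a.cust_id, a.[first name], a.[last name], a.[primary zip]
    from contact as a
    join notes as b on a.cust_id = b.cust_id
) as x
join jerseytable as c on x.cust_id = c.cust_id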

a is joined with b, then that result set is joined with c.
(If you use MS SQL Server, you can see this process in the query execution plan.)

After it has parsed your query, the database engine will generate a plan which describes the actual steps to be taken to get the results of the query. You should examine your actual plan to get an insight into what is really going on.
Basically, the optimizer will choose the order of the joins regardless of the order you wrote them in the SQL. The actual order of the joins will depend, among other things, on the indexes and on the statistics kept on the data.
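In SQL Server, for example, you could capture the estimated plan for the original query like this (plain T-SQL; in Management Studio the "Display Estimated Execution Plan" button does the same thing, and other engines offer EXPLAIN):
SET SHOWPLAN_XML ON;
GO
select a.cust_id, a.[first name], a.[last name], a.[primary zip], c.jerseynum
from contact as a
join notes as b on a.cust_id = b.cust_id
join jerseytable as c on a.cust_id = c.cust_id;
GO
SET SHOWPLAN_XML OFF;
GO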
See this article on query optimizers to get started: http://research.microsoft.com/pubs/76059/pods98-tutorial.pdf

Related

Several layers of nested subqueries with Exists/In, best performance?

I'm working on some rather large queries for a search function. There are a number of different inputs and the queries are pretty big as a result. It's grown to where there are nested subqueries 2 layers deep. Performance has become an issue on the ones that will return a large dataset and likely have to sift through a massive load of records to do so. The ones that have less comparing to do perform fine, but some of these are getting pretty bad.

The database is DB2 and has all of the necessary indexes, so that shouldn't be an issue. I'm wondering how to best write/rewrite these queries to perform, as I'm not quite sure how the optimizer is going to handle it. I obviously can't dump the whole thing here, but here's an example:
Select A, B
from TableA
--A series of joins--
WHERE TableA.A IN (
    Select C
    from TableB
    --A few joins--
    WHERE TableB.C IN (
        Select D
        from TableC
        --More joins and conditionals--
    )
)
There are also plenty of conditionals sprinkled throughout, the vast majority of which are simple equality. You get the idea. The subqueries do not provide any data to the initial query. They exist only to filter the results.

A problem I ran into early on is that the backend is written to contain a number of partial query strings that get assembled into the final query (with 100+ possible combinations due to the search options, it simply isn't feasible to write a query for each), which has complicated the overall method a bit.

I'm wondering if EXISTS instead of IN might help at one or both levels, or another bunch of joins instead of subqueries, or perhaps using WITH above the initial query for TableC, etc. I'm definitely looking to remove bottlenecks and would appreciate any feedback that folks might have on how to handle this.
I should probably also add that there are potential unions within both subqueries.
It would probably help to use inner joins instead.
Select A, B
from TableA
inner join TableB on TableA.A = TableB.C
inner join TableC on TableB.C = TableC.D
Databases were designed for joins, but the optimizer might not figure out that it can use an index for a sub-query. Instead it will probably try to run the sub-query, hold the results in memory, and then do a linear search to evaluate the IN operator for every record.
Now, you say that you have all of the necessary indexes. Consider this for a moment.
If one optional condition is TableC.E = 'E' and another optional condition is TableC.F = 'F',
then a query with both would need an index on the fields TableC.E AND TableC.F. Many young programmers today think they can have one index on TableC.E and a separate index on TableC.F, and that's all they need. In fact, if you have both fields in the query, you need a composite index covering both fields.
So, for 100+ combinations, "all of the necessary indexes" could require 100+ indexes.
Now, an index on (TableC.E, TableC.F) could be used in a query with a TableC.E condition and no TableC.F condition, but could not be used when there is a TableC.F condition and no TableC.E condition.
Hundreds of indexes? What am I going to do?
In practice it's not that bad. Let's say you have N optional conditions which are either in the where clause or not. The number of combinations is 2^N; put another way, for hundreds of combinations, N is log2 of the number of combinations, which is between 6 and 10. Also, those conditions are spread across three tables. Some databases support multi-table indexes, but I'm not sure DB2 does, so I'd stick with single-table indexes.
So, what I am saying is, say for the TableC.E, and TableC.F example, it's not enough to have just the following indexes:
TableB ON C
TableC ON D
TableC ON E
TableC ON F
For one thing, the optimizer has to pick which one of the last three indexes to use. Better would be to include the D field in the last two indexes, which gives us
TableB ON C
TableC ON D, E
TableC ON D, F
Here, if neither field E nor F is in the query, it can still index on D, but if either one is in the query, it can index on both D and one other field.
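In DDL terms, the suggestion would look something like this (the index names are just illustrative):
CREATE INDEX ix_tableb_c ON TableB (C);
CREATE INDEX ix_tablec_d_e ON TableC (D, E);
CREATE INDEX ix_tablec_d_f ON TableC (D, F);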
Now suppose you have 10 fields which may or may not be in the query. Why ever have just one field in an index? Why not add the other fields in descending order of their likelihood of being in the query?
Consider that when planning your indexes.
I found that the IN predicate is good for small subqueries, and EXISTS for large subqueries.
Try executing the query with an EXISTS predicate for the large ones:
SELECT A, B
FROM TableA
WHERE EXISTS (
    SELECT C
    FROM TableB
    WHERE TableB.C = TableA.A
)
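Applied to the two-layer shape from the question, that might look like the sketch below (the correlations are assumed from the original IN conditions, and the elided joins and conditionals stay elided):
SELECT A, B
FROM TableA
WHERE EXISTS (
    SELECT 1
    FROM TableB
    WHERE TableB.C = TableA.A
    AND EXISTS (
        SELECT 1
        FROM TableC
        WHERE TableC.D = TableB.C
        --More joins and conditionals--
    )
)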

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't matter for the results, but does it affect performance?
One can assume that the data in TABLE_C is much larger than in A or B.
For each SQL statement, the engine will create a query plan. So no matter how you order the tables, the engine will choose a correct path to build the query.
More on plans here: http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the query order and plan using hints, if you feel that the engine is not choosing the correct path.
Sometimes the order of the tables does make a difference (when you are using different join types).
Conceptually, joins are built on the cross-product idea and are evaluated pairwise. A query like A join B join C is treated as ((A join B) join C): the first result comes from joining tables A and B, and that result is then joined with table C.
So if inner joining A (100 records) with B (200 records) yields 100 records, those 100 records are then compared against the 1000 records of C.
No.
Well, there is a very, very tiny chance of this happening; see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly with the number of tables, and there's not enough time for the optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.

Optimize Joins (more than 2 tables, where filter)

A couple of Sybase database query questions:
If I do a join and have a where clause, will the filter be applied prior to the actual join? In other words, is it faster than a join without any where conditions?
I have an example involving 3 tables (with columns listed below):
A: O1, O2, ...
B: E1, E2, ...
C: O1, E1, E2
So my join looks like:
select A.*, B.*
from B, C, A
where C.E1 = B.E1 and C.E2 = B.E2 and C.O1 = A.O1
and A.O2 in (...)
and B.E3 in (...)
Would my joins be significantly faster if I eliminated C and added O1 to table B instead?
B: E1, E2, O1, ...
First, you should use proper join syntax:
select A.*, B.*
from B
join C on C.E1 = B.E1 and C.E2 = B.E2
join A on C.O1 = A.O1
where A.O2 in (...) and B.E3 in (...)
The "," means cross join and it is prone to problems since it is easily missed. Your query becomes much different if you say "from B C, A". Also, it gives you access to outer joins, and makes the statement much more readable.
The answer to your question is "yes". Your query will run faster if there are fewer tables being joined. If you are doing a join on a primary key, and the tables fit into memory, then the join is not expensive. However, it is still faster to just find the data in the record in B.
That said, there could be some boundary cases in some databases where this is not necessarily true. For instance, if there is only one value in the column, and the column is a long string, then adding the column onto B's pages could increase the number of pages needed for B. Such extreme cases are unlikely, though, and you should see performance improvements.
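For comparison, if O1 were moved onto B as the question proposes, the two-table version would be a sketch like this (no join to C at all):
select A.*, B.*
from B
join A on B.O1 = A.O1
where A.O2 in (...) and B.E3 in (...)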
Speed will generally depend on the number of rows the SQL server has to read.
I don't think it makes any difference whether the filter is in a where clause or a join condition.
Whether eliminating C helps depends... on how many rows are in the eliminated table.
It can depend on the order you add the joins or where clauses in. E.g. if there are only a few rows in C, and you add that first as a table or where clause, it immediately cuts down on the number of matches that are possible. If, however, there are millions of rows in C, then you have to read millions to find the matches.
Modern optimizers can rearrange your query to be more efficient, but don't rely on it.
What can really cut down the number of rows read is adding indexes to the join columns - if you have an index on A.O1 AND on C.O1, then it can cut down massively on the number of reads.
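For example (index names illustrative; adjust to your schema):
CREATE INDEX ix_a_o1 ON A (O1);
CREATE INDEX ix_c_o1 ON C (O1);
-- and likewise for the other join columns:
CREATE INDEX ix_c_e1_e2 ON C (E1, E2);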

SQL Server Logical Query Processing - how does it manage the huge datasets?

I'm doing some reading on SQL Server performance:
http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139/ref=sr_1_6?ie=UTF8&s=books&qid=1267032068&sr=8-6
One of the surprising things I came across was how it processes the "FROM" phase in its Logical Processing. From what I understand, SQL Server will do the following:
1) For the first two tables, it will create a virtual table (VT1) consisting of a Cartesian join of the two tables
2) For every additional table, it will create a Cartesian join of VT1 and the additional table, with the result becoming VT1
I'm sure there is a lot more to it under the covers, but at face value, this seems like it would involve a huge amount of processing/memory if you're dealing with big tables (and big queries).
I was just wondering whether anyone had a quick explanation of how SQL Server is able to do this in any sort of realistic time/space frame?
The Cartesian join is just a description of the result, not an actual result. After the full Cartesian join of tables A, B, C...X, the filter operators are applied (still as a definition): things like the ON clauses of the joins and the WHERE clauses of the query. In the end this definition is in turn transformed into an execution plan, which will contain physical operators like Nested Loops or Hash Join or Merge Join, and these operators, when iterated, will produce the results as requested in the query definition.
So the big 100x100x100x100... Cartesian cube is never materialized; it is just a definition.
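As an illustration (with hypothetical tables A and B), these two queries are the same logical definition, and neither causes the engine to materialize the cross product:
SELECT a.id, b.val
FROM A AS a
JOIN B AS b ON b.a_id = a.id;

-- logically defined as a filter over the Cartesian product:
SELECT a.id, b.val
FROM A AS a
CROSS JOIN B AS b
WHERE b.a_id = a.id;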
If you are really interested in how SQL Server does what it does, please read this book:
http://www.amazon.com/Microsoft-SQL-Server-2008-Internals/dp/0735626243/ref=sr_1_1?ie=UTF8&s=books&qid=1267033666&sr=8-1
In reality, the optimiser will look at the whole query: estimated rows, statistics, constraints, etc.
Logically, though, it does happen in the order mentioned.
Contrived example:
SELECT
    BT.col1, LT.col2
FROM
    BigTable BT
JOIN
    LittleTable LT ON BT.FKCol = LT.PKCol
WHERE
    LT.PKCol = 2
ORDER BY
    BT.col1
The Cartesian product of BT and LT could be hundreds of millions of rows.
But the optimiser:
knows PKCol is unique, so it expects only one row
can use statistics to estimate the number of rows from BT
looks for indexes (e.g. a covering index on BT for FKCol INCLUDE col1)
will probably apply the WHERE first
will look ahead for an ORDER BY or GROUP BY, for example, to see if it can save some spooling (re-sorting)
I don't know the resource you are reading, but what you describe is the behavior of:
SELECT ... FROM tableA, tableB, tableC, ....
This uses a Cartesian join (also called a cross join) and is very expensive. With large enough datasets, SQL Server (or any RDBMS) can't do this in any sort of realistic time/space frame.
Using an ON clause and specifying the JOIN type performs vastly better:
SELECT ... FROM tableA JOIN tableB on tableB.a_id = tableA.a_id
In real applications cross joins should be rare or at least limited to very small datasets. For many applications it's not uncommon to never have a cross join.

simple SQL query

Which one is faster:
select * from parents p
inner join children c on p.id = c.pid
where p.x = 2
OR
select * from
(select * from parents where x = 2) p
inner join children c on p.id = c.pid
where p.x = 2
In MySQL, the first one is faster:
SELECT *
FROM parents p
INNER JOIN
children c
ON c.pid = p.id
WHERE p.x = 2
since using an inline view implies generating and passing the records twice.
In other engines, they are usually optimized to use one execution plan.
MySQL is not very good at parallelizing and pipelining result streams.
Like this query:
SELECT *
FROM mytable
LIMIT 1
is instant, while this one (which is semantically identical):
SELECT *
FROM (
    SELECT *
    FROM mytable
) t
LIMIT 1
will first select all values from mytable, buffer them somewhere and then fetch the first record.
For Oracle, SQL Server and PostgreSQL, the queries above (and both of your queries) will most probably yield the same execution plans.
I know this is a simple case, but your first option is much more readable than the second one. As long as the two query plans are comparable, I'd always opt for the more maintainable SQL code, which your first example is for me.
It depends on how good the database is at optimising the query.
If the database manages to optimise the second one into the first one, they are equally fast, otherwise the first one is faster.
The first one gives more freedom for the database to optimise the query. The second one suggests a specific order of doing things. Either the database is able to see past this and optimise it into a single query, or it will run the query as two separate queries with the subquery as an intermediate result.
A database like SQL Server keeps statistics on what the database tables contain, which it uses to determine how to execute the query in the most efficient way. For example, depending on what will eliminate the most records, it can either start with joining the tables or with filtering the parents table on the condition. If you write a query that forces a specific order, that might not be the most efficient order.
I'd think the first. I'm not sure if the optimizer would use any indexes on the derived table in the second query, or if it would copy all the matching rows out into memory before joining back to the children.
This is why you have DBAs. It depends entirely on the DBMS, and how your tables and indexes are configured, as to which one runs the fastest.
Database tuning is not a set-and-forget operation; it should be done regularly, as the data changes, to ensure your database runs at peak performance. The question is not really meaningful without specifying:
which DBMS you are asking about.
what indexes you have on the tables.
a host of other possible configuration items (which may also depend on the DBMS, such as clustering).
You should run both of those queries through the query optimizer to see which one is fastest, then start using that one. That's assuming the difference is noticeable in the first place. If the difference is minimal, go for the easiest to read/maintain.
For me, the second query is saying: I don't trust the optimizer to optimize this query, so I'll provide some 'hints'.
I'd say trust the optimizer until it lets you down, and only then consider trying to do the optimizer's job for it.