Many to one joins - sql

Let's say I have the following tables:
create table table_a
(
id_a integer,
name_a varchar(100),
primary key (id_a)
);
create table table_b
(
id_b integer,
id_a integer not null, -- (Edit)
name_b varchar(100),
primary key (id_b),
foreign key (id_a) references table_a (id_a)
);
We can create a join view on these tables in a number of ways:
create view join_1 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_a a, table_b b
where a.id_a = b.id_a
);
create view join_2 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_b b left outer join table_a a
on a.id_a = b.id_a
);
create view join_3 as
(
select
b.id_b,
b.id_a,
b.name_b,
(select a.name_a from table_a a where a.id_a = b.id_a) as name_a
from table_b b
);
Here we know:
(1) there is at least one row in table_a with a matching id_a (due to the foreign key and NOT NULL constraint in table_b), AND
(2) there is at most one row in table_a with that id_a (due to the primary key on table_a),
so there is exactly one row in table_a that matches each row of table_b in the join.
Now consider the following SQL:
select id_b, name_b from join_X;
Note that this selects no columns from table_a, and because we know that in this join each row of table_b matches exactly one row of table_a, we really shouldn't have to look at table_a when performing the above select.
So what is the best way to write the above join view?
Should I just use the standard join_1, and hope the optimizer figures out that there is no need to access table_a based on the primary and foreign keys?
Or is it better to write it like join_2 or even join_3, which makes it more explicit that there is exactly one row from the join for each row from table_b?
Edit + additional question
Is there any time I should prefer a sub-select (as in join_3) over a normal join (as in join_1)?

Why are you using views at all for this purpose? If you want to get data from the table, get it from the table.
Or, if you need a view to translate some columns in the tables (such as coalescing NULLs into zeros), create a view over just the B table. This will also apply if some DBA wanna-be has implemented a policy that all selects must be via views rather than tables :-)
In both those cases, you don't have to worry about multi-table access.

Intuitively, I'd think join_1 will perform slightly slower, because your assumption that the optimiser can transform the join away is wrong as you didn't declare the table_b.id_a column to be NOT NULL. In fact, this means that (1) is wrong. table_b.id_a can be NULL. Even if you know it can't be, the optimiser doesn't know that.
As far as join_2 and join_3 are concerned, depending on your database, optimisation might be possible. The best way to find out is to run (Oracle syntax)
EXPLAIN PLAN FOR select id_b, name_b from join_X;
and study the execution plan. It will tell you whether table_a was joined or not. On the other hand, if your view should be reusable, then I'd go for the plain join and forget about premature optimisations. You can achieve better results with proper statistics and indexes, as a join operation is not always that expensive. But that depends on your statistics, of course.
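If you don't have an Oracle instance handy, the same experiment can be sketched with SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. This is only a stand-in for the real database: plan output and join-elimination behaviour differ by engine, and in this run SQLite still visits both tables (it does not eliminate the inner join, unlike SQL Server as discussed below).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table table_a (id_a integer primary key, name_a text);
create table table_b (
    id_b integer primary key,
    id_a integer not null references table_a (id_a),
    name_b text
);
create view join_1 as
select b.id_b, b.id_a, b.name_b, a.name_a
from table_a a, table_b b
where a.id_a = b.id_a;
""")

# EXPLAIN QUERY PLAN returns one row per plan step;
# the fourth column is the human-readable description.
plan = con.execute(
    "explain query plan select id_b, name_b from join_1"
).fetchall()
for step in plan:
    print(step[3])
```

Both table_a and table_b show up in the steps, which is exactly the kind of thing the answer suggests checking before trusting any optimiser.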

1 + 2 are effectively identical under SQL Server.
I have never used [3], but it looks quite odd. I would strongly suspect the optimizer will make it equivalent to the other 2.
It's a good exercise to run all 3 statements and compare the execution plans produced.
So given identical performance, the clearest to read gets my vote - [2] is the standard where it is supported, otherwise [1].
In your case, if you don't want any columns from A, why include Table_A in the statement anyway?
If it's simply a filter - i.e. only include rows from Table B where a row exists in Table A, even though I don't want any cols from Table A - then all 3 syntaxes are fine, although you may find that use of EXISTS is more performant in some DBs:
SELECT * FROM Table_B b WHERE EXISTS (SELECT 1 FROM Table_A a WHERE b.id_a = a.id_a)
although in my experience this is usually equivalent in performance to any of the others.
You also ask when you would choose a subquery over the other expressions. This boils down to whether it's a CORRELATED subquery or not.
Basically - a correlated subquery has to be run once for every row in the outer statement - this is true of the above - for every row in Table B you must run the subquery against Table A.
If the subquery can be run just once
SELECT * from Table_B b WHERE b.id_a IN (SELECT a.id_a FROM Table_A a WHERE a.id_a > 10)
Then the subquery is generally more performant than a join - although I suspect that some optimizers will still be able to spot this and reduce both to the same execution plan.
Again, the best thing to do is to run both statements, and compare execution plans.
Finally and most simply - given the FK you could just write:
SELECT * From Table_B b WHERE b.id_a IS NOT NULL
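Given the foreign key plus NOT NULL, the EXISTS filter and the IS NOT NULL filter really do select the same rows. A quick sqlite3 sketch of that equivalence (table and column names follow the question; sqlite3 stands in for your real DB, and note SQLite only enforces foreign keys when asked):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("pragma foreign_keys = on")  # SQLite enforces FKs only when asked
con.executescript("""
create table table_a (id_a integer primary key, name_a text);
create table table_b (
    id_b integer primary key,
    id_a integer not null references table_a (id_a),
    name_b text
);
insert into table_a values (1, 'alpha'), (2, 'beta');
insert into table_b values (10, 1, 'x'), (11, 1, 'y'), (12, 2, 'z');
""")

exists_rows = con.execute("""
    select id_b from table_b b
    where exists (select 1 from table_a a where a.id_a = b.id_a)
    order by id_b
""").fetchall()

notnull_rows = con.execute(
    "select id_b from table_b b where b.id_a is not null order by id_b"
).fetchall()

print(exists_rows == notnull_rows)  # True: the constraints make the filters equivalent
```

Which of the two forms is cheaper is still a per-engine question; the point here is only that the constraints guarantee the same result set.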

It will depend on the platform.
SQL Server both analyses the logical implications of constraints (foreign keys, primary keys, etc.) and expands VIEWs inline. This means that the 'irrelevant' portion of the VIEW's code is optimised away. SQL Server would give the exact same execution plan for all three cases. (Note: there is a limit to the complexity that the optimiser can handle, but it can certainly handle this.)
Not all platforms are created equal, however.
- Some may not analyse constraints in the same way, assuming you coded the join for a reason
- Some may pre-compile the VIEW's execution/explain plan
As such, to determine behaviour, you must be aware of the specific platform's capabilities. In the vast majority of situations the optimiser is a complex beast, and so the best test is simply to try it and see.
EDIT
In response to your extra question, are correlated sub-queries ever preferred? There is no simple answer, as it depends on the data and the logic you are trying to implement.
There are certainly cases in the past where I have used them, both to simplify a query's structure (for maintainability) and to enable specific logic.
If the field table_b.id_a referenced many entries in table_a, you may only want the name from the latest one. And you could implement that using (SELECT TOP 1 name_a FROM table_a WHERE id_a = table_b.id_a ORDER BY id_a DESC).
In short, it depends.
- On the query's logic
- On the data's structure
- On the code's final layout
More often than not I find it's not needed, but measurably often I find that it is a positive choice.
Note:
Depending on the correlated sub-query, it doesn't actually always get executed 'once for every record'. SQL Server, for example, expands the required logic to be executed in-line with the rest of the query. It's important to note that SQL code is processed/compiled/whatever before being executed. SQL is simply a method for articulating set based logic, which is then transformed into traditional loops, etc, using the most optimal algorithms available to the optimiser.
Other RDBMS may perform differently, due to the capabilities or limitations of the optimiser. Some RDBMS perform well when using IN (SELECT blah FROM blah), or when using EXISTS (SELECT * FROM blah), but some perform terribly. The same applies to correlated sub-queries. Some perform exceptionally well with them, some don't perform so well, but most handle them very well in my experience.

Related

JOIN only if boolean is true

Is there a way to hint to Postgres that it should only attempt to do a JOIN lookup if in_table_b is set? This is purely a performance optimization and would not change the results.
Table A:
id Serial
in_table_b Boolean
Table B:
id Int (Foreign Key A.id)
foobar Text
Current query:
SELECT A.id, A.in_table_b, B.foobar FROM A LEFT JOIN B ON A.id = B.id;
I'd like a clause that effectively says: do not try to do an index lookup on B if in_table_b is false. However, I do want to return rows in A even if there is no matching row in B.
You can't really (and, in general, shouldn't try to) control how the database executes the query. SQL is a descriptive language: you tell the database the results that you want, and trust it to make the best decision in terms of execution plan - database designers put lot of effort into building engines that figure out the best possible strategy.
I would just add another condition to the on clause of the left join; this is the simplest way to express what you want:
select a.id, a.in_table_b, b.foobar
from a
left join b on a.id = b.id and a.in_table_b;
I would recommend an index on a(in_table_b, id): this might somehow hint the query planner on how to proceed (the ordering of columns in the index is important here). However, it might also consider that a boolean column is not a very good pick, because the cardinality is not good (there are only two possible values). The database might still go for the index lookup - and actually, this could very well be the most efficient option.
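A minimal sqlite3 sketch of that extra join condition (SQLite has no true boolean type, so in_table_b is stored as 0/1 here; Postgres would accept the query as written):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table a (id integer primary key, in_table_b integer not null);
create table b (id integer references a (id), foobar text);
insert into a values (1, 1), (2, 0);
insert into b values (1, 'hello');
""")

# The flag is part of the ON clause, so rows of a with in_table_b = 0
# are still returned, just with a NULL b side.
rows = con.execute("""
    select a.id, a.in_table_b, b.foobar
    from a
    left join b on a.id = b.id and a.in_table_b
    order by a.id
""").fetchall()
print(rows)  # [(1, 1, 'hello'), (2, 0, None)]
```

This confirms the semantics the answer relies on: the condition prunes the lookup into b without dropping any rows of a.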

SQL reduce data in join or where

I want to know what is faster, assuming I have the following queries and they retrieve the same data
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I don't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata has a visual explain too, which is simple to learn from.
Whether either method is advantageous depends on the data volumes and key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table build with ALL fields of tableB.
It should be (not that I recommend this query style at all):
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue, and then use the explain commands to work out why.
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
Whenever you come across such a scenario in Teradata and wonder which query would yield results faster, use the EXPLAIN plan, which will show exactly how the PE is going to retrieve records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query; you can't decide it. But you can do certain things, like declaring indexes, so that the DBMS takes those indexes into consideration when deciding which access path to use, and then you will get better performance.
For instance, in this example you are filtering tableB by columnX. Normally, if there are no indexes declared on tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX: in that case the DBMS will probably consider that index and determine an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
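The effect described above is easy to watch with EXPLAIN QUERY PLAN in SQLite; this is only a sketch (your DBMS's plan format and cost model will differ), but the scan-versus-index-search distinction is the same idea:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table tableB (id integer primary key, columnX integer)")

def plan(sql):
    # Fourth column of each EXPLAIN QUERY PLAN row is the readable step.
    return [row[3] for row in con.execute("explain query plan " + sql)]

query = "select * from tableB where columnX = 42"
before = plan(query)   # full scan of tableB
con.execute("create index idx_b_x on tableB (columnX)")
after = plan(query)    # now a search using idx_b_x
print(before, after)
```

The same query goes from a SCAN step to a SEARCH step using the new index, without any change to the SQL itself.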

Where (implicit inner join) vs. explicit inner join - does it affect indexing?

For the query
SELECT * from table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
or
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
Somehow, I would tend to create one index (id, status) on table_a for the top form,
whereas my natural tendency for the bottom form would be to create two separate indices,
id and status, on table_a.
The two queries are effectively the same, right? Would you index both the same way?
How would you index table_a (assuming this is the only query in the system, to avoid other considerations)? One or two indices?
The "traditional style" and the SQL 92 style inner join are semantically equivalent, and most DBMS will treat them the same (Oracle, for example, does). They will use the same execution plan for both forms (this is, nevertheless, implementation-dependent, and not guaranteed by any standard).
Hence, indexes are used the same way in both forms, too.
Independently of the syntax you use, the appropriate indexing strategy is implementation-dependent: some DBMS (such as Postgres) generally prefer single-column indexes and can combine them very efficiently, others, such as Oracle, can take more advantage from combined (or even covering) indexes (although both forms work for both DBMS of course).
Regarding the syntax of your example, the position of the second WHERE clause surprises me a little bit.
The following two queries are processed the same way in most DBMS:
SELECT * FROM table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
and
SELECT * FROM a JOIN b ON table_a.id = b.id WHERE table_a.status ='success'
However, your second query shifts the WHERE clause inside the FROM clause, which is not valid SQL in my view.
A quick check for
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
confirms: MySQL 5.5, Postgres 9.3, and Oracle 11g all yield a syntax error for it.
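SQLite behaves the same way; here is a sqlite3 sketch of the check (SQLite is an addition here, on the assumption that it rejects the misplaced WHERE like the engines the answer tested):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table table_a (id integer, status text);
create table b (id integer);
""")

# A JOIN clause belongs to FROM and cannot follow WHERE,
# so the parser rejects this statement outright.
try:
    con.execute(
        "select * from table_a "
        "where table_a.status = 'success' "
        "join b on table_a.id = b.id"
    )
    outcome = "parsed"
except sqlite3.OperationalError as exc:
    outcome = f"syntax error: {exc}"
print(outcome)
```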
The two queries should be optimized to perform the same way; however, the join syntax is ANSI compliant, and the older version should be deprecated. As far as index usage is concerned, you only want to touch a table (index) once. The RDBMS and tabular design you are using will determine the specifics as to whether or not you need to include the PRIMARY KEY (assuming that's what ID represents in your example) in a covering index. Also, SELECT * may or may not be covered; better to use specific column names.
Well, you ruled out other queries, but there are still open questions, particularly about the data distribution. E.g. how does the number of rows WHERE table_a.status = 'success' compare to the size of table_b? Depending on its estimates, the optimizer has to make two important decisions:
Which join algorithm to use (Nested Loops, Hash, or Sort/Merge)
In which order to process the tables
Unfortunately these decisions affect indexing (and are affected by indexing!).
Example: consider there is only one row WHERE table_a.status = 'success'. Then it would be fine to have an index on table_a.status to find that row quickly. Next, we'd like to have an index on table_b.id to find the corresponding rows quickly using a nested loops join. Considering that you select *, it doesn't make any sense to include additional columns in these indexes (not considering any other queries in the system).
But now imagine that you don't have an index on table_a.status but on table_a.id, and that this table is huge compared to table_b. For demonstration, let's assume table_b has only one row (an extreme case, of course). Now it would be better to go to table_b, fetch all rows (just one) and then fetch the corresponding rows from table_a using the index. You see how indexing affects the join order? (For a nested loops join, in this example.)
This is just one simple example of how things interact. Most databases have three join algorithms to choose from (except MySQL).
If you create the three mentioned indexes and look at how the database executes the join (explain plan), you'll note that one or two of the indexes remain unused for the specific join algorithm/join order selected for your query. In theory, you could drop those indexes. However, keep in mind that the optimizer makes its decision based on the statistics available to it, and that the optimizer's estimates might be wrong.
You can find more about indexing joins on my website: http://use-the-index-luke.com/sql/join

JOIN whole tables vs JOIN tables with fewer columns (fetched via subquery)

If I were to do a join between two tables, would there be any difference with respect to the fact I joined them as a whole, or I joined them after having extracted only the required columns (making the assumption that each table potentially has many columns)?
As an example, is
SELECT tableA.foreignKey, tableB.someValue
FROM tableA JOIN tableB ON tableA.foreignKey=tableB.key
any different from
SELECT tableA.foreignKey, tableB.someValue
FROM (SELECT foreignKey FROM tableA) tableA_filtered
JOIN (SELECT key, someValue FROM tableB) tableB_filtered
ON tableA_filtered.foreignKey=tableB_filtered.key
performance-wise?
Use the first one, since the second one uses subqueries, which create temporary tables for the result. And actually (SELECT foreignKey FROM tableA) makes no sense at all, because you are not aggregating or filtering any columns of the table.
Subqueries are sometimes evil, but not always. It depends upon the RDBMS you are using.
The general rule is that a subquery will always be slow.
Depending on the amount of data you are processing it could have a big impact.
Recently I removed a subquery from a large select with a lot of joins.
The SQL was processing about 100,000 rows, if not more.
Removing the very simple sub-select improved performance by 50 seconds.
Overall the SQL was taking two minutes, so it had a big impact.
I think in the case that the tables have many columns, the second query could be faster. But it's important to note that these two queries are not equivalent: the first one gives you all values from A and B, the second one only valueA from A and valueB from B! Anyway, it's more a theoretical question and hard to answer generally.
In reality I would leave this decision to the database optimizer. But if you really want to know whether there is a way to make it faster, the only safe way is to measure and compare the runtimes of both queries.
And as a side note, the second query will very probably be flattened by the rewrite engine of your DBMS, so it's the same as if you had written:
SELECT valueA, valueB FROM A, B WHERE A.valueA = B.valueB;
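A sqlite3 sketch confirming that the two forms from the question return identical rows (names follow the question's example; whether the engine actually flattens the derived tables is a separate, plan-level question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table tableA (foreignKey integer);
create table tableB ("key" integer, someValue text);
insert into tableA values (1), (2);
insert into tableB values (1, 'one'), (3, 'three');
""")

direct = con.execute("""
    select tableA.foreignKey, tableB.someValue
    from tableA join tableB on tableA.foreignKey = tableB."key"
""").fetchall()

# Same join, but each side pre-projected to the needed columns.
via_sub = con.execute("""
    select f.foreignKey, g.someValue
    from (select foreignKey from tableA) f
    join (select "key", someValue from tableB) g
      on f.foreignKey = g."key"
""").fetchall()

print(direct == via_sub)  # both queries return the same rows
```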

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't matter in the results, but does it affect the performance?
One can assume that the data in table C is much larger than in table A or B.
For each SQL statement, the engine will create a query plan. So no matter how you order them, the engine will choose a correct path to build the query.
More on plans: http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the query order and plan using hints, if you feel that the engine is not choosing the correct path.
Sometimes the order of the tables makes a difference (when you are using different joins).
Our joins actually work on the cross-product concept.
If you write a query like A join B join C,
it will be treated like ((A*B)*C):
first a result comes from joining tables A and B, and then that result is joined with table C.
So if inner joining A (100 records) and B (200 records) gives 100 records,
then those 100 records will be compared with the 1,000 records of C.
No.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.
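For the result-equality part of the question, a quick sqlite3 sketch (tables and column names mirror the question; reversing the order of the tables in FROM does not change which rows come back):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table table_a (propertyA integer);
create table table_b (propertyA integer, propertyB integer);
create table table_c (propertyB integer);
insert into table_a values (1), (2);
insert into table_b values (1, 10), (2, 20);
insert into table_c values (10), (99);
""")

cols = "a.propertyA, b.propertyB"
abc = con.execute(f"""
    select {cols} from table_a a
    join table_b b on a.propertyA = b.propertyA
    join table_c c on b.propertyB = c.propertyB
""").fetchall()

# Same joins, tables listed in the opposite order.
cba = con.execute(f"""
    select {cols} from table_c c
    join table_b b on b.propertyB = c.propertyB
    join table_a a on a.propertyA = b.propertyA
""").fetchall()

print(sorted(abc) == sorted(cba))  # same rows either way
```

Whether the two orderings produce the same plan is up to the optimizer; the rows they return are guaranteed to match for inner joins.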