Where (implicit inner join) vs. explicit inner join - does it affect indexing? - sql

For the query
SELECT * from table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
or
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
Somehow, I would tend to create one index (id, status) on table_a for the top form,
whereas my natural tendency for the bottom form would be to create two separate indexes,
one on id and one on status, on table_a.
The two queries are effectively the same, right? Would you index both the same way?
How would you index table_a (assuming this is the only query that exists in the system, to avoid other considerations)? One index or two?

The "traditional style" and the SQL 92 style inner join are semantically equivalent, and most DBMS will treat them the same (Oracle, for example, does). They will use the same execution plan for both forms (this is, nevertheless, implementation-dependent, and not guaranteed by any standard).
Hence, indexes are used the same way in both forms, too.
Independently of the syntax you use, the appropriate indexing strategy is implementation-dependent: some DBMS (such as Postgres) generally prefer single-column indexes and can combine them very efficiently; others, such as Oracle, can benefit more from combined (or even covering) indexes (although both forms work for both DBMS, of course).
Regarding the syntax of your example, the position of the WHERE clause in your second query surprises me a little.
The following two queries are processed the same way in most DBMS:
SELECT * FROM table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
and
SELECT * FROM a JOIN b ON table_a.id = b.id WHERE table_a.status ='success'
However, your second query moves the WHERE clause inside the FROM clause, which is not valid SQL in my view.
A quick check for
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
confirms: MySQL 5.5, Postgres 9.3, and Oracle 11g all yield a syntax error for it.
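The check above can be reproduced as a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for the DBMS named in the answer; the table_b name, the payload column and the sample rows are made up for illustration. It only shows that both valid forms return the same rows and that the form with the misplaced WHERE is rejected by the parser; it says nothing about how any other engine plans the queries.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, status TEXT);
    CREATE TABLE table_b (id INTEGER, payload TEXT);
    INSERT INTO table_a VALUES (1, 'success'), (2, 'failed'), (3, 'success');
    INSERT INTO table_b VALUES (1, 'x'), (2, 'y'), (4, 'z');
""")

# "Traditional" comma join with the join condition in WHERE.
implicit_sql = """
    SELECT table_a.id, table_a.status, table_b.payload
    FROM table_a, table_b
    WHERE table_a.id = table_b.id AND table_a.status = 'success'
    ORDER BY table_a.id
"""

# SQL-92 style explicit join; WHERE comes after the complete FROM clause.
explicit_sql = """
    SELECT table_a.id, table_a.status, table_b.payload
    FROM table_a JOIN table_b ON table_a.id = table_b.id
    WHERE table_a.status = 'success'
    ORDER BY table_a.id
"""

# The form from the question: WHERE placed before JOIN.
misplaced_where_sql = """
    SELECT * FROM table_a WHERE table_a.status = 'success'
    JOIN table_b ON table_a.id = table_b.id
"""

implicit_rows = conn.execute(implicit_sql).fetchall()
explicit_rows = conn.execute(explicit_sql).fetchall()

try:
    conn.execute(misplaced_where_sql)
    misplaced_where_rejected = False
except sqlite3.OperationalError:
    # SQLite reports a syntax error near JOIN.
    misplaced_where_rejected = True

print(implicit_rows == explicit_rows, misplaced_where_rejected)
```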

The two queries should be optimized to perform the same way; however, the explicit JOIN syntax is the ANSI SQL-92 form, and the older comma-separated form is best avoided in new code. As far as index usage is concerned, you only want to touch a table (index) once. The RDBMS and table design you are using will determine the specifics as to whether or not you need to include the PRIMARY KEY (assuming that's what ID represents in your example) in a covering index. Also, SELECT * may or may not be covered; better to use specific column names.

Well, you ruled out other queries, but there are still open questions, particularly about the data distribution. E.g. how does the number of rows WHERE table_a.status = 'success' compare to the size of table_b? Depending on its estimates, the optimizer has to make two important decisions:
Which join algorithm to use (Nested Loops, Hash or Sort/Merge)?
In which order to process the tables?
Unfortunately, these decisions affect indexing (and are affected by indexing!)
Example: consider there is only one row WHERE table_a.status = 'success'. Then it would be fine to have an index on table_a.status to find that row quickly. Next, we'd like to have an index on table_b.id to find the corresponding rows quickly using a nested loops join. Considering that you select *, it doesn't make any sense to include additional columns in these indexes (not considering any other queries in the system).
But now imagine that you don't have an index on table_a.status but on table_a.id, and that this table is huge compared to table_b. For demonstration, let's assume table_b has only one row (an extreme case, of course). Now it would be better to go to table_b, fetch all rows (just one) and then fetch the corresponding rows from table_a using the index. You see how indexing affects the join order? (for a nested loops join in this example)
This is just one simple example of how things interact. Most databases have three join algorithms to choose from (except MySQL).
If you create the three mentioned indexes and look at how the database executes the join (explain plan), you'll note that one or two of the indexes remain unused for the specific join algorithm and join order selected for your query. In theory, you could drop those indexes. However, keep in mind that the optimizer makes its decision based on the statistics available to it, and that the optimizer's estimates might be wrong.
You can find more about indexing joins on my web-site: http://use-the-index-luke.com/sql/join
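The "create the three indexes and look at the plan" experiment can be sketched as follows. This is only an illustration under assumptions: SQLite (via Python's sqlite3) stands in for whatever database you actually run, the index names and the payload column are hypothetical, and the data distribution is invented. SQLite only implements nested loops joins, so it shows the join-order side of the argument, not the choice between join algorithms.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, status TEXT);
    CREATE TABLE table_b (id INTEGER, payload TEXT);
    CREATE INDEX idx_a_status ON table_a (status);
    CREATE INDEX idx_a_id     ON table_a (id);
    CREATE INDEX idx_b_id     ON table_b (id);
""")

# Invented distribution: 1% of table_a rows have status 'success'.
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(i, "success" if i % 100 == 0 else "failed") for i in range(1000)],
)
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?)",
    [(i, "p%d" % i) for i in range(1000)],
)
conn.execute("ANALYZE")  # give the planner statistics to work with

created = {"idx_a_status", "idx_a_id", "idx_b_id"}
query = """
    SELECT * FROM table_a JOIN table_b ON table_a.id = table_b.id
    WHERE table_a.status = 'success'
"""

# The last column of each EXPLAIN QUERY PLAN row is a human-readable detail string.
details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
used = {name for d in details for name in re.findall(r"INDEX (\w+)", d)}
unused = created - used

for d in details:
    print(d)
print("used:", sorted(used), "unused:", sorted(unused))
```

Which index ends up unused depends on the statistics and the SQLite version, so the script prints the plan instead of assuming a particular outcome.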

Related

JOIN only if boolean is true

Is there a way to hint to Postgres that it should only attempt to do a JOIN lookup if in_table_b is set? This is purely a performance optimization and would not change the results.
Table A:
id Serial
in_table_b Boolean
Table B:
id Int (Foreign Key A.id)
foobar Text
Current query:
SELECT A.id, A.in_table_b, B.foobar FROM A LEFT JOIN B ON a.id = B.id;
I'd like a clause that effectively says: do not try to do an index lookup on B if in_table_b is false. However, I do want to return rows in A even if there is no matching row in B.
You can't really (and, in general, shouldn't try to) control how the database executes the query. SQL is a descriptive language: you tell the database the results that you want, and trust it to make the best decision in terms of execution plan - database designers put a lot of effort into building engines that figure out the best possible strategy.
I would just add another condition to the on clause of the left join; this is the simplest way to express what you want:
select a.id, a.in_table_b, b.foobar
from a
left join b on a.id = b.id and a.in_table_b;
I would recommend an index on a(in_table_b, id): this might hint to the query planner how to proceed (the ordering of columns in the index is important here). However, the planner might also decide that a boolean column is not a very good pick, because its cardinality is low (there are only two possible values). The database might still go for the index lookup - and actually, this could very well be the most efficient option.
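A small sketch of the extra ON condition, assuming SQLite through Python's sqlite3 in place of Postgres and using made-up rows. It checks only that adding a.in_table_b to the join condition leaves the result unchanged when the flag is consistent with the contents of b; whether the planner actually skips lookups is engine-specific and not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, in_table_b BOOLEAN NOT NULL);
    CREATE TABLE b (id INTEGER REFERENCES a (id), foobar TEXT);
    -- SQLite stores booleans as 0/1 integers.
    INSERT INTO a VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO b VALUES (1, 'one'), (3, 'three');
""")

plain_sql = """
    SELECT a.id, a.in_table_b, b.foobar
    FROM a LEFT JOIN b ON a.id = b.id
    ORDER BY a.id
"""

# Same join, but a row of b can only match when the flag on a is set.
guarded_sql = """
    SELECT a.id, a.in_table_b, b.foobar
    FROM a LEFT JOIN b ON a.id = b.id AND a.in_table_b
    ORDER BY a.id
"""

plain_rows = conn.execute(plain_sql).fetchall()
guarded_rows = conn.execute(guarded_sql).fetchall()
print(plain_rows == guarded_rows)
print(guarded_rows)
```

Rows of a with the flag unset are still returned, with NULL (None in Python) for foobar, which is the behaviour the question asks for.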

SQL reduce data in join or where

I want to know what is faster, assuming I have the following queries and they retrieve the same data
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA inner join (select * from tableB where b.columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata has a visual explain too, which is simple to learn from.
Whether either method is advantageous depends on the data volumes and key type in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table build with ALL fields of tableB.
It should be (not that I recommend this query style at all)
select * from tableA inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue, and then use the explain commands to work out why.
A better way in general is to use common table expressions (CTE) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
Whenever you come across a scenario where you wonder which query would yield results faster in Teradata, use the EXPLAIN plan in Teradata, which shows how the PE is going to retrieve the records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query, you can't decide it, but you can do certain things like declaring indexes so that the DBMS takes those indexes into consideration when deciding which access path it will use to resolve the query, and then you will get a better performance.
For instance, in this example you are filtering tableB by b.columnX. Normally, if there are no indexes declared for tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX: in that case the DBMS will probably consider that index and determine an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
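To make the comparison concrete, here is a minimal sketch that runs both query shapes against the same data and confirms they return the same rows. It assumes SQLite via Python's sqlite3, not Teradata; the name column, the index name idx_b_columnx and the sample values are hypothetical. The derived table selects from tableB without the outer alias, since b is not in scope inside the subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tableB (id INTEGER, columnX INTEGER);
    CREATE INDEX idx_b_columnx ON tableB (columnX);
    INSERT INTO tableA VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO tableB VALUES (1, 10), (2, 20), (3, 10);
""")

# Filter applied in the outer WHERE clause.
filter_in_where = """
    SELECT a.id, a.name, b.columnX
    FROM tableA a INNER JOIN tableB b ON a.id = b.id
    WHERE b.columnX = ?
    ORDER BY a.id
"""

# Filter applied inside a derived table before the join.
filter_in_subquery = """
    SELECT a.id, a.name, b.columnX
    FROM tableA a
    INNER JOIN (SELECT id, columnX FROM tableB WHERE columnX = ?) b ON a.id = b.id
    ORDER BY a.id
"""

value = 10
where_rows = conn.execute(filter_in_where, (value,)).fetchall()
subquery_rows = conn.execute(filter_in_subquery, (value,)).fetchall()

# Print both plans so they can be compared by eye; their exact text is engine-specific.
for sql in (filter_in_where, filter_in_subquery):
    print([row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (value,))])

print(where_rows == subquery_rows)
```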

Is it true that switching the position of the tables in a join query increases data load speed?

For an example:
In table a we have 1000000 rows
In table b we have 5 rows
Is it faster if we use
select * from b inner join a on b.id = a.id
than
select * from a inner join b on a.id = b.id
No, JOIN order doesn't matter; the query engine will reorganize the joins based on index statistics and other factors. The JOIN order is changed during optimization.
You can test this yourself: download a test database such as AdventureWorks or Northwind, or try it on your own database, and do this:
Select "show actual execution plan" and run the first query.
Change the JOIN order and run the query again.
Compare the execution plans.
They should be identical as the query engine will reorganize them according to other factors.
The only caveat is the Option FORCE ORDER which will force joins to happen in the exact order you have them specified.
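The steps above can be sketched in a scripted form. The answer describes SQL Server tooling; this sketch instead assumes SQLite through Python's sqlite3, with a hypothetical val column and 1,000 rows in a instead of 1,000,000 to keep it quick. It checks that both join orders return the same rows and prints both plans for comparison, without asserting that the plans are textually identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, val TEXT);
""")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(i, "a%d" % i) for i in range(1000)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(i, "b%d" % i) for i in range(5)])
conn.execute("ANALYZE")  # collect statistics for the planner

b_first = "SELECT a.id, a.val, b.val FROM b INNER JOIN a ON b.id = a.id ORDER BY a.id"
a_first = "SELECT a.id, a.val, b.val FROM a INNER JOIN b ON a.id = b.id ORDER BY a.id"


def plan_details(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


rows_b_first = conn.execute(b_first).fetchall()
rows_a_first = conn.execute(a_first).fetchall()

print("b first:", plan_details(b_first))
print("a first:", plan_details(a_first))
print(rows_b_first == rows_a_first)
```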
It is unlikely. There are lots of factors on the speed of joining two tables. That is why database engines have an optimization phase, where they consider different ways of implementing the query.
There are many different options:
Nested loops, scanning b first and then a.
Nested loops, scanning a first and then b.
Sorting both tables and using a merge join.
Hashing both tables and using a hash join.
Using an index on b.id.
Using an index on a.id.
And these are just high level descriptions -- there are multiple ways to implement some of these methods. Tables can also be partitioned adding further complexity.
Join order is just one consideration.
In this case, the performance of the query is likely to depend on the size of the data being returned, rather than on the actual algorithm used for fetching the data.

Is this index defined correctly for this join usage? (Postgres)

select
    *
from
    tbl1 as a
inner join
    tbl2 as b on
        a.id = b.id
left join
    tbl3 as c on
        b.id = c.parent_id and
        c.some_col = 2 and
        c.attribute_id = 3
In the example above:
If I want optimal performance on the join, should I set the index on tbl3 as so?
parent_id,
some_col,
attribute_id
The answer depends on the chosen join type.
If PostgreSQL chooses a nested loop or a merge outer join, your index is perfect.
If PostgreSQL chooses a hash outer join, the index won't help at all. In that case you need an index on (some_col, attribute_id).
Work with EXPLAIN to make the best choice for your case.
Note: If one of the conditions on some_col and attribute_id is not selective (doesn't filter out a significant number of rows), it is often better to omit that column in the index. In that case, it is better to get the benefit of a smaller index and more HOT updates.
My answer is "Maybe". I am speaking from experience with SQL Server, so someone please correct me if I am wrong and it is different in Postgres.
Your index looks fine for the most part. An issue that may arise is using the SELECT *. If tbl3 has more columns than what is defined in your index and you are querying those fields, they won't be in your index and the engine will have to do additional lookups outside that index.
Another thing would be based on the cardinality of your fields, meaning which are the most selective. If parent_id has a high cardinality, meaning very few duplicates, it could cause more reads against the index. However, if your lowest cardinality field is first and the db can quickly filter out huge chunks of data, that might be more efficient.
I have seen both work very well in SQL Server. SQL Server has even recommended indexes, I apply them, and then it recommends a different one based on field cardinality. Again, I am not familiar with the Postgres engine and I am just assuming these topics apply across both. If all else fails, create 3 indexes with different column order and see which one the engine likes the best.

Many to one joins

Let's say I have the following tables:
create table table_a
(
    id_a,
    name_a,
    primary key (id_a)
);
create table table_b
(
    id_b,
    id_a not null, -- (Edit)
    name_b,
    primary key (id_b),
    foreign key (id_a) references table_a (id_a)
);
We can create a join view on these tables in a number of ways:
create view join_1 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_a a, table_b b
where a.id_a = b.id_a
);
create view join_2 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_b b left outer join table_a a
on a.id_a = b.id_a
);
create view join_3 as
(
select
b.id_b,
b.id_a,
b.name_b,
(select a.name_a from table_a a where b.id_a = a.id_a) as name_a
from table_b b
);
Here we know:
(1) There must be at least one entry from table_a with id_a (due to the foreign key in table B) AND
(2) At most one entry from table_a with id_a (due to the primary key on table A)
then we know that there is exactly one entry in table_a that links in with the join.
Now consider the following SQL:
select id_b, name_b from join_X;
Note that this selects no columns from table_a, and because we know that in this join each row of table_b joins to exactly one row of table_a, we really shouldn't have to look at table_a when performing the above select.
So what is the best way to write the above join view?
Should I just use the standard join_1, and hope the optimizer figures out that there is no need to access table_a based on the primary and foreign keys?
Or is it better to write it like join_2 or even join_3, which makes it more explicit that there is exactly one row from the join for each row from table_b?
Edit + additional question
Is there any time I should prefer a sub-select (as in join_3) over a normal join (as in join_1)?
Why are you using views at all for this purpose? If you want to get data from the table, get it from the table.
Or, if you need a view to translate some columns in the tables (such as coalescing NULLs into zeros), create a view over just the B table. This will also apply if some DBA wanna-be has implemented a policy that all selects must be via views rather than tables :-)
In both those cases, you don't have to worry about multi-table access.
Intuitively, I'd think join_1 will perform slightly slower, because your assumption that the optimiser can transform the join away is wrong as you didn't declare the table_b.id_a column to be NOT NULL. In fact, this means that (1) is wrong. table_b.id_a can be NULL. Even if you know it can't be, the optimiser doesn't know that.
As far as join_2 and join_3 are concerned, depending on your database, optimisation might be possible. The best way to find out is to run (Oracle syntax)
EXPLAIN PLAN FOR select id_b, name_b from join_X;
and study the execution plan. It will tell you whether table_a was joined or not. On the other hand, if your view should be reusable, then I'd go for the plain join and forget about premature optimisations. You can achieve better results with proper statistics and indexes, as a join operation is not always so expensive. But that depends on your statistics, of course.
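As a rough way to try this out, the sketch below builds the three views and compares both their output and their plans. It assumes SQLite through Python's sqlite3, not Oracle, with column types and sample rows invented for illustration, id_a declared NOT NULL, and the sub-select in join_3 correlated on id_a. Whether table_a shows up in a plan depends on the engine and its version, so the plans are printed and not asserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id_a INTEGER PRIMARY KEY, name_a TEXT);
    CREATE TABLE table_b (
        id_b   INTEGER PRIMARY KEY,
        id_a   INTEGER NOT NULL REFERENCES table_a (id_a),
        name_b TEXT
    );
    INSERT INTO table_a VALUES (1, 'A1'), (2, 'A2');
    INSERT INTO table_b VALUES (10, 1, 'B10'), (11, 1, 'B11'), (12, 2, 'B12');

    CREATE VIEW join_1 AS
        SELECT b.id_b, b.id_a, b.name_b, a.name_a
        FROM table_a a, table_b b
        WHERE a.id_a = b.id_a;

    CREATE VIEW join_2 AS
        SELECT b.id_b, b.id_a, b.name_b, a.name_a
        FROM table_b b LEFT OUTER JOIN table_a a ON a.id_a = b.id_a;

    CREATE VIEW join_3 AS
        SELECT b.id_b, b.id_a, b.name_b,
               (SELECT a.name_a FROM table_a a WHERE b.id_a = a.id_a) AS name_a
        FROM table_b b;
""")

results = {}
plans = {}
for view in ("join_1", "join_2", "join_3"):
    sql = "SELECT id_b, name_b FROM %s ORDER BY id_b" % view
    results[view] = conn.execute(sql).fetchall()
    # Keep the detail column of each plan row to see which tables are touched.
    plans[view] = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

for view, details in plans.items():
    print(view, details)
print(results["join_1"] == results["join_2"] == results["join_3"])
```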
1 + 2 are effectively identical under SQL Server.
I have never used [3], but it looks quite odd. I would strongly suspect the optimizer will make it equivalent to the other 2.
It's a good exercise to run all 3 statements and compare the execution plans produced.
So, given identical performance, the clearest to read gets my vote: [2] is the standard where it is supported, otherwise [1].
In your case, if you don't want any columns from A, why include Table_A in the statement anyway?
If it's simply a filter - i.e. only include rows from Table B where a row exists in Table A, even though I don't want any cols from Table A - then all 3 syntaxes are fine, although you may find that use of EXISTS is more performant in some dbs:
SELECT * from Table_B b WHERE EXISTS (SELECT 1 FROM Table_A a WHERE b.id_a = a.id_a)
although in my experience this is usually equivalent in performance to any of the others.
You also ask when you would choose a subquery over the other expressions. This boils down to whether it's a CORRELATED subquery or not.
Basically - a correlated subquery has to be run once for every row in the outer statement - this is true of the above - for every row in Table B you must run the subquery against Table A.
If the subquery can be run just once
SELECT * from Table_B b WHERE b.id_a IN (SELECT a.id_a FROM Table_A a WHERE a.id_a > 10)
Then the subquery is generally more performant than a join - although I suspect that some optimizers will still be able to spot this and reduce both to the same execution plan.
Again, the best thing to do is to run both statements, and compare execution plans.
Finally and most simply - given the FK you could just write:
SELECT * From Table_B b WHERE b.id_a IS NOT NULL
It will depend on the platform.
SQL Server both analyses the logical implications of constraints (foreign keys, primary keys, etc) and expands VIEWs inline. This means that the 'irrelevant' portion of the VIEW's code is obsoleted by the optimiser. SQL Server would give the exact same execution plan for all three cases. (Note; there is a limit to the complexity that the optimiser can handle, but it can certainly handle this.)
Not all platforms are created equal, however.
- Some may not analyse constraints in the same way, assuming you coded the join for a reason
- Some may pre-compile the VIEW's execution/explain plan
As such, to determine behaviour, you must be aware of the specific platform's capabilities. In the vast majority of situations the optimiser is a complex beast, and so the best test is simply to try it and see.
EDIT
In response to your extra question, are correlated sub-queries ever preferred? There is no simple answer, as it depends on the data and the logic you are trying to implement.
There are certainly cases in the past where I have used them, both to simplify a query's structure (for maintainability) and to enable specific logic.
If the field table_b.id_a referenced many entries in table_a you may only want the name from the latest one. And you could implement that using (SELECT TOP 1 name_a FROM table_a WHERE id_a = table_b.id_a ORDER BY id_a DESC).
In short, it depends.
- On the query's logic
- On the data's structure
- On the code's final layout
More often than not I find it's not needed, but measurably often I find that it is a positive choice.
Note:
Depending on the correlated sub-query, it doesn't actually always get executed 'once for every record'. SQL Server, for example, expands the required logic to be executed in-line with the rest of the query. It's important to note that SQL code is processed/compiled/whatever before being executed. SQL is simply a method for articulating set based logic, which is then transformed into traditional loops, etc, using the most optimal algorithms available to the optimiser.
Other RDBMS may perform differently, due to the capabilities or limitations of the optimiser. Some RDBMS perform well when using IN (SELECT blah FROM blah), or when using EXISTS (SELECT * FROM blah), but some perform terribly. The same applies to correlated sub-queries. Some perform exceptionally well with them, some don't perform so well, but most handle them very well in my experience.