JOIN only if boolean is true - sql

Is there a way to hint to Postgres that it should only attempt to do a JOIN lookup if in_table_b is set? This is purely a performance optimization and would not change the results.
Table A:
id Serial
in_table_b Boolean
Table B:
id Int (Foreign Key A.id)
foobar Text
Current query:
SELECT A.id, A.in_table_b, B.foobar FROM A LEFT JOIN B ON A.id = B.id;
I'd like a clause that effectively says: do not try to do an index lookup on B if in_table_b is false. However, I do want to return rows in A even if there is no matching row in B.

You can't really (and, in general, shouldn't try to) control how the database executes the query. SQL is a descriptive language: you tell the database the results that you want, and trust it to make the best decision in terms of execution plan - database designers put a lot of effort into building engines that figure out the best possible strategy.
I would just add another condition to the on clause of the left join; this is the simplest way to express what you want:
select a.id, a.in_table_b, b.foobar
from a
left join b on a.id = b.id and a.in_table_b;
I would recommend an index on a(in_table_b, id): this might hint the query planner on how to proceed (the ordering of columns in the index is important here). However, the planner might also decide that a boolean column is not a very good pick, because its cardinality is low (there are only two possible values). The database might still go for the index lookup - and actually, this could very well be the most efficient option.
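As a minimal sketch of the suggested rewrite, here is the query run against SQLite (which, unlike Postgres, has no true boolean type, so the extra ON condition is written as `a.in_table_b = 1`; the schema below just mirrors the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, in_table_b INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY REFERENCES a(id), foobar TEXT);
    INSERT INTO a VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO b VALUES (1, 'one'), (3, 'three');
""")

rows = conn.execute("""
    SELECT a.id, a.in_table_b, b.foobar
    FROM a
    LEFT JOIN b ON a.id = b.id AND a.in_table_b = 1
    ORDER BY a.id
""").fetchall()

# Every row of a survives the LEFT JOIN; b.foobar is NULL wherever
# in_table_b is false, so the extra ON condition doesn't drop rows.
print(rows)  # [(1, 1, 'one'), (2, 0, None), (3, 1, 'three')]
```

The key point: putting the condition in the ON clause (rather than WHERE) keeps all rows of `a`, which is exactly the asker's requirement.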

Related

SQL reduce data in join or where

I want to know what is faster, assuming I have the following queries and they retrieve the same data
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata also has a visual explain, which is simple to learn from.
Whether either method is advantageous depends on the data volumes and key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table build of ALL fields of tableB.
It should be (not that I recommend this query style at all):
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue, and then use the explain commands to work out why.
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
Whenever you come across such a scenario and wonder which query would yield results faster in Teradata, use the EXPLAIN plan, which properly shows how the PE is going to retrieve records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query; you can't choose it directly. However, you can do certain things, like declaring indexes, so that the DBMS takes those indexes into consideration when deciding which access path to use, and then you will get better performance.
For instance, in this example you are filtering tableB by b.columnX. Normally, if there are no indexes declared on tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX: in that case the DBMS will probably consider that index and determine an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
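To see the equivalence claim concretely, here is a sketch using SQLite rather than Teradata (the `payload` column is invented for illustration; table and filter names follow the question). `EXPLAIN QUERY PLAN` is SQLite's rough counterpart to Teradata's EXPLAIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, columnX INTEGER);
    INSERT INTO tableA VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO tableB VALUES (1, 10), (2, 20), (3, 10);
""")

# Form 1: filter in the WHERE clause of the join.
q1 = """SELECT a.id FROM tableA a
        JOIN tableB b ON a.id = b.id WHERE b.columnX = 10 ORDER BY a.id"""
# Form 2: filter pushed into a derived table.
q2 = """SELECT a.id FROM tableA a
        JOIN (SELECT id FROM tableB WHERE columnX = 10) b
          ON a.id = b.id ORDER BY a.id"""

r1 = conn.execute(q1).fetchall()
r2 = conn.execute(q2).fetchall()
print(r1 == r2)  # True: identical result sets

# Inspect how the engine actually executes each form:
for row in conn.execute("EXPLAIN QUERY PLAN " + q1):
    print(row)
```

Comparing the two plans (rather than guessing) is exactly the advice given above.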

Where (implicit inner join) vs. explicit inner join - does it affect indexing?

For the query
SELECT * from table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
or
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
Somehow, I would tend to create one index (id, status) on table_a for the top form,
whereas my natural tendency for the bottom form would be to create two separate indices,
id and status, on table_a.
The two queries are effectively the same, right? Would you index both the same way?
How would you index table_a (assuming this is the only query that exists in the system, to avoid other considerations)? One index or two?
The "traditional style" and the SQL 92 style inner join are semantically equivalent, and most DBMS will treat them the same (Oracle, for example, does). They will use the same execution plan for both forms (this is, nevertheless, implementation-dependent, and not guaranteed by any standard).
Hence, indexes are used the same way in both forms, too.
Independently of the syntax you use, the appropriate indexing strategy is implementation-dependent: some DBMS (such as Postgres) generally prefer single-column indexes and can combine them very efficiently, others, such as Oracle, can take more advantage from combined (or even covering) indexes (although both forms work for both DBMS of course).
Regarding the syntax of your example, the position of the WHERE clause in the second query surprises me a little bit.
The following two queries are processed the same way in most DBMS:
SELECT * FROM table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
and
SELECT * FROM table_a JOIN b ON table_a.id = b.id WHERE table_a.status ='success'
However, your second query shifts the WHERE clause inside the FROM clause, which is not valid SQL.
A quick check for
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
confirms: MySQL 5.5, Postgres 9.3, and Oracle 11g all yield a syntax error for it.
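The same quick check can be reproduced with SQLite: both valid forms return identical rows, and the misplaced WHERE is rejected there too. A sketch (table names follow the question; the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, status TEXT);
    CREATE TABLE b (id INTEGER);
    INSERT INTO table_a VALUES (1, 'success'), (2, 'failed');
    INSERT INTO b VALUES (1), (2);
""")

# Implicit ("traditional style") join:
implicit = conn.execute("""
    SELECT * FROM table_a, b
    WHERE table_a.id = b.id AND table_a.status = 'success'
""").fetchall()
# Explicit SQL-92 join:
explicit = conn.execute("""
    SELECT * FROM table_a JOIN b ON table_a.id = b.id
    WHERE table_a.status = 'success'
""").fetchall()
print(implicit == explicit)  # True

# WHERE before JOIN is a syntax error here as well:
syntax_error = False
try:
    conn.execute("""
        SELECT * FROM table_a WHERE table_a.status = 'success'
        JOIN b ON table_a.id = b.id
    """)
except sqlite3.OperationalError as e:
    syntax_error = True
    print("rejected:", e)
```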
The two queries should be optimized to perform the same way; however, the JOIN syntax is ANSI compliant, and the older version should be considered deprecated. As far as index usage is concerned, you only want to touch a table (index) once. The RDBMS and table design you are using will determine the specifics, such as whether or not you need to include the PRIMARY KEY (assuming that's what id represents in your example) in a covering index. Also, SELECT * may or may not be covered; it's better to use specific column names.
Well, you ruled out other queries, but there are still open questions, particularly about the data distribution. E.g. how does the number of rows WHERE table_a.status ='success' compare to the size of table_b? Based on its estimates, the optimizer has to make two important decisions:
Which join algorithm to use (Nested Loops, Hash, or Sort/Merge)
In which order to process the tables
Unfortunately, these decisions affect indexing (and are affected by indexing!)
Example: consider there is only one row WHERE table_a.status ='success'. Then it would be fine to have an index on table_a.status to find that row quickly. Next, we'd like to have an index on table_b.id to find the corresponding rows quickly using a nested loops join. Considering that you SELECT *, it doesn't make any sense to include additional columns in these indexes (not considering any other queries in the system).
But now imagine that you don't have an index on table_a.status but on table_a.id, and that this table is huge compared to table_b. For demonstration, let's assume table_b has only one row (an extreme case, of course). Now it would be better to go to table_b, fetch all rows (just one), and then fetch the corresponding rows from table_a using the index. You see how indexing affects the join order? (For a nested loops join, in this example.)
This is just one simple example of how things interact. Most databases have three join algorithms to choose from (MySQL being an exception).
If you create the three mentioned indexes and look at how the database executes the join (explain plan), you'll note that one or two of the indexes remain unused for the specific join algorithm and join order selected for your query. In theory, you could drop those indexes. However, keep in mind that the optimizer makes its decisions based on the statistics available to it, and that its estimations might be wrong.
You can find more about indexing joins on my website: http://use-the-index-luke.com/sql/join

Several layers of nested subqueries with Exists/In, best performance?

I'm working on some rather large queries for a search function. There are a number of different inputs and the queries are pretty big as a result. It's grown to where there are nested subqueries 2 layers deep. Performance has become an issue on the ones that will return a large dataset and likely have to sift through a massive load of records to do so. The ones that have less comparing to do perform fine, but some of these are getting pretty bad. The database is DB2 and has all of the necessary indexes, so that shouldn't be an issue. I'm wondering how to best write/rewrite these queries to perform as I'm not quite sure how the optimizer is going to handle it. I obviously can't dump the whole thing here, but here's an example:
Select A, B
from TableA
--A series of joins--
WHERE TableA.A IN (
Select C
from TableB
--A few joins--
WHERE TableB.C IN (
Select D from TableC
--More joins and conditionals--
)
)
There are also plenty of conditionals sprinkled throughout, the vast majority of which are simple equality. You get the idea. The subqueries do not provide any data to the initial query. They exist only to filter the results. A problem I ran into early on is that the backend is written to contain a number of partial query strings that get assembled into the final query (with 100+ possible combinations due to the search options, it simply isn't feasible to write a query for each), which has complicated the overall method a bit. I'm wondering if EXISTS instead of IN might help at one or both levels, or another bunch of joins instead of subqueries, or perhaps using WITH above the initial query for TableC, etc. I'm definitely looking to remove bottlenecks and would appreciate any feedback that folks might have on how to handle this.
I should probably also add that there are potential unions within both subqueries.
It would probably help to use inner joins instead (note that, unlike IN, a join can return a row more than once if the join keys are not unique, so you may need SELECT DISTINCT).
Select A, B
from TableA
inner join TableB on TableA.A = TableB.C
inner join TableC on TableB.C = TableC.D
Databases were designed for joins, but the optimizer might not figure out that it can use an index for a sub-query. Instead it will probably try to run the sub-query, hold the results in memory, and then do a linear search to evaluate the IN operator for every record.
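A sketch of the rewrite using SQLite (table names follow the question; data is invented, and TableC deliberately contains a duplicate to show why DISTINCT may be needed in the join form):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (A INTEGER, B TEXT);
    CREATE TABLE TableB (C INTEGER);
    CREATE TABLE TableC (D INTEGER);
    INSERT INTO TableA VALUES (1, 'one'), (2, 'two');
    INSERT INTO TableB VALUES (1), (2);
    INSERT INTO TableC VALUES (1), (1);
""")

# Original style: nested IN subqueries used purely as filters.
nested = conn.execute("""
    SELECT A, B FROM TableA
    WHERE TableA.A IN (SELECT C FROM TableB
                       WHERE TableB.C IN (SELECT D FROM TableC))
""").fetchall()

# Join rewrite: DISTINCT compensates for TableC.D = 1 appearing twice,
# which would otherwise duplicate the matching TableA row.
joined = conn.execute("""
    SELECT DISTINCT A, B FROM TableA
    JOIN TableB ON TableA.A = TableB.C
    JOIN TableC ON TableB.C = TableC.D
""").fetchall()
print(sorted(nested) == sorted(joined))  # True
```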
Now, you say that you have all of the necessary indexes. Consider this for a moment.
If one optional condition is TableC.E = 'E' and another optional condition is TableC.F = 'F',
then a query with both would need an index on fields TableC.E AND TableC.F. Many young programmers today think they can have one index on TableC.E and one index on TableC.F, and that's all they need. In fact, if you have both fields in the query, you need an index on both fields.
So, for 100+ combinations, "all of the necessary indexes" could require 100+ indexes.
Now an index on (TableC.E, TableC.F) could be used in a query with a TableC.E condition and no TableC.F condition, but could not be used when there is a TableC.F condition and no TableC.E condition.
Hundreds of indexes? What am I going to do?
In practice it's not that bad. Let's say you have N optional conditions which are either in the WHERE clause or not. The number of combinations is 2 to the Nth, so for hundreds of combinations, N is log2 of the number of combinations, which is between 6 and 10. Also, those N conditions are spread across three tables. Some databases support multi-table indexes, but I'm not sure DB2 does, so I'd stick with single-table indexes.
So, what I am saying is, say for the TableC.E, and TableC.F example, it's not enough to have just the following indexes:
TableB ON C
TableC ON D
TableC ON E
TableC ON F
For one thing, the optimizer has to pick which one of the last three indexes to use. Better would be to include the D field in the last two indexes, which gives us
TableB ON C
TableC ON D, E
TableC ON D, F
Here, if neither field E nor F is in the query, it can still index on D, but if either one is in the query, it can index on both D and one other field.
Now suppose you have an index for 10 fields which may or may not be in the query. Why ever have just one field in the index? Why not add other fields in descending order of likelihood of being in the query?
Consider that when planning your indexes.
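The leading-column rule described above can be observed directly. A sketch using SQLite (DB2's optimizer differs, but the composite-index principle is the same; index and column names are from the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableC (D INTEGER, E INTEGER, F INTEGER)")
conn.execute("CREATE INDEX idx_ef ON TableC (E, F)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a text description of each step.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Condition on the leading column E: the composite index is usable.
print(plan("SELECT * FROM TableC WHERE E = 1"))
# Condition only on the trailing column F: the index can't seek,
# so the planner falls back to a full table scan.
print(plan("SELECT * FROM TableC WHERE F = 1"))
```

Running this shows `USING INDEX idx_ef` only for the E-condition query, matching the claim that an index on (E, F) helps an E-only predicate but not an F-only one.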
I found that the IN predicate works well for small subqueries and EXISTS for large subqueries.
Try executing the query with an EXISTS predicate for the large ones:
SELECT A, B
FROM TableA
WHERE EXISTS (
SELECT C
FROM TableB
WHERE TableB.C = TableA.A)
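A sketch verifying that the IN and correlated EXISTS forms filter identically, using SQLite with invented data (note the duplicate in TableB: both predicates still return each TableA row at most once, unlike a plain join):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (A INTEGER, B TEXT);
    CREATE TABLE TableB (C INTEGER);
    INSERT INTO TableA VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO TableB VALUES (1), (3), (3);
""")

in_rows = conn.execute("""
    SELECT A, B FROM TableA
    WHERE TableA.A IN (SELECT C FROM TableB)
    ORDER BY A
""").fetchall()

exists_rows = conn.execute("""
    SELECT A, B FROM TableA
    WHERE EXISTS (SELECT 1 FROM TableB WHERE TableB.C = TableA.A)
    ORDER BY A
""").fetchall()

print(in_rows == exists_rows)  # True, even with duplicates in TableB
```

Which form is faster is engine- and data-dependent, as the answer says, so comparing plans on your own DB2 instance is the real test.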

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't change the results, but does it affect performance?
One can assume that the data in table C is much larger than that in A or B.
For each SQL statement, the engine will create a query plan, so no matter how you order the tables, the engine will choose a correct path to execute the query.
You can read more about query plans at http://en.wikipedia.org/wiki/Query_plan
Depending on what RDBMS you are using, there are ways to enforce the query order and plan using hints, if you feel that the engine is not choosing the correct path.
Sometimes the order of the tables makes a difference here (when you are using different joins).
Conceptually, joins are built on the cross product: a query like A JOIN B JOIN C is treated as ((A JOIN B) JOIN C).
That means the result of joining tables A and B is computed first, and that intermediate result is then joined with table C.
So if inner joining A (100 records) and B (200 records) yields 100 records,
those 100 records are then compared with the 1,000 records of C.
No.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.
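The "results don't change" part is easy to demonstrate. A sketch using SQLite with invented tables, joining them in all six possible FROM-clause orders (the comma-join form is used so the WHERE conditions are valid for every permutation):

```python
import sqlite3
from itertools import permutations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ta (propertyA INTEGER);
    CREATE TABLE tb (propertyA INTEGER, propertyB INTEGER);
    CREATE TABLE tc (propertyB INTEGER);
    INSERT INTO ta VALUES (1), (2);
    INSERT INTO tb VALUES (1, 10), (2, 20);
    INSERT INTO tc VALUES (10), (20), (30);
""")

q = """SELECT ta.propertyA, tb.propertyB
       FROM {0}, {1}, {2}
       WHERE ta.propertyA = tb.propertyA
         AND tb.propertyB = tc.propertyB"""

# Collect the result set for every ordering of the three tables.
results = set()
for order in permutations(["ta", "tb", "tc"]):
    rows = frozenset(conn.execute(q.format(*order)).fetchall())
    results.add(rows)

print(len(results))  # 1: every ordering yields the same rows
```

Whether the *plan* differs per ordering is a separate, engine-specific question, as the answers above explain.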

Many to one joins

Let's say I have the following tables:
create table table_a
(
id_a,
name_a,
primary_key (id_a)
);
create table table_b
(
id_b,
id_a not null, -- (Edit)
name_b,
primary_key (id_b),
foreign_key (id_a) references table_a (id_a)
);
We can create a join view on these tables in a number of ways:
create view join_1 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_a a, table_b b
where a.id_a = b.id_a
);
create view join_2 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_b b left outer join table_a a
on a.id_a = b.id_a
);
create view join_3 as
(
select
b.id_b,
b.id_a,
b.name_b,
(select a.name_a from table_a a where b.id_a = a.id_a) as name_a
from table_b b
);
Here we know:
(1) There must be at least one entry from table_a with id_a (due to the foreign key in table B) AND
(2) At most one entry from table_a with id_a (due to the primary key on table A)
then we know that there is exactly one entry in table_a that links in with the join.
Now consider the following SQL:
select id_b, name_b from join_X;
Note that this selects no columns from table_a, and because we know that in this join each row of table_b joins to exactly one row of table_a, we really shouldn't have to look at table_a when performing the above select.
So what is the best way to write the above join view?
Should I just use the standard join_1, and hope the optimizer figures out that there is no need to access table_a based on the primary and foreign keys?
Or is it better to write it like join_2 or even join_3, which makes it more explicit that there is exactly one row from the join for each row from table_b?
Edit + additional question
Is there any time I should prefer a sub-select (as in join_3) over a normal join (as in join_1)?
Why are you using views at all for this purpose? If you want to get data from the table, get it from the table.
Or, if you need a view to translate some columns in the tables (such as coalescing NULLs into zeros), create a view over just the B table. This will also apply if some DBA wanna-be has implemented a policy that all selects must be via views rather than tables :-)
In both those cases, you don't have to worry about multi-table access.
Intuitively, I'd think join_1 will perform slightly slower, because your assumption that the optimiser can transform the join away is wrong: you didn't originally declare the table_b.id_a column to be NOT NULL. In fact, this means that (1) is wrong; table_b.id_a can be NULL. Even if you know it can't be, the optimiser doesn't know that.
As far as join_2 and join_3 is concerned, depending on your database, optimisation might be possible. The best way to find out is to run (Oracle syntax)
EXPLAIN select id_b, name_b from join_X;
And study the execution plan. It will tell you whether table_a was joined or not. On the other hand, if your view should be reusable, then I'd go for the plain join and forget about pre-mature optimisations. You can achieve better results with proper statistics and indexes, as a join operation is not always so expensive. But that depends on your statistics, of course.
1 + 2 are effectively identical under SQL Server.
I have never used [3], but it looks quite odd. I would strongly suspect the optimizer will make it equivalent to the other 2.
It's a good exercise to run all 3 statements and compare the execution plans produced.
So given identical performance, the clearest to read gets my vote - [2] is the standard where it is supported, otherwise [1].
In your case, if you don't want any columns from A, why include Table_A in the statement anyway?
If it's simply a filter - i.e. only include rows from Table B where a row exists in Table A, even though I don't want any cols from Table A - then all 3 syntaxes are fine, although you may find that use of EXISTS is more performant in some dbs:
SELECT * from Table_B b WHERE EXISTS (SELECT 1 FROM Table_A a WHERE b.id_a = a.id_a)
although in my experience this is usually equivalent in performance to any of the others.
You also ask, then would you choose a subquery over the other expressions. This boils down to whether it's a CORRELATED subquery or not.
Basically - a correlated subquery has to be run once for every row in the outer statement - this is true of the above - for every row in Table B you must run the subquery against Table A.
If the subquery can be run just once
SELECT * from Table_B b WHERE b.id_a IN (SELECT a.id_a FROM Table_A a WHERE a.id_a > 10)
Then the subquery is generally more performant than a join - although I suspect that some optimizers will still be able to spot this and reduce both to the same execution plan.
Again, the best thing to do is to run both statements, and compare execution plans.
Finally and most simply - given the FK you could just write:
SELECT * From Table_B b WHERE b.id_a IS NOT NULL
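Given the FK plus the NOT NULL declaration, the EXISTS, IN, and IS NOT NULL filters are all equivalent. A sketch checking this with SQLite (schema follows the question; data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id_a INTEGER PRIMARY KEY, name_a TEXT);
    CREATE TABLE table_b (
        id_b INTEGER PRIMARY KEY,
        id_a INTEGER NOT NULL REFERENCES table_a (id_a),
        name_b TEXT
    );
    INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table_b VALUES (10, 1, 'b1'), (11, 2, 'b2');
""")

exists_q = """SELECT id_b FROM table_b b
              WHERE EXISTS (SELECT 1 FROM table_a a WHERE b.id_a = a.id_a)"""
in_q = "SELECT id_b FROM table_b b WHERE b.id_a IN (SELECT id_a FROM table_a)"
nn_q = "SELECT id_b FROM table_b b WHERE b.id_a IS NOT NULL"

# All three filters keep exactly the same table_b rows.
r = [sorted(conn.execute(q).fetchall()) for q in (exists_q, in_q, nn_q)]
print(r[0] == r[1] == r[2])  # True
```

Note this relies on the constraints being enforced; with orphaned or NULL `id_a` values the IS NOT NULL shortcut would diverge from the other two.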
It will depend on the platform.
SQL Server both analyses the logical implications of constraints (foreign keys, primary keys, etc) and expands VIEWs inline. This means that the 'irrelevant' portion of the VIEW's code is obsoleted by the optimiser. SQL Server would give the exact same execution plan for all three cases. (Note; there is a limit to the complexity that the optimiser can handle, but it can certainly handle this.)
Not all platforms are created equal, however.
- Some may not analyse constraints in the same way, assuming you coded the join for a reason
- Some may pre-compile the VIEW's execution/explain plan
As such, to determine behaviour, you must be aware of the specific platform's capabilities. In the vast majority of situations the optimiser is a complex beast, and so the best test is simply to try it and see.
EDIT
In response to your extra question: are correlated sub-queries ever preferred? There is no simple answer, as it depends on the data and the logic you are trying to implement.
There are certainly cases in the past where I have used them, both to simplify a query's structure (for maintainability) and also to enable specific logic.
If the field table_b.id_a referenced many entries in table_a you may only want the name from the latest one. And you could implement that using (SELECT TOP 1 name_a FROM table_a WHERE id_a = table_b.id_a ORDER BY id_a DESC).
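A sketch of that "latest matching row" pattern in SQLite, which spells SQL Server's `TOP 1` as `ORDER BY ... LIMIT 1`. The `seq` column is invented here to give "latest" a concrete meaning when several table_a rows share an id_a:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id_a INTEGER, seq INTEGER, name_a TEXT);
    CREATE TABLE table_b (id_b INTEGER PRIMARY KEY, id_a INTEGER);
    INSERT INTO table_a VALUES (1, 1, 'old'), (1, 2, 'new');
    INSERT INTO table_b VALUES (10, 1);
""")

# Correlated scalar subquery: for each table_b row, pick the name_a
# from the most recent matching table_a row.
rows = conn.execute("""
    SELECT id_b,
           (SELECT name_a FROM table_a
            WHERE table_a.id_a = table_b.id_a
            ORDER BY seq DESC LIMIT 1) AS latest_name
    FROM table_b
""").fetchall()
print(rows)  # [(10, 'new')]
```

This is the kind of logic that is awkward to express as a plain join, which is why a correlated subquery can be the positive choice described above.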
In short, it depends.
- On the query's logic
- On the data's structure
- On the code's final layout
More often than not I find it's not needed, but measurably often I find that it is a positive choice.
Note:
Depending on the correlated sub-query, it doesn't actually always get executed 'once for every record'. SQL Server, for example, expands the required logic to be executed in-line with the rest of the query. It's important to note that SQL code is processed/compiled/whatever before being executed. SQL is simply a method for articulating set-based logic, which is then transformed into traditional loops, etc., using the most optimal algorithms available to the optimiser.
Other RDBMS may perform differently, due to the capabilities or limitations of the optimiser. Some RDBMS perform well when using IN (SELECT blah FROM blah), or when using EXISTS (SELECT * FROM blah), but some perform terribly. The same applies to correlated sub-queries: some perform exceptionally well with them, some don't perform so well, but most handle them very well in my experience.