JOIN whole tables vs JOIN tables with less columns (fetched via subquery) - sql

If I were to do a join between two tables, would there be any difference with respect to the fact I joined them as a whole, or I joined them after having extracted only the required columns (making the assumption that each table potentially has many columns)?
As an example, is
SELECT tableA.foreignKey, tableB.someValue
FROM tableA JOIN tableB ON tableA.foreignKey=tableB.key
any different from
SELECT tableA.foreignKey, tableB.someValue
FROM (SELECT foreignKey FROM tableA) tableA_filtered
JOIN (SELECT key, someValue FROM tableB) tableB_filtered
ON tableA_filtered.foreignKey=tableB_filtered.key
performace-wise?

Use the first one, since the second one uses subquery which creates a temporary table for the result. And actually (SELECT valueA FROM tableA) makes no sense at all because you are not aggregating some columns on the table.
Subquery are sometimes evil, not at all times. Tt depends upon the RDBMS you are using.

The general rule is that a subquery will always be slow.
Depending on the amount of data you are processing it could have a big impact.
Reciently i removed a subquery from a large select with alot of joins.
The SQL was processing about 100,000 rows if not more.
Removing the very simple sub select improved performance by 50 seconds.
Overall the sql was taking two minutes. So it had a big impact.

I think in the case that the tables have many columns, the second query could be faster. But its important to note that this two querys are not equivalent. The first one displays you all values from A and B, the second one only valueA from A and valueB from B! Anyway its more a theoretical question and hard to answer generally.
In reality I would leave this decisions to the database optimizer. But if you want to really know if there is a way to make it even faster, the only safe way is to measure and compare the runtimes of both querys.
And as a sidenote, the second query will very probable be flattened by the rewrite engine of your DBMS, so its the same as when you would write:
SELECT valueA, valueB FROM A, B WHERE A.valueA = B.valueB;

Related

Is Except operator computational expensive

I have a table which includes 30 records, and a smaller table has 10 records, both tables have the same schema. All I want to do is to return a table, whose records are in the big table, but not in the small table. The solution I found is to use Except operator. However, when I run the query, it took me about 30 mins. so I am just wondering that if Except is computational expensive and it took a lot of resources?
Is there any functions can replace Except? Thanks for any help !
EXCEPT is a set operator and it should be reasonably optimized. It does remove duplicate values, so there is a bit more overhead than one might expect.
It is not so unoptimized that it would take 30 seconds on such small tables, unless you have columns whose size measures in many megabytes. Something else might be going on -- such as network or server contention.
EXCEPT is a very reasonable approach. NOT IN has a problem with NULL values and only works with one column. NOT EXISTS is going to work best when you have an appropriate index. Under some circumstances, EXCEPT is faster than NOT EXISTS.
In this case, you should be using EXISTS. It is one of the most performant operations in SQL Server
SELECT *
FROM big_table b
WHERE NOT EXISTS (
SELECT 1
FROM small_table s
WHERE s.id = b.id)
There is no need to make things complicated for something so simple.
SELECT * FROM Table1 WHERE ID NOT IN (SELECT ID FROM Table2)

Several layers of nested subqueries with Exists/In, best performance?

I'm working on some rather large queries for a search function. There are a number of different inputs and the queries are pretty big as a result. It's grown to where there are nested subqueries 2 layers deep. Performance has become an issue on the ones that will return a large dataset and likely have to sift through a massive load of records to do so. The ones that have less comparing to do perform fine, but some of these are getting pretty bad. The database is DB2 and has all of the necessary indexes, so that shouldn't be an issue. I'm wondering how to best write/rewrite these queries to perform as I'm not quite sure how the optimizer is going to handle it. I obviously can't dump the whole thing here, but here's an example:
Select A, B
from TableA
--A series of joins--
WHERE TableA.A IN (
Select C
from TableB
--A few joins--
WHERE TableB.C IN (
Select D from TableC
--More joins and conditionals--
)
)
There are also plenty of conditionals sprinkled throughout, the vast majority of which are simple equality. You get the idea. The subqueries do not provide any data to the initial query. They exist only to filter the results. A problem I ran into early on is that the backend is written to contain a number of partial query strings that get assembled into the final query (with 100+ possible combinations due to the search options, it simply isn't feasible to write a query for each), which has complicated the overall method a bit. I'm wondering if EXISTS instead of IN might help at one or both levels, or another bunch of joins instead of subqueries, or perhaps using WITH above the initial query for TableC, etc. I'm definitely looking to remove bottlenecks and would appreciate any feedback that folks might have on how to handle this.
I should probably also add that there are potential unions within both subqueries.
It would probably help to use inner joins instead.
Select A, B
from TableA
inner join TableB on TableA.A = TableB.C
inner join TableC on TableB.C = TableC.D
Databases were designed for joins, but the optimizer might not figure out that it can use an index for a sub-query. Instead it will probably try to run the sub-query, hold the results in memory, and then do a linear search to evaluate the IN operator for every record.
Now, you say that you have all of the necessary indexes. Consider this for a moment.
If one optional condition is TableC.E = 'E' and another optional condition is TableC.F = 'F',
then a query with both would need an index on fields TableC.E AND TableC.F. Many young programmers today think they can have one index on TableC.E and one index on TableC.F, and that's all they need. In fact, if you have both fields in the query, you need an index on both fields.
So, for 100+ combinations, "all of the necessary indexes" could require 100+ indexes.
Now an index on TableC.E, TableC.F could be use in a query with a TableC.E condition and no TableC.F condition, but could not be use when there is a TableC.F condition and no TableC.E condition.
Hundreds of indexes? What am I going to do?
In practice it's not that bad. Let's say you have N optional conditions which are either in the where clause or not. The number of combinations is 2 to the nth, or for hundreds of combinations, N is log2 of the number of combinations, which is between 6 and 10. Also, those log2 conditions are spread across three tables. Some databases support multiple table indexes, but I'm not sure DB2 does, so I'd stick with single table indexes.
So, what I am saying is, say for the TableC.E, and TableC.F example, it's not enough to have just the following indexes:
TableB ON C
TableC ON D
TableC ON E
TableC ON F
For one thing, the optimizer has to pick among which one of the last three indexes to use. Better would be to include the D field in the last two indexes, which gives us
TableB ON C
TableC ON D, E
TableC ON D, F
Here, if neither field E nor F is in the query, it can still index on D, but if either one is in the query, it can index on both D and one other field.
Now suppose you have an index for 10 fields which may or may not be in the query. Why ever have just one field in the index? Why not add other fields in descending order of likelihood of being in the query?
Consider that when planning your indexes.
I found out "IN" predicate is good for small subqueries and "EXISTS" for large subqueries.
Try to execute query with "EXISTS" predicate for large ones.
SELECT A, B
FROM TableA
WHERE EXISTS (
Select C
FROM TableB
WHERE TableB.C = TableA.A)

order of tables in FROM clause

For an sql query like this.
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C
ON b.propertyB = c.propertyB
Does the sequence of the tables matter. It wont matter in results, but do they affect the performance?
One can assume that the data in table C is much larger that a or b.
For each sql statement, the engine will create a query plan. So no matter how you put them, the engine will chose a correct path to build the query.
More on plans you have http://en.wikipedia.org/wiki/Query_plan
There are ways, considering what RDBMS you are using to enforce the query order and plan, using hints, however, if you feel that the engine does no chose the correct path.
Sometimes Order of table creates a difference here,(when you are using different joins)
Actually our Joins working on Cross Product Concept
If you are using query like this A join B join C
It will be treated like this (A*B)*C)
Means first result comes after joining A and B table then it will make join with C table
So if after inner joining A (100 record) and B (200 record) if it will give (100 record)
And then these ( 100 record ) will compare with (1000 record of C)
No.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.

Optimize Joins (more than 2 tables, where filter)

Couple of Sybase database query questions:
If I do an join and have a where clause, would the filter be applied prior to the actual join? In other words, is it faster than join without any where conditions?
I have an example involving 3 tables (with columns listed below):
A: O1,....
B: E1,E2,...
C: O1, E2, E2
So my join looks like:
select A.*, B* from B,C,A
where
C.E1=B.E1 and C.E2=B.E2 and C.O1=A.O1
and A.O2 in (...)
and B.E3 in (...)
Would my joins be any significantly faster if I eliminated C and added O1 to table B instead?
B:E1,E2,O1....
First, you should use proper join syntax:
select A.*, B.*
from B join C
on C.E1=B.E1 and C.E2=B.E2 join
A
on C.O1=A.O1
where B.E3 in (...)
The "," means cross join and it is prone to problems since it is easily missed. Your query becomes much different if you say "from B C, A". Also, it gives you access to outer joins, and makes the statement much more readable.
The answer to your question is "yes". Your query will run faster if there are fewer tables being joined. If you are doing a join on a primary key, and the tables fit into memory, then the join is not expensive. However, it is still faster to just find the data in the record in B.
As I say this, there could be some boundary cases in some databases where this is not necessarily true. For instance, if there is only one value in the column, and the column is a long string, then adding the column onto pages could increase the number of pages needed for B. Such extreme cases are unlikely, and you should see performance improvements.
Speed will generally depends on number or rows the SQL Server has to read.
I don't think it makes any difference using a where clause or a join
Depends .. on how many rows are in the eliminated table
It can depend on the order you add the joins or where clauses in, e.g. if there are only a few rows in C, and you add that first as a table or where, it immediately cuts down on the number of matches that are possible in. If, however there are millions of rows in C, then you have to read millions to find the matches.
Modern optimizers can rearrange your query to be more efficient but dont rely on it.
What can really cut down the number of rows read is adding indexes to the join columns - if you have an index on A.O1 AND on C.O1 then it can cut down massivley on the number of reads.

Is there an alternative to joining 3 or more tables?

Is it a good idea to join three or more tables together as in the following example. I'm trying to focus on performance. Is there any way to re-write this query that would be more efficient and faster performing? I've tried to make is as simplistic as possible.
select * from a
join b on a.id = b.id
join c on a.id = c.id
join d on c.id = d.id
where a.property1 = 50
and b.property2 = 4
and c.property3 = 9
and d.property4 = 'square'
If you want faster performance, make sure that all of the join's are covered by an index (either clustered or non-clustered). It looks like this could all be done in your query above by creating an index on the id and appropriate property columns of each table
You could make it faster if you only selected a subset of the columns, at the moment you're selecting everything from all 3 tables.
Performance wise, I think it really depends on the number of records in each table, and making sure that you have the proper indexes defined. (I'm also assuming that SELECT * is a placeholder; you should avoid wildcards)
I'd start off by checking out your execution plan, and start optimizing there. If you're still getting suboptimal performance, you could try using temp tables to break up the 4 table join into separate smaller joins.
Assuming a normalized database, this is the best you can do, in terms of structuring a query and the joins in place.
There are other options to look at, including adding indexes on the different join and select clause columns, denormalizing the table structures and narrowing the result set.
Adding indexes on the join columns (which appear to be primary keys, so may already be indexed) will help with the join performance, indexing the columns in the select clause will help with speeding up the filtering on each table.
If you denormalize, you get a structure with duplicate data with all the implications of duplicate data (data maintenance issues mostly), but you gain performance as you no longer need to join.
When selecting columns, you should specify which ones you want - using * is generally a bad idea. This way you only transfer the data that the application really needs.