SQL Server Logical Query Processing - how does it manage the huge datasets?

I'm doing some reading on SQL Server performance:
http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139/ref=sr_1_6?ie=UTF8&s=books&qid=1267032068&sr=8-6
One of the surprising things I came across was how it processes the "FROM" phase in its Logical Processing. From what I understand, SQL Server will do the following:
1) For the first two tables, it will create a virtual table (VT1) consisting of a Cartesian join of the two tables
2) For every additional table, it will create a Cartesian join of VT1 and the additional table, with the result becoming VT1
I'm sure there is a lot more to it under the covers, but at face value, this seems like it would involve a huge amount of processing/memory if you're dealing with big tables (and big queries).
I was just wondering whether anyone had a quick explanation of how SQL Server is able to do this in any sort of realistic time/space frame?

The Cartesian join is just a description of the result, not an actual result. After the full Cartesian join of tables A, B, C...X, the filter operators are applied (still as a definition): things like the ON clauses of the joins and the WHERE clause of the query. In the end, this definition is in turn transformed into an execution plan, which will contain physical operators like Nested Loops, Hash Join or Merge Join, and these operators, when iterated, will produce the results requested in the query definition.
So the big 100x100x100x100... Cartesian cube is never materialized; it is just a definition.
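To make the difference between the logical definition and the physical plan concrete, here is a minimal Python sketch. It is not how SQL Server is implemented; the toy tables and function names are invented for illustration. The first function reads the logical definition literally (Cartesian product, then filter), the second produces the same rows through a lookup table, roughly the idea behind a hash join, without ever forming the product.

```python
from itertools import product

# Hypothetical toy tables: customers are (id, name), orders are (order_id, customer_id).
customers = [(i, f"customer{i}") for i in range(100)]
orders = [(i, i % 100) for i in range(1000)]

def logical_join(left, right):
    """Literal reading of the logical definition: Cartesian product, then filter."""
    examined = 0
    rows = []
    for c, o in product(left, right):
        examined += 1
        if c[0] == o[1]:  # the ON predicate
            rows.append((c[1], o[0]))
    return rows, examined

def hash_join(left, right):
    """Same rows via a lookup table; each input is read exactly once."""
    examined = 0
    by_id = {}
    for c in left:
        examined += 1
        by_id[c[0]] = c
    rows = []
    for o in right:
        examined += 1
        c = by_id.get(o[1])
        if c is not None:
            rows.append((c[1], o[0]))
    return rows, examined

logical_rows, logical_work = logical_join(customers, orders)
physical_rows, physical_work = hash_join(customers, orders)

print(sorted(logical_rows) == sorted(physical_rows))
print(logical_work, physical_work)
```

Both functions return the same rows; only the amount of work differs, which is the point of treating the Cartesian product as a definition rather than a step.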

If you are really interested in how SQL Server does what it does, please read this book:
http://www.amazon.com/Microsoft-SQL-Server-2008-Internals/dp/0735626243/ref=sr_1_1?ie=UTF8&s=books&qid=1267033666&sr=8-1

In reality, the optimiser will look at the whole query: estimated rows, statistics, constraints, etc.
Logically, though, processing follows the order mentioned.
Contrived example:
SELECT
BT.col1, LT.col2
FROM
BigTable BT
JOIN
LittleTable LT ON BT.FKCol = LT.PKCol
WHERE
LT.PKCol = 2
ORDER BY
BT.col1
The Cartesian product of BT and LT could run to hundreds of millions of rows.
But the optimiser:
knows PKCol is unique so it expects only one row
can use statistics to estimate the number of rows from BT
looks for indexes (e.g. a covering index on BT for FKCol INCLUDE col1)
will probably apply the WHERE first
will look ahead for an ORDER BY or GROUP BY for example to see if it can save some spooling (resorting)
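One way to watch an optimiser make choices like these is to ask it for its plan. The sketch below uses SQLite from Python's standard library as a stand-in, so the output format and the planner itself differ from SQL Server. The table name LittleTable and the index name are assumptions, and because SQLite has no INCLUDE clause, a composite index on (FKCol, col1) plays the role of the covering index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LittleTable (PKCol INTEGER PRIMARY KEY, col2 TEXT);
    CREATE TABLE BigTable (id INTEGER PRIMARY KEY, FKCol INTEGER, col1 TEXT);
    -- Stand-in for a covering index on FKCol that also carries col1.
    CREATE INDEX ix_bt_fk_col1 ON BigTable (FKCol, col1);
""")

query = """
    SELECT BT.col1, LT.col2
    FROM BigTable BT
    JOIN LittleTable LT ON BT.FKCol = LT.PKCol
    WHERE LT.PKCol = 2
    ORDER BY BT.col1
"""

# The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan:
    print(step)
```

The printed steps should show keyed lookups (SEARCH) rather than a product of the two tables. In SQL Server the equivalent information comes from the estimated or actual execution plan.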

I don't know the resource you are reading, but what you describe is the behavior of:
SELECT ... FROM tableA, tableB, tableC, ....
This uses a cartesian join (also called a cross join) and is very expensive. With large enough datasets SQL Server (or any RDBMS) can't do this in any sort of realistic time/space frame.
Using an ON clause and specifying the JOIN type performs vastly better:
SELECT ... FROM tableA JOIN tableB on tableB.a_id = tableA.a_id
In real applications cross joins should be rare or at least limited to very small datasets. For many applications it's not uncommon to never have a cross join.
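A quick check of the row counts involved, again with SQLite standing in for the server and with made-up table sizes: an unconstrained cross join returns every pairing, while the join with an ON clause returns only the matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (a_id INTEGER PRIMARY KEY);
    CREATE TABLE tableB (b_id INTEGER PRIMARY KEY, a_id INTEGER);
""")
conn.executemany("INSERT INTO tableA VALUES (?)", [(i,) for i in range(50)])
conn.executemany("INSERT INTO tableB VALUES (?, ?)",
                 [(i, i % 50) for i in range(200)])

# No join condition at all: every row of tableA paired with every row of tableB.
cross_count = conn.execute("SELECT COUNT(*) FROM tableA, tableB").fetchone()[0]

# With a join condition: only the matching pairs.
join_count = conn.execute(
    "SELECT COUNT(*) FROM tableA JOIN tableB ON tableB.a_id = tableA.a_id"
).fetchone()[0]

print(cross_count, join_count)
```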

Related

SQL reduce data in join or where

I want to know what is faster, assuming I have the following queries and they retrieve the same data
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata also has a visual explain, which is simple to learn from.
Whether either method is advantageous depends on the data volumes and key type in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table build with ALL fields of tableB.
It should be (not that I recommend this query style at all):
select * from tableA inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue; then use the explain commands to work out why.
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
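As a small sanity check that the three ways of writing the filter describe the same result, the following sketch runs them side by side. It uses SQLite from Python rather than Teradata, and the data and the payload column are invented for illustration. It only shows that the results match; for performance you would still compare the explain output on your own system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, columnX INTEGER);
""")
conn.executemany("INSERT INTO tableA VALUES (?, ?)",
                 [(i, f"a{i}") for i in range(100)])
conn.executemany("INSERT INTO tableB VALUES (?, ?)",
                 [(i, i % 10) for i in range(100)])

filter_in_where = """
    SELECT a.id, a.payload FROM tableA a
    INNER JOIN tableB b ON a.id = b.id
    WHERE b.columnX = 3
"""
filter_in_subquery = """
    SELECT a.id, a.payload FROM tableA a
    INNER JOIN (SELECT id FROM tableB WHERE columnX = 3) b ON a.id = b.id
"""
filter_in_cte = """
    WITH filtered_b AS (SELECT id FROM tableB WHERE columnX = 3)
    SELECT a.id, a.payload FROM tableA a
    INNER JOIN filtered_b b ON a.id = b.id
"""

# Run all three formulations and compare their (sorted) result sets.
results = [sorted(conn.execute(q).fetchall())
           for q in (filter_in_where, filter_in_subquery, filter_in_cte)]
print(results[0] == results[1] == results[2], len(results[0]))
```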
Whenever you are unsure which query will return results faster in Teradata, use EXPLAIN, which shows how the PE is going to retrieve the records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path used to resolve the query; you can't decide it yourself. You can, however, do things like declaring indexes, which the DBMS takes into consideration when choosing the access path, and that can give you better performance.
For instance, in this example you are filtering tableB by b.columnX. If there are no indexes declared on tableB, the DBMS will normally have to do a full table scan to determine which rows fulfill that condition. If you declare an index on tableB by columnX, the DBMS will probably consider that index and choose an access path that uses it, giving much better performance than a full table scan, especially if the table is big.
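The effect of declaring an index on the access path can be seen by asking for the plan before and after creating it. This is a sketch using SQLite from Python, not Teradata, so the wording of the plan differs, and the index name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableB (id INTEGER PRIMARY KEY, columnX INTEGER)")

def plan_for(sql):
    """Return the readable steps of SQLite's plan for the given statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM tableB WHERE columnX = 42"

plan_without_index = plan_for(query)   # expected: a full scan of tableB
conn.execute("CREATE INDEX ix_tableB_columnX ON tableB (columnX)")
plan_with_index = plan_for(query)      # expected: a search using the new index

print(plan_without_index)
print(plan_with_index)
```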

Is it true that switching the position of the tables in a join query increases data load speed?

For an example:
In table a we have 1000000 rows
In table b we have 5 rows
Is it faster to use
select * from b inner join a on b.id = a.id
than
select * from a inner join b on a.id = b.id
No, JOIN order doesn't matter; the query engine will reorganize the joins based on index statistics and other factors. The join order is changed during optimization.
You can test it yourself: download a sample database such as AdventureWorks or Northwind, or try it on your own database, and do the following:
enable the actual execution plan and run the first query
change the JOIN order and run the query again
compare the execution plans
They should be identical, as the query engine will reorganize the joins according to other factors.
The only caveat is the FORCE ORDER query hint, OPTION (FORCE ORDER), which forces the joins to happen in exactly the order you specified.
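The same experiment can be scripted. The sketch below uses SQLite from Python as a stand-in for SQL Server, with a smaller table a than in the question and an invented val column. It confirms that both orderings return the same rows and prints both plans so they can be compared by eye; it does not claim the plans are always identical on every engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, val TEXT);
""")
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(i, f"a{i}") for i in range(10000)])
conn.executemany("INSERT INTO b VALUES (?, ?)",
                 [(i, f"b{i}") for i in range(5)])
conn.execute("ANALYZE")  # gather statistics for the planner

b_first = "SELECT a.id, a.val, b.val FROM b INNER JOIN a ON b.id = a.id"
a_first = "SELECT a.id, a.val, b.val FROM a INNER JOIN b ON a.id = b.id"

rows_b_first = sorted(conn.execute(b_first).fetchall())
rows_a_first = sorted(conn.execute(a_first).fetchall())

plan_b_first = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + b_first)]
plan_a_first = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + a_first)]

print(rows_b_first == rows_a_first)
print(plan_b_first)
print(plan_a_first)
```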
It is unlikely. There are lots of factors on the speed of joining two tables. That is why database engines have an optimization phase, where they consider different ways of implementing the query.
There are many different options:
Nested loops, scanning b first and then a.
Nested loops, scanning a first and then b.
Sorting both tables and using a merge join.
Hashing both tables and using a hash join.
Using an index on b.id.
Using an index on a.id.
And these are just high level descriptions -- there are multiple ways to implement some of these methods. Tables can also be partitioned adding further complexity.
Join order is just one consideration.
In this case, the performance of the query is likely to depend on the size of the data being returned, rather than on the actual algorithm used for fetching the data.
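For readers who have not seen these operators, here is a simplified Python sketch of nested loops, a hash join and a merge join over in-memory lists. It only illustrates the ideas; real engines work on pages, indexes and statistics, and the merge join here assumes unique keys. The toy tables are made up.

```python
from collections import defaultdict

# Hypothetical toy inputs: lists of (id, payload) tuples.
a = [(i, f"a{i}") for i in range(1000)]
b = [(i, f"b{i}") for i in (3, 500, 999, 2000)]

def nested_loops(outer, inner):
    """For every outer row, scan the whole inner input."""
    return [(o, i) for o in outer for i in inner if o[0] == i[0]]

def hash_join(build, probe):
    """Hash one input on the join key, then probe it with the other."""
    buckets = defaultdict(list)
    for row in build:
        buckets[row[0]].append(row)
    return [(match, p) for p in probe for match in buckets.get(p[0], [])]

def merge_join(left, right):
    """Sort both inputs on the key and advance two cursors in step.
    Simplified: assumes join keys are unique within each input."""
    left, right = sorted(left), sorted(right)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            out.append((left[i], right[j]))
            i += 1
            j += 1
    return out

# Normalise every result to (a_row, b_row) pairs so they can be compared.
results = [
    sorted(nested_loops(a, b)),                     # scanning a first
    sorted((x, y) for y, x in nested_loops(b, a)),  # scanning b first
    sorted((x, y) for y, x in hash_join(b, a)),     # build on the small input
    sorted(merge_join(a, b)),
]
print(all(r == results[0] for r in results), len(results[0]))
```

All four strategies return the same rows; they differ only in how much work they do for a given data shape, which is what the optimizer weighs.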

Inner join vs separate statements

I am trying to get some more context on this out of curiosity. So far, when I run 2 separate SQL statements, SQL Profiler shows no CPU cycles, fewer reads and a shorter duration than when I combine them with an inner join. Is this typical? I am looking for help to understand this better.
Simple example:
SELECT * FROM dbo.ChargeCode
SELECT * FROM dbo.ChargeCodeGroup
vs
SELECT *
FROM dbo.ChargeCode c
INNER JOIN dbo.ChargeCodeGroup cc ON c.ChargeCodeGroupID = cc.ChargeCodeGroupID
My guess is that the inner join costs extra CPU cycles because it's doing a nested loop. Am I on the right track with this?
The simple answer is that you're doing two different things here. In your first example you're retrieving 2 separate entities. In your second example, you're asking the RDBMS to combine (join) 2 entities into a single result set.
A join is one of the most powerful capabilities of an RDBMS - and it will (usually) do it as efficiently as it possibly can - but that's not to say it's free or cheap.
SELECT * FROM sometable
must scan the whole table.
If there is an index on the ChargeCodeGroupID column of either table, the INNER JOIN can be much faster because it only needs to scan the index. (Judging by the names, I guess there is one.) Of course, if there is no index on either ChargeCodeGroupID column, the second query is slower than the first one.

order of tables in FROM clause

For an sql query like this.
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't change the results, but does it affect performance?
One can assume that table C holds much more data than a or b.
For each SQL statement, the engine will create a query plan. So no matter how you order the tables, the engine will choose an appropriate path to execute the query.
For more on query plans, see http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the join order and plan using hints, if you feel that the engine does not choose the correct path.
Sometimes the order of the tables does make a difference (when you are using different join types).
Joins work on the cross-product concept.
If you write a query like A join B join C,
it will be treated as ((A*B)*C).
That is, A and B are joined first, and that result is then joined with C.
So if the inner join of A (100 records) and B (200 records) produces 100 records,
those 100 records are then compared with the 1000 records of C.
No.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.
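To give a feel for how quickly the search space grows, this small sketch counts only the simplest case: the number of ways to arrange n tables in a left-deep chain, which is n!. Real optimizers also consider other plan shapes and prune aggressively, so treat the numbers as an illustration, not as what any particular optimizer evaluates.

```python
from math import factorial

def left_deep_join_orders(n_tables):
    """Number of ways to order n tables in a left-deep join chain: n!."""
    return factorial(n_tables)

for n in (2, 3, 5, 10, 15):
    print(n, left_deep_join_orders(n))
```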

Optimize Joins (more than 2 tables, where filter)

Couple of Sybase database query questions:
If I do a join and have a where clause, would the filter be applied prior to the actual join? In other words, is it faster than a join without any where conditions?
I have an example involving 3 tables (with columns listed below):
A: O1,....
B: E1,E2,...
C: O1, E1, E2
So my join looks like:
select A.*, B.* from B,C,A
where
C.E1=B.E1 and C.E2=B.E2 and C.O1=A.O1
and A.O2 in (...)
and B.E3 in (...)
Would my joins be any significantly faster if I eliminated C and added O1 to table B instead?
B:E1,E2,O1....
First, you should use proper join syntax:
select A.*, B.*
from B
join C on C.E1=B.E1 and C.E2=B.E2
join A on C.O1=A.O1
where A.O2 in (...)
and B.E3 in (...)
The "," means cross join, and it is prone to problems since it is easily missed. Your query becomes much different if you say "from B C, A". Explicit join syntax also gives you access to outer joins and makes the statement much more readable.
The answer to your question is "yes". Your query will run faster if there are fewer tables being joined. If you are doing a join on a primary key, and the tables fit into memory, then the join is not expensive. However, it is still faster to just find the data in the record in B.
As I say this, there could be some boundary cases in some databases where this is not necessarily true. For instance, if there is only one value in the column, and the column is a long string, then adding the column onto pages could increase the number of pages needed for B. Such extreme cases are unlikely, and you should see performance improvements.
Speed will generally depend on the number of rows the SQL server has to read.
I don't think it makes any difference whether you use a where clause or a join.
It depends on how many rows are in the eliminated table.
It can depend on the order in which you add the joins or where clauses. E.g. if there are only a few rows in C, and you add that first as a table or where condition, it immediately cuts down on the number of possible matches. If, however, there are millions of rows in C, then you have to read millions to find the matches.
Modern optimizers can rearrange your query to be more efficient, but don't rely on it.
What can really cut down the number of rows read is adding indexes to the join columns: if you have an index on A.O1 and on C.O1, it can cut down massively on the number of reads.