Inner join vs separate statements - sql

I am trying to get some more context on this out of curiosity. So far when I run 2 separate sql statements I find in SQL Profiler that I have no CPU cycles, less reads and less duration than taking the script and using Inner join. Is this a typical case, I am looking for help to understand this better.
Simple example:
SELECT * FROM dbo.ChargeCode
SELECT * FROM dbo.ChargeCodeGroup
vs
SELECT *
FROM dbo.ChargeCode c
INNER JOIN dbo.ChargeCodeGroup cc ON c.ChargeCodeGroupID = cc.ChargeCodeGroupID
From what I guess is that inner join cost extra CPU cycles because its doing a nested loop. Am I on the right track with this?

The simple answer is that you're doing two different things here. In your 1st example you're retrieving 2 separate entities. In your second example, your asking the RDBMS to combine (join) 2 entities into a single result set.
A join is one of the most powerful capabilities of an RDBMS - and it will (usually) do it as efficiently as it possibly can - but that's not to say it's free or cheap.

SELECT * FROM sometable
must scan whole table.
If there are indexes on ChargeCodeGroupID column on either table, it will be much faster for INNER JOIN to only scan index. (By their name, I guess there are). Of course, if there is no index on either ChargeCodeGroupID column, second query is slower than the first one.

Related

is it true if we switch the position of table in join query will increase load data speed?

For an example:
In table a we have 1000000 rows
In table b we have 5 rows
It's more faster if we use
select * from b inner join a on b.id = a.id
than
select * from a inner join b on a.id = b.id
No, JOIN order doesn't matter, the query engine will reorganize their order based on statistics for indexes and other stuff. JOIN by order is changed during optimization.
You might test it all by yourself, download some test databases like AdventureWorks or Northwind or try it on your database, you might do this:
select show actual execution plan and run first query
change JOIN order and now run the query again
compare execution plans
They should be identical as the query engine will reorganize them according to other factors.
The only caveat is the Option FORCE ORDER which will force joins to happen in the exact order you have them specified.
It is unlikely. There are lots of factors on the speed of joining two tables. That is why database engines have an optimization phase, where they consider different ways of implementing the query.
There are many different options:
Nested loops, scanning b first and then a.
Nested loops, scanning a first and then b.
Sorting both tables and using a merge join.
Hashing both tables and using a hash join.
Using an index on b.id.
Using an index on a.id.
And these are just high level descriptions -- there are multiple ways to implement some of these methods. Tables can also be partitioned adding further complexity.
Join order is just one consideration.
In this case, the result of the query is likely to depend on the size of the data being returned, rather than the actual algorithm used for fetching the data.

How to improve the performance of multiple joins

I have a query with multiple joins in it. When I execute the query it takes too long. Can you please suggest me how to improve this query?
ALTER View [dbo].[customReport]
As
SELECT DISTINCT ViewUserInvoicerReport.Owner,
ViewUserAll.ParentID As Account , ViewContact.Company,
Payment.PostingDate, ViewInvoice.Charge, ViewInvoice.Tax,
PaymentProcessLog.InvoiceNumber
FROM
ViewContact
Inner Join ViewUserInvoicerReport on ViewContact.UserID = ViewUserInvoicerReport.UserID
Inner Join ViewUserAll on ViewUserInvoicerReport.UserID = ViewUserAll.UserID
Inner Join Payment on Payment.UserID = ViewUserAll.UserID
Inner Join ViewInvoice on Payment.UserID = ViewInvoice.UserID
Inner Join PaymentProcessLog on ViewInvoice.UserID = PaymentProcessLog.UserID
GO
Work on removing the distinct.
THat is not a join issue. The problem is that ALL rows have to go into a temp table to find out which are double - if you analyze the query plan (programmers 101 - learn to use that fast) you will see that the join likely is not the big problem but the distinct is.
And IIRC that distinct is USELESS because all rows are unique anyway... not 100% sure, but the field list seems to indicate.
Use distincts VERY rarely please ;)
You should see the Query Execution Plan and optimize the query section by section.
The overall optimization process consists of two main steps:
Isolate long-running queries.
Identify the cause of long-running queries.
See - How To: Optimize SQL Queries for step by step instructions.
and
It's difficult to say how to improve the performance of a query without knowing things like how many rows of data are in each table, which columns are indexed, what performance you're looking for and which database you're using.
Most important:
1. Make sure that all columns used in joins are indexed
2. Make sure that the query execution plan indicates that you are using the indexes you expect

order of tables in FROM clause

For an sql query like this.
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C
ON b.propertyB = c.propertyB
Does the sequence of the tables matter. It wont matter in results, but do they affect the performance?
One can assume that the data in table C is much larger that a or b.
For each sql statement, the engine will create a query plan. So no matter how you put them, the engine will chose a correct path to build the query.
More on plans you have http://en.wikipedia.org/wiki/Query_plan
There are ways, considering what RDBMS you are using to enforce the query order and plan, using hints, however, if you feel that the engine does no chose the correct path.
Sometimes Order of table creates a difference here,(when you are using different joins)
Actually our Joins working on Cross Product Concept
If you are using query like this A join B join C
It will be treated like this (A*B)*C)
Means first result comes after joining A and B table then it will make join with C table
So if after inner joining A (100 record) and B (200 record) if it will give (100 record)
And then these ( 100 record ) will compare with (1000 record of C)
No.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.

SQL Server Logical Query Processing - how does it manage the huge datasets?

I'm doing some reading on SQL Server performance:
http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139/ref=sr_1_6?ie=UTF8&s=books&qid=1267032068&sr=8-6
One of the surprising things I came across was how it processes the "FROM" phase in its Logical Processing. From what I understand, SQL Server will do the following:
1) For the first two tables, it will create a virtual table (VT1) consisting of a Cartesian join of the two tables
2) For every additional table, it will create a Cartesian join of VT1 and the additional table, with the result becoming VT1
I'm sure there is alot more to it under the covers, but at face value, this seems like it would involve a huge amount of processing/memory if you're dealing with big tables (and big queries).
I was just wondering whether anyone had a quick explanation of how SQL Server is able to do this in any sort of realistic time/space frame?
The carthesian join is just a description of the result, not an actual result. After the full carthesian join of tables A, B, C...X, the filter operators are applied (still as a definition), things like ON clauses of the join and WHERE clauses of the query. In the end this definition is in turn transformed into an execution plan, which will contain physicall operators like Nested Loops or Hash Join or Merge Join, and this operators, when iterated, will produce the results as requested in the query definition.
So the big 100x100x100x100... carthesian cube is never materialized, is just a definition.
If you are really interested in how SQL Server does what it does, please read this book:
http://www.amazon.com/Microsoft-SQL-Server-2008-Internals/dp/0735626243/ref=sr_1_1?ie=UTF8&s=books&qid=1267033666&sr=8-1
In reality the optimiser will look at the whole query, estimated rows, statistics, constraints etc
Logically, it is in the order mentioned though
Contrived example:
SELECT
BT.col1, LT.col2
FROm
BigTable BT
JOIN
LT.Table LT ON BT.FKCol = LT.PKCol
WHERE
LT.PKCol = 2
ORDER BY
BT.col1
The cartesian of BT and LT could be 100s of millions.
But the optimiser:
knows PKCol is unique so it expects only one row
can use statistics to estimate the number of rows from BT
looks for indexes (eg covering index on BT for FKCol INLCUDE col1)
will probably apply the WHERE first
will look ahead for an ORDER BY or GROUP BY for example to see if it can save some spooling (resorting)
I don't know the resource you are reading, but what you describe is the behavior of:
SELECT ... FROM tableA, tableB, tableC, ....
This uses a cartesian join (also called a cross join) and is very expensive. With large enough datasets SQL Server (or any RDBMS) can't do this in any sort of realistic time/space frame.
Using an ON clause and specifying the JOIN type performs vastly better:
SELECT ... FROM tableA JOIN tableB on tableB.a_id = tableA.a_id
In real applications cross joins should be rare or at least limited to very small datasets. For many applications it's not uncommon to never have a cross join.

simple sql query

which one is faster
select * from parents p
inner join children c on p.id = c.pid
where p.x = 2
OR
select * from
(select * from parents where p.x = 2)
p
inner join children c on p.id = c.pid
where p.x = 2
In MySQL, the first one is faster:
SELECT *
FROM parents p
INNER JOIN
children c
ON c.pid = p.id
WHERE p.x = 2
, since using an inline view implies generating and passing the records twice.
In other engines, they are usually optimized to use one execution plan.
MySQL is not very good in parallelizing and pipelining the result streams.
Like this query:
SELECT *
FROM mytable
LIMIT 1
is instant, while this one (which is semantically identical):
SELECT *
FROM (
SELECT *
FROM mytable
)
LIMIT 1
will first select all values from mytable, buffer them somewhere and then fetch the first record.
For Oracle, SQL Server and PostgreSQL, the queries above (and both of your queries) will most probably yield the same execution plans.
I know this is a simple case, but your first option is much more readable than the second one. As long as the two query plans are comparable I'd always opt for the more maintainable SQL code which your first example is for me.
It depends on how good the database is at optimising the query.
If the database manages to optimise the second one into the first one, they are equally fast, otherwise the first one is faster.
The first one gives more freedom for the database to optimise the query. The second one suggests a specific order of doing things. Either the database is able to see past this and optimise it into a single query, or it will run the query as two separate queries with the subquery as an intermediate result.
A database like SQL Server keeps statistics on what the database tables contain, which it uses to determine how to execute the query in the most efficient way. For example, depending on what will elliminate most records it can either start with joining the tables or filtering the parents table on the condition. If you write a query that forces a specific order, that might not be the most efficient order.
I'd think the first. I'm not sure if the optimizer would use any indexes on the the derived table in the second query, or if it would copy out all the rows that match into memory before joining back to the children.
This is why you have DBAs. It depends entirely on the DBMS, and how your tables and indexes are configured, as to which one runs the fastest.
Database tuning is not a set-and-forget operation, it should be done regularly, as the data changes, to ensure your database runs at peak performance. The question is not really meaningful without specifying:
which DBMS you are asking about.
what indexes you have on the tables.
a host of other possible configuration items (which may also depend on the DBMS, such as clustering).
You should run both those queries through the query optimizer to see which one is fastest, then start using that one. That's assuming the difference in noticeable in the first place. If the difference is minimal, go for the easiest to read/maintain.
For me, in the second query you are saying, I don't trust the optimizer to optimize this query so I'll provide some 'hints'.
I'd say, trust the optimizer until it let's you down and only then consider trying to do the optimizer's job for it.