SQL Cross Join better in performance than normal join?

I'm currently working with SQL and wondered about cross join.
Assuming I have the following relations:
customer(customerid, firstname, lastname)
transact(customerid, productid, date, quantity)
product(productid, description)
This query is written in Oracle SQL. It should select the last name of all customers who bought more than 1000 units of a product (rather senseless, but no matter):
SELECT c.lastname, t.date
FROM customer c, transact t
WHERE t.quantity > 1000
AND t.customerid = c.customerid
Isn't this doing a cross join?! Isn't this extremely slow when the tables consist of a huge amount of data?
Isn't it better to do something like this:
SELECT c.lastname, t.date
FROM customer c
JOIN transact t ON(c.customerid = t.customerid)
WHERE t.quantity > 1000
Which is better in performance? And how are these queries handled internally?
Thanks for your help,
Barbara

Without the join condition (t.customerid = c.customerid), the two queries aren't equivalent, because:
SELECT lastname, date
FROM customer, transact
WHERE quantity > 1000
doesn't actually limit the result to customers who bought more than 1000. It simply takes every combination of rows from the two tables and excludes those with a quantity less than or equal to 1000, so every customer is returned as long as at least one transaction exceeds 1000.
This query is equivalent to your JOIN version:
SELECT lastname, date
FROM customer c, transact t
WHERE quantity > 1000
AND c.customerid = t.customerid
The explicit JOIN version is preferred because it is the current standard syntax, but both should produce the same execution plan and perform identically. The explicit JOIN version is easier to read in my opinion, and the fact that the comma-separated (implicit) style has been superseded since SQL-92 introduced explicit JOIN syntax should be reason enough to avoid it.
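A minimal sketch of the difference, using Python's built-in sqlite3 module instead of Oracle so that it is self-contained. The sample rows are invented, and the date column is renamed tx_date here as an assumption to sidestep reserved-word issues:

```python
import sqlite3

# In-memory database with invented sample rows (not from the question).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE transact (customerid INTEGER, productid INTEGER, tx_date TEXT, quantity INTEGER);
    INSERT INTO customer VALUES (1, 'Ann', 'Adams'), (2, 'Bob', 'Brown'), (3, 'Cy', 'Clark');
    INSERT INTO transact VALUES
        (1, 10, '2020-01-01', 1500),
        (2, 10, '2020-01-02', 5),
        (1, 11, '2020-01-03', 2000);
""")

# Comma syntax WITHOUT a join condition: a filtered cross product.
cross = conn.execute("""
    SELECT c.lastname, t.tx_date
    FROM customer c, transact t
    WHERE t.quantity > 1000
""").fetchall()

# Comma syntax WITH the join condition in the WHERE clause.
implicit = conn.execute("""
    SELECT c.lastname, t.tx_date
    FROM customer c, transact t
    WHERE t.quantity > 1000
      AND t.customerid = c.customerid
""").fetchall()

# Explicit JOIN syntax.
explicit = conn.execute("""
    SELECT c.lastname, t.tx_date
    FROM customer c
    JOIN transact t ON c.customerid = t.customerid
    WHERE t.quantity > 1000
""").fetchall()

print("cross product rows:", len(cross))
print("implicit join:", sorted(implicit))
print("explicit join:", sorted(explicit))
```

With this data the cross product pairs every customer with both large transactions, while the two join forms return only the rows belonging to the customer who actually made them.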

This is too long for a comment.
If you want to know how they are handled then look at the query plan.
In your case, the queries are not the same. The first does a cross join with conditions on only one table. The second does a legitimate join. The second is the right way to write the query.
However, even if you included the correct where clause in the first query, then the performance should be the same. Oracle is smart enough to recognize that the two queries do the same thing (if written correctly).
Simple rule: never use commas in the from clause. Always use explicit join syntax.
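As a rough illustration of "look at the query plan", the sketch below asks SQLite (standing in for Oracle, purely as an assumption to keep the example runnable) for the plan of both forms. The helper name plan is hypothetical. In Oracle the comparable step would be EXPLAIN PLAN FOR followed by a query against DBMS_XPLAN.DISPLAY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerid INTEGER PRIMARY KEY, lastname TEXT);
    CREATE TABLE transact (customerid INTEGER, quantity INTEGER);
""")

def plan(sql):
    """Return the human-readable 'detail' column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

implicit_plan = plan("""
    SELECT c.lastname
    FROM customer c, transact t
    WHERE t.quantity > 1000
      AND t.customerid = c.customerid
""")

explicit_plan = plan("""
    SELECT c.lastname
    FROM customer c
    JOIN transact t ON c.customerid = t.customerid
    WHERE t.quantity > 1000
""")

# Print both plans side by side so they can be compared by eye.
for step in implicit_plan:
    print("implicit:", step)
for step in explicit_plan:
    print("explicit:", step)
```

The exact plan text depends on the engine and version, so it is worth comparing the two outputs on the real system rather than assuming they match.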

Related

Joins and Subqueries

I am aware that correlated subqueries use a "where" clause and not joins.
But if a "where" clause and an inner join can produce the same outcome, why can't we write these queries with joins?
For example,
SELECT FirstName, LastName, (SELECT COUNT(O.Id) FROM [Order] O WHERE O.CustomerId = C.Id) As OrderCount
FROM Customer C
Now, why can't we write down this like below?
SELECT FirstName, LastName, (SELECT COUNT(O.Id) FROM [Order] O Inner Join
C On O.CustomerId = C.Id) As OrderCount
FROM Customer C
I know SQL fairly well and have worked with it quite a bit, but I am looking for a clear technical explanation.
Thanks.
This is your query:
SELECT
FirstName,
LastName,
(
SELECT COUNT(O.Id)
FROM [Order] O
INNER JOIN C On O.CustomerId = C.Id
) AS OrderCount
FROM Customer C;
It is invalid, because in the sub query you are selecting from C.
This is a bit complicated to explain. In a query, we deal with tables and table rows. E.g.:
select person.name from person;
FROM person means "from the table person". person.name means "a person's name", so it is referring to a row. It would be great if we could write:
select person.name from persons;
but SQL doesn't know about singular and plural in your language, so this is not possible.
In your query FROM Customer C means "from the customer table, which I'm going to call C for short". But in the rest of the query including the sub query it is one customer row the C refers to. So you cannot say INNER JOIN C, because you can only join to a table, not a table row.
One might try to make this clear by using plural names for tables and singular names as table aliases. If you made this a habit, you'd have FROM Customers Customer in your main query and INNER JOIN Customer in your inner query, and you'd notice from habit that you cannot have a singular in the FROM clause. But one quickly gets accustomed to the double meaning (row and table) of a table name in a query, so this would be over-defensive; we'd rather use aliases to make queries shorter and more readable, just as you do when abbreviating customer to c.
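The row-versus-table distinction can be seen in a small sketch. It uses SQLite through Python's sqlite3 module rather than SQL Server (an assumption for the sake of a runnable example), with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE "Order" (Id INTEGER PRIMARY KEY, CustomerId INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann', 'Adams'), (2, 'Bob', 'Brown');
    INSERT INTO "Order" VALUES (100, 1), (101, 1);
""")

# Correlated subquery: C.Id refers to the current outer row.
correlated = conn.execute("""
    SELECT FirstName, LastName,
           (SELECT COUNT(O.Id) FROM "Order" O WHERE O.CustomerId = C.Id) AS OrderCount
    FROM Customer C
    ORDER BY C.Id
""").fetchall()
print(correlated)

# Joining to the alias C inside the subquery: C is not a table there,
# so the statement is rejected when it is prepared.
try:
    conn.execute("""
        SELECT FirstName, LastName,
               (SELECT COUNT(O.Id) FROM "Order" O
                INNER JOIN C ON O.CustomerId = C.Id) AS OrderCount
        FROM Customer C
    """)
    join_error = None
except sqlite3.OperationalError as exc:
    join_error = str(exc)
print("error:", join_error)
```

The first statement counts orders per customer; the second fails because the alias can be referenced as a row inside the subquery but cannot be used as a join target.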
But yes, you can use joins instead of sub queries in the SELECT clause. Either move the sub query to the FROM clause:
SELECT
c.firstname,
c.lastname,
COALESCE(o.ordercount, 0) AS ordercount
FROM customer c
LEFT JOIN
(
SELECT customerid, COUNT(*) AS ordercount
FROM [order]
GROUP BY customerid
) o ON o.customerid = c.id;
Or join without a sub query:
SELECT
c.firstname,
c.lastname,
COUNT(o.customerid) AS ordercount
FROM customer c
LEFT JOIN [order] o ON o.customerid = c.id
GROUP BY c.id, c.firstname, c.lastname;
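A sketch comparing the scalar subquery, the derived-table join, and the plain join with GROUP BY, run on SQLite via Python's sqlite3 (an assumption; the question targets SQL Server). The rows are invented and deliberately include two customers who share a name, to show why grouping on the key column matters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE "order" (id INTEGER PRIMARY KEY, customerid INTEGER);
    INSERT INTO customer VALUES (1, 'Ann', 'Adams'), (2, 'Ann', 'Adams'), (3, 'Bob', 'Brown');
    INSERT INTO "order" VALUES (100, 1), (101, 1), (102, 2);
""")

# 1. Correlated scalar subquery in the SELECT clause.
scalar = conn.execute("""
    SELECT c.id, c.firstname, c.lastname,
           (SELECT COUNT(o.id) FROM "order" o WHERE o.customerid = c.id)
    FROM customer c
    ORDER BY c.id
""").fetchall()

# 2. Pre-aggregated derived table, outer-joined to keep customers without orders.
derived = conn.execute("""
    SELECT c.id, c.firstname, c.lastname, COALESCE(o.ordercount, 0)
    FROM customer c
    LEFT JOIN (SELECT customerid, COUNT(*) AS ordercount
               FROM "order"
               GROUP BY customerid) o ON o.customerid = c.id
    ORDER BY c.id
""").fetchall()

# 3. Plain outer join, grouped on the customer key.
grouped = conn.execute("""
    SELECT c.id, c.firstname, c.lastname, COUNT(o.customerid)
    FROM customer c
    LEFT JOIN "order" o ON o.customerid = c.id
    GROUP BY c.id, c.firstname, c.lastname
    ORDER BY c.id
""").fetchall()

# Grouping by name only merges the two customers called Ann Adams.
by_name = conn.execute("""
    SELECT c.firstname, c.lastname, COUNT(o.customerid)
    FROM customer c
    LEFT JOIN "order" o ON o.customerid = c.id
    GROUP BY c.firstname, c.lastname
    ORDER BY c.firstname
""").fetchall()

print(scalar, derived, grouped, by_name, sep="\n")
```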
The two queries are functionally equivalent. SQL (in the context of queries) is a declarative language, which means it works by DEFINING WHAT you want to achieve, not HOW to achieve it. So, at the abstract algebraic level, there is absolutely no difference between the two queries. (*)
However, SQL does not work in the metaphysical realm of algebra but in the real world, where the declarative query has to be translated into a procedural sequence of operations, and it is much easier for me to decide that the two queries are equivalent than it is for the RDBMS of your choice. Proving that two declarative queries are equivalent can be computationally very hard. This is the job of what is usually called the "query optimizer", which not only has to "understand" the relational algebra but also has to find the probably best way to implement it procedurally. Therefore, depending on the accuracy of the optimizer, the intricacy of your schema and query, and the amount of computational resources the optimizer spends on building and optimizing the execution plan, the actual execution plans for two otherwise equivalent queries can be different. You will still get the same results (as long as you stay in the declarative realm and don't use NOW(), RAND() or other volatile functions), but one plan may be faster and another slower. The order of results may also differ where ORDER BY is missing or ambiguous.
Note: your join can be rewritten this way because it involves an aggregate on a side join. Not all joins can be transposed using subqueries, but there are plenty of situations of other queries that are equivalent although expressed differently. My answer is absolutely generic for any mathematically equivalent queries. See also explanation below.
(*) Query equivalence also depends on the schema. One usual enemy of common sense is NULL values: while a join will filter out rows with NULLs if there is any condition on them, aggregates behave in various other ways: SUM, MAX and MIN ignore NULLs (and return NULL when there are no non-NULL inputs), COUNT(*) counts every row, COUNT(column) and COUNT(DISTINCT column) count only non-NULL values, etc.
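How the common aggregates treat NULL can be checked with a short sketch. It runs on SQLite through Python's sqlite3 (an assumption chosen for portability); the table t and its values are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (v INTEGER);
    INSERT INTO t VALUES (1), (NULL), (3), (3);
""")

# Aggregates over a column that contains one NULL.
mixed = conn.execute("""
    SELECT SUM(v), MIN(v), MAX(v), COUNT(*), COUNT(v), COUNT(DISTINCT v)
    FROM t
""").fetchone()
print(mixed)

# Aggregates when every input value is NULL.
all_null = conn.execute("""
    SELECT SUM(v), COUNT(v)
    FROM t
    WHERE v IS NULL
""").fetchone()
print(all_null)
```

SUM, MIN and MAX skip the NULL, COUNT(*) counts all four rows, and COUNT(v) and COUNT(DISTINCT v) count only the non-NULL values. When every input is NULL, SUM returns NULL while COUNT(v) returns 0.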

SQL Sub Query within a join

Is there a difference between the results of the two sets of code below?
If there isn't, I don't understand why my teachers keep teaching sub queries. When would they be useful in basic SQL commands?
Select soh.Total, c.*
From SalesLT.Customer As c
Inner join (select oh.CustomerID, Sum(oh.TotalDue) As Total
From SalesLT.SalesOrderHeader As oh Group by oh.CustomerID
Having Sum(oh.totaldue) > 90000) As soh on c.CustomerID = soh.CustomerID
VS
Select A.*, C.*
From Sales as A inner join Customer as C on A.customerID=C.customerID
Group by A.CustomerID
Having Sum(C.totaldue) > 90000
Is there a difference? Well, obviously. The two are constructed differently.
Do they produce the same result? Obviously not. In fact, the second one will produce an error in almost all databases, because the columns from A are not aggregated.
In addition, the number of columns is likely to differ between the two queries, unless Customer has exactly two columns.
I would suggest that you study SQL a bit harder. If your teachers are suggesting that you need to understand subqueries, then that is probably because they are an important part of the language.
Homework: Write a reasonable second query that doesn't use subqueries.
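One possible shape for such a query, sketched on SQLite via Python's sqlite3 rather than SQL Server. The schema prefix is dropped, and the CompanyName column and all rows are assumptions made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, CompanyName TEXT);
    CREATE TABLE SalesOrderHeader (SalesOrderID INTEGER PRIMARY KEY, CustomerID INTEGER, TotalDue INTEGER);
    INSERT INTO Customer VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO SalesOrderHeader VALUES
        (10, 1, 50000), (11, 1, 60000), (12, 2, 30000), (13, 3, 95000);
""")

# Version with a subquery (derived table) that aggregates first.
with_subquery = conn.execute("""
    SELECT soh.Total, c.CustomerID, c.CompanyName
    FROM Customer AS c
    INNER JOIN (SELECT oh.CustomerID, SUM(oh.TotalDue) AS Total
                FROM SalesOrderHeader AS oh
                GROUP BY oh.CustomerID
                HAVING SUM(oh.TotalDue) > 90000) AS soh
        ON c.CustomerID = soh.CustomerID
    ORDER BY c.CustomerID
""").fetchall()

# Version without a subquery: join first, then group by every
# non-aggregated column that appears in the SELECT list.
without_subquery = conn.execute("""
    SELECT SUM(oh.TotalDue) AS Total, c.CustomerID, c.CompanyName
    FROM Customer AS c
    INNER JOIN SalesOrderHeader AS oh ON c.CustomerID = oh.CustomerID
    GROUP BY c.CustomerID, c.CompanyName
    HAVING SUM(oh.TotalDue) > 90000
    ORDER BY c.CustomerID
""").fetchall()

print(with_subquery)
print(without_subquery)
```

The join-only form has to list each selected customer column in GROUP BY, which is why c.* does not carry over directly from the subquery version.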
Subqueries can take more time to execute and return results than joins, although many optimizers produce the same plan for both.
Inner joins are often a faster way to fetch results and process queries.
So it is usually worth trying an inner join in place of a subquery, since the choice can affect execution time. To test this, display the execution plan before running the query in the query panel.
This will show you the difference between the plans and the time taken to execute.

How to reduce scope of subquery?

I've got SQL running on MS SQL Server similar to the following:
SELECT
Cust.CustNum,
Name,
LastOrderDate
FROM
Cust
LEFT JOIN (
SELECT
CustNum, MAX(OrderDate) as LastOrderDate
FROM
Orders
GROUP BY
CustNum) as Orders
ON Orders.CustNum = Cust.CustNum
WHERE
Region = 1
It contains a subquery to find the MAX record from a child table. The concern is that these tables have a very large number of rows. It seems like the subquery would operate on all the rows of the child table, even though only a very few of them are actually needed because of the WHERE clause on the outer query.
Is there a way to reduce the scope of the inner query? Something like adding a WHERE clause to only include the records that are included in the outer query? Something like
WHERE CustomerOrders.CustomerNumber = Customers.CustomerNumber -- Customers from the outer query.
I suspect that this is not necessary, but I am getting some push back from another developer and I wanted to be sure (my SQL is a little rusty).
You are correct about the subquery. It will have to summarize all the data. You could re-write the query like this:
SELECT Cust.CustNum, Name, MAX(OrderDate) AS LastOrderDate
FROM Cust
LEFT JOIN Orders
ON Orders.CustNum = Cust.CustNum
WHERE Region = 1
GROUP BY Cust.CustNum, Name
This would let the SQL optimizer choose the optimal path.
If you know that there are very, very few customers matching Region = 1 and you have an index on CustNum, OrderDate in Orders, you could write the query like this:
select CustNum, Name,
(select top 1 OrderDate
from Orders o
where Cust.CustNum = o.CustNum
order by OrderDate desc
) as LastOrderDate
from Cust
Where Region = 1
I think you would get a very similar effect by using cross apply.
By the way, I'm not a fan of re-writing queries for such purposes. But, I haven't found a SQL optimizer that would do anything other than summarize all the orders rows in this case.
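The three formulations discussed above can be compared on a toy data set. This sketch uses SQLite via Python's sqlite3, so TOP 1 is swapped for ORDER BY ... LIMIT 1 (an assumption about dialect, not the SQL Server syntax), and all rows and the index name are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cust (CustNum INTEGER PRIMARY KEY, Name TEXT, Region INTEGER);
    CREATE TABLE Orders (CustNum INTEGER, OrderDate TEXT);
    CREATE INDEX idx_orders_cust_date ON Orders (CustNum, OrderDate);
    INSERT INTO Cust VALUES (1, 'A', 1), (2, 'B', 2), (3, 'C', 1);
    INSERT INTO Orders VALUES (1, '2020-01-01'), (1, '2021-05-05'), (2, '2022-01-01');
""")

# 1. Derived table that summarizes ALL orders, then filters by region.
derived = conn.execute("""
    SELECT Cust.CustNum, Name, LastOrderDate
    FROM Cust
    LEFT JOIN (SELECT CustNum, MAX(OrderDate) AS LastOrderDate
               FROM Orders
               GROUP BY CustNum) AS o ON o.CustNum = Cust.CustNum
    WHERE Region = 1
    ORDER BY Cust.CustNum
""").fetchall()

# 2. Join first, aggregate afterwards.
grouped = conn.execute("""
    SELECT Cust.CustNum, Name, MAX(OrderDate) AS LastOrderDate
    FROM Cust
    LEFT JOIN Orders ON Orders.CustNum = Cust.CustNum
    WHERE Region = 1
    GROUP BY Cust.CustNum, Name
    ORDER BY Cust.CustNum
""").fetchall()

# 3. Correlated subquery evaluated per matching customer.
correlated = conn.execute("""
    SELECT CustNum, Name,
           (SELECT OrderDate
            FROM Orders o
            WHERE o.CustNum = Cust.CustNum
            ORDER BY OrderDate DESC
            LIMIT 1) AS LastOrderDate
    FROM Cust
    WHERE Region = 1
    ORDER BY CustNum
""").fetchall()

print(derived, grouped, correlated, sep="\n")
```

All three return the same rows here, including the customer with no orders; which one runs fastest on large tables is something only the execution plan on the real system can show.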
No, it's generally not necessary if your statistics etc. are up to date. That's the job of the optimiser. You can try the CROSS APPLY operator if you think you're missing out on some shortcuts, but generally, if you have all constraints and stats in place, it will be fine.
Your proposed additional WHERE might make sense to you, but as it doesn't correlate to anything in the actual query you posted it will change the results (if it works at all). If you want comments on that you need to post tables & relations etc.
Best way is to check the execution plan and see if it's doing anything dumb.

Use of IN and EXISTS in SQL

Assuming that one has three Tables in a Relational Database as :
Customer(Id, Name, City),
Product(Id, Name, Price),
Orders(Cust_Id, Prod_Id, Date)
My first question is what is the best way to execute the query: "Get all the Customers who ordered a Product".
Some people propose the query with EXISTS as:
Select *
From Customer c
Where Exists (Select Cust_Id from Orders o where c.Id=o.cust_Id)
Is the above query equivalent (can it be written?) as:
Select *
From Customer
Where Exists (select Cust_id from Orders o Join Customer c on c.Id=o.cust_Id)
What is the problem when we use IN instead of EXISTS apart from the performance as:
Select *
From Customer
Where Customer.Id IN (Select o.cust_Id from Orders o )
Do the three above queries return exactly the same records?
Update: How does the EXISTS evaluation really work in the second query (or the first), considering that it only checks whether the subquery returns true or false? Is the query "interpreted" as something like the following?
Select *
From Customer c
Where Exists (True)
The first two queries are different.
The first has a correlated subquery and will return what you want -- information about customers who have an order.
The second has an uncorrelated subquery. It will return either all customers or no customers, depending on whether or not any customers have placed an order.
The third query is an alternative way of expressing what you want.
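The only possible issue that I can think of involves NULL values in cust_id. For the positive forms shown here, the first and third queries return the same rows; it is the negated forms that differ, because NOT IN returns no rows at all when the subquery yields a NULL, while NOT EXISTS is unaffected.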
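A sketch of these cases on SQLite via Python's sqlite3 (an assumption; the question does not name a database). The rows are invented and include one order with a NULL Cust_Id so the NULL behaviour is visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (Cust_Id INTEGER, Prod_Id INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Orders VALUES (1, 10), (1, 11), (NULL, 12);
""")

def ids(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Correlated EXISTS: evaluated against each outer customer row.
exists_correlated = ids("""
    SELECT c.Id FROM Customer c
    WHERE EXISTS (SELECT o.Cust_Id FROM Orders o WHERE c.Id = o.Cust_Id)
""")

# Uncorrelated EXISTS: true for every outer row as soon as any order matches any customer.
exists_uncorrelated = ids("""
    SELECT Id FROM Customer
    WHERE EXISTS (SELECT o.Cust_Id FROM Orders o JOIN Customer c ON c.Id = o.Cust_Id)
""")

# IN with a subquery that contains a NULL: same rows as the correlated EXISTS.
in_subquery = ids("""
    SELECT Id FROM Customer
    WHERE Id IN (SELECT o.Cust_Id FROM Orders o)
""")

# Negated forms: NOT EXISTS still works, NOT IN is defeated by the NULL.
not_exists = ids("""
    SELECT c.Id FROM Customer c
    WHERE NOT EXISTS (SELECT 1 FROM Orders o WHERE c.Id = o.Cust_Id)
""")
not_in = ids("""
    SELECT Id FROM Customer
    WHERE Id NOT IN (SELECT o.Cust_Id FROM Orders o)
""")

print(exists_correlated, exists_uncorrelated, in_subquery, not_exists, not_in)
```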
Queries 1 and 3 should return identical result sets.
Your second query is incorrect, as @ypercube points out in the comments: you're checking whether an uncorrelated subquery EXISTS.
Of the two that work (1, 3), I'd expect #3 to be the fastest, depending on your tables, because it only executes the subquery one time.
However your most effective result is probably none of them but this:
SELECT DISTINCT
c.*
FROM
Customer c
JOIN
Orders o
ON o.[cust_id] = c.[Id]
because it should just be an index scan and a hash.
You should check the query plans and/or benchmark each one.
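Why the join needs DISTINCT can be seen in a short sketch. It uses SQLite via Python's sqlite3 (an assumption) with invented rows in which one customer has two orders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (Cust_Id INTEGER, Prod_Id INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Orders VALUES (1, 10), (1, 11), (2, 10);
""")

# Plain join: one output row per matching order, so Ann appears twice.
plain_join = [row[0] for row in conn.execute("""
    SELECT c.Id
    FROM Customer c
    JOIN Orders o ON o.Cust_Id = c.Id
    ORDER BY c.Id
""")]

# DISTINCT collapses the duplicates introduced by the join.
distinct_join = [row[0] for row in conn.execute("""
    SELECT DISTINCT c.Id
    FROM Customer c
    JOIN Orders o ON o.Cust_Id = c.Id
    ORDER BY c.Id
""")]

# EXISTS never multiplies rows, so no DISTINCT is needed.
exists_form = [row[0] for row in conn.execute("""
    SELECT c.Id
    FROM Customer c
    WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.Cust_Id = c.Id)
    ORDER BY c.Id
""")]

print(plain_join, distinct_join, exists_form)
```

The de-duplication step is extra work the EXISTS form avoids, which is one reason to benchmark rather than assume the join is faster.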
The best way to execute that query is to add orders to the from clause and join to it.
select distinct c.*
from Customer c,
Orders o
where c.id = o.cust_id
Your other queries may be less efficient (depending on the shape of the data), but they should all return the same result set.

Which of these queries is preferable?

I've written the same query as a subquery and a self-join.
Is there any obvious argument for one over the other here?
SUBQUERY:
SELECT prod_id, prod_name
FROM products
WHERE vend_id = (SELECT vend_id
FROM products
WHERE prod_id = 'DTNTR');
SELF-JOIN:
SELECT p1.prod_id, p1.prod_name
FROM products p1, products p2
WHERE p1.vend_id = p2.vend_id
AND p2.prod_id = 'DTNTR';
The first query may throw an error if the subquery returns more than one value.
The second query does not use ANSI join syntax.
So it is better to use an ANSI-style join:
SELECT p1.prod_id, p1.prod_name
FROM products as p1 inner join products as p2
on p1.vend_id = p2.vend_id
WHERE p2.prod_id = 'DTNTR';
This post has some figures on execution times. The poster states:
The first query shows 49.2% of the batch while the second shows 50.8%, leading
one to think that the subquery is marginally faster.
Now, I started up Profiler and ran both queries. The first query required
over 92,000 reads to execute, but the one with the join required only 2300,
leading me to believe that the inner join is significantly faster.
There are conflicting responses though:
My rule of thumb: only use JOIN's if you need to output a column from the
table you are join'ing to; otherwise, use sub-queries.
and this:
Joining should always be faster - theoretically and realistically. Subqueries
- particularly correlated - can be very difficult to optimise. If you think
about it you will see why - technically, the subquery could be executed once
for each row of the outer query - blech!
I also agree with Madhivanan, if the sub query returns anything but one value your main query will fail, so use IN instead.
If you don't need any of the columns from the JOINed table, then using a subselect is generally preferable, although this depends on the RDBMS. An IN clause should be used instead of =:
SELECT prod_id, prod_name
FROM products
WHERE vend_id IN (SELECT vend_id
FROM products
WHERE prod_id = 'DTNTR');
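A sketch that runs the IN subquery and the ANSI self-join side by side on SQLite via Python's sqlite3 (an assumption; the question does not name a database), with invented product rows. Note that SQLite silently uses the first row of a multi-row scalar subquery, so the "more than one value" error that other engines raise for the = form is not reproduced here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (prod_id TEXT PRIMARY KEY, vend_id INTEGER, prod_name TEXT);
    INSERT INTO products VALUES
        ('DTNTR', 1003, 'Detonator'),
        ('SAFE',  1003, 'Safe'),
        ('ANV01', 1001, 'Anvil');
""")

# Subquery with IN: safe even if the subquery returns several vendors.
in_subquery = conn.execute("""
    SELECT prod_id, prod_name
    FROM products
    WHERE vend_id IN (SELECT vend_id
                      FROM products
                      WHERE prod_id = 'DTNTR')
    ORDER BY prod_id
""").fetchall()

# ANSI-style self-join: p2 locates the reference product, p1 lists its vendor's products.
self_join = conn.execute("""
    SELECT p1.prod_id, p1.prod_name
    FROM products AS p1
    INNER JOIN products AS p2 ON p1.vend_id = p2.vend_id
    WHERE p2.prod_id = 'DTNTR'
    ORDER BY p1.prod_id
""").fetchall()

print(in_subquery)
print(self_join)
```

Because prod_id is unique in this sketch, the self-join cannot produce duplicate rows and both forms return the same result.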