Use of IN and EXISTS in SQL

Assume that one has three tables in a relational database:
Customer(Id, Name, City),
Product(Id, Name, Price),
Orders(Cust_Id, Prod_Id, Date)
My first question is: what is the best way to execute the query "Get all the Customers who ordered a Product"?
Some people propose the query with EXISTS as:
Select *
From Customer c
Where Exists (Select Cust_Id from Orders o where c.Id=o.cust_Id)
Is the above query equivalent to (i.e. can it be written as):
Select *
From Customer
Where Exists (select Cust_id from Orders o Join Customer c on c.Id=o.cust_Id)
Apart from performance, what is the problem with using IN instead of EXISTS, as in:
Select *
From Customer
Where Customer.Id IN (Select o.cust_Id from Orders o )
Do the three above queries return exactly the same records?
Update: How does the EXISTS evaluation really work in the second query (or the first), considering that it only checks whether the subquery returns true or false? What is the "interpretation" of the query, i.e.:
Select *
From Customer c
Where Exists (True)

The first two queries are different.
The first has a correlated subquery and will return what you want -- information about customers who have an order.
The second has an uncorrelated subquery. It will return either all customers or no customers, depending on whether or not any customers have placed an order.
The third query is an alternative way of expressing what you want.
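The only possible issue that I can think of would arise when cust_id might have NULL values. With IN and EXISTS as written the results still match, because a NULL never compares equal to anything; it is the negated forms (NOT IN versus NOT EXISTS) where NULLs make the results differ.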
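A minimal sketch for checking these claims on a toy data set, using Python's built-in sqlite3 module. The sample rows and the helper name `ids` are made up for illustration, and other engines should be checked separately. It runs the three queries from the question, then the negated forms, with one order row whose Cust_Id is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT, City TEXT);
    CREATE TABLE Orders (Cust_Id INTEGER, Prod_Id INTEGER, Date TEXT);
    INSERT INTO Customer VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome'), (3, 'Cy', 'Lima');
    -- Ann has two orders, Bob has one, Cy has none; one order has a NULL customer.
    INSERT INTO Orders VALUES (1, 10, '2020-01-01'), (1, 11, '2020-01-02'),
                              (2, 10, '2020-01-03'), (NULL, 12, '2020-01-04');
""")

def ids(sql):
    """Run a query and return the sorted customer ids it yields."""
    return sorted(row[0] for row in conn.execute(sql))

# Query 1: correlated EXISTS, evaluated per outer customer row.
correlated = ids("SELECT Id FROM Customer c WHERE EXISTS "
                 "(SELECT 1 FROM Orders o WHERE o.Cust_Id = c.Id)")
# Query 2: uncorrelated EXISTS, one yes/no answer for the whole table.
uncorrelated = ids("SELECT Id FROM Customer WHERE EXISTS "
                   "(SELECT 1 FROM Orders o JOIN Customer c ON c.Id = o.Cust_Id)")
# Query 3: IN against the list of ordering customer ids.
with_in = ids("SELECT Id FROM Customer WHERE Id IN (SELECT Cust_Id FROM Orders)")

# Negated forms: the NULL in Orders.Cust_Id matters here.
not_exists = ids("SELECT Id FROM Customer c WHERE NOT EXISTS "
                 "(SELECT 1 FROM Orders o WHERE o.Cust_Id = c.Id)")
not_in = ids("SELECT Id FROM Customer WHERE Id NOT IN (SELECT Cust_Id FROM Orders)")

print(correlated, uncorrelated, with_in, not_exists, not_in)
```

On this data queries 1 and 3 agree, query 2 returns every customer, and NOT IN returns nothing at all because comparing against the NULL yields unknown rather than true.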

Queries 1 and 3 should return identical result sets.
Your second query is incorrect, as @ypercube points out in the comments. You're checking whether an uncorrelated subquery EXISTS.
Of the two that work (1, 3), I'd expect #3 to be the fastest depending on your tables because it only executes the subquery one time.
However, your most efficient option is probably none of them, but this:
SELECT DISTINCT
c.*
FROM
Customer c
JOIN
Orders o
ON o.[cust_id] = c.[Id]
because it should just be an index scan and a hash.
You should check the query plans and/or benchmark each one.
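As a rough illustration of why the DISTINCT is there, here is a sketch using Python's sqlite3 module with invented sample rows. It shows that a plain join repeats a customer once per order, while SELECT DISTINCT collapses those repeats and matches the EXISTS version. It says nothing about speed; that still needs the query plan on the real engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT, City TEXT);
    CREATE TABLE Orders (Cust_Id INTEGER, Prod_Id INTEGER, Date TEXT);
    INSERT INTO Customer VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome'), (3, 'Cy', 'Lima');
    INSERT INTO Orders VALUES (1, 10, '2020-01-01'), (1, 11, '2020-01-02'),
                              (2, 10, '2020-01-03');
""")

# Plain join: one output row per (customer, order) pair.
plain_join = conn.execute(
    "SELECT c.* FROM Customer c JOIN Orders o ON o.Cust_Id = c.Id ORDER BY c.Id"
).fetchall()

# DISTINCT removes the repeated customer rows.
distinct_join = conn.execute(
    "SELECT DISTINCT c.* FROM Customer c JOIN Orders o ON o.Cust_Id = c.Id ORDER BY c.Id"
).fetchall()

# Correlated EXISTS never produces duplicates in the first place.
exists_version = conn.execute(
    "SELECT * FROM Customer c WHERE EXISTS "
    "(SELECT 1 FROM Orders o WHERE o.Cust_Id = c.Id) ORDER BY c.Id"
).fetchall()

print(len(plain_join), len(distinct_join), distinct_join == exists_version)
```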

The best way to execute that query is to add orders to the from clause and join to it.
select distinct c.*
from Customer c,
orders o
where c.id = o.cust_id
Your other queries may be more inefficient (depending on the shape of the data) but they should all return the same result set.

Related

What's this symbol (<>) doing to this select statement?

The code below retrieved 74,700 rows from the database.
select * from Orders O
inner join customers C
on O.CustomerID <> c.CustomerID
The same code with = retrieves 830 records.
select * from Orders O
inner join customers C
on O.CustomerID = c.CustomerID
What's this not equal to symbol doing to my search query? The same difference is there in outer join too.
Thank you.
<> is the "not-equals" operator in SQL.
The query is getting all pairs of orders and customers where the CustomerID columns are different. What you really want is probably orders that don't have a valid customer id:
select o.*
from orders o left join
customers c
on o.CustomerID = c.CustomerID
where c.CustomerId is null;
(Actually, this seems unlikely if you have a proper foreign key relationship set up.)
Or more likely customers that don't have an order:
select c.*
from customers c left join
orders o
on o.CustomerID = c.CustomerID
where o.CustomerId is null;
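The two LEFT JOIN ... IS NULL queries above can be tried on a toy data set. This is only a sketch with Python's sqlite3 module: the rows are invented, and the orphan order for customer 'Z' exists only because no foreign key is declared here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (CustomerID TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, CustomerID TEXT);
    INSERT INTO customers VALUES ('A', 'Ann'), ('B', 'Bob'), ('C', 'Cy');
    -- Order 4 points at a customer id that does not exist.
    INSERT INTO orders VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'Z');
""")

# Orders without a valid customer: keep every order, then keep only
# the rows where the join found nothing on the customers side.
orphan_orders = [row[0] for row in conn.execute("""
    SELECT o.OrderID
    FROM orders o LEFT JOIN customers c ON o.CustomerID = c.CustomerID
    WHERE c.CustomerID IS NULL
""")]

# Customers without an order: same pattern with the tables swapped.
idle_customers = [row[0] for row in conn.execute("""
    SELECT c.CustomerID
    FROM customers c LEFT JOIN orders o ON o.CustomerID = c.CustomerID
    WHERE o.CustomerID IS NULL
""")]

print(orphan_orders, idle_customers)
```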
The ON operator
Logically, every SQL query is executed in the following order:
FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY
You can read more about this in the official MSDN documentation: SELECT (Transact-SQL).
This means the ON predicate determines which rows match between the tables, while the WHERE clause filters the result.
Cardinal means 1:N, or the number of matches on each side. In your example, ON A.CUSTOMER_ID = B.CUSTOMER_ID will return a row for every matching pair of rows.
LEFT and RIGHT refer to which side is the source table. By default, the left is considered the source table.
So if table A has 3 rows where ID = 3, then even if table B has only one row with ID = 3, you will get 3 rows back; each row in table A is treated separately.
A good join uses only the columns required to identify a unique match, so that unwanted repeated values are not returned. Even if you meant to use a CROSS JOIN, you still need to make sure to use unique matching sets for your purpose.
Relationally, what do the joins mean?
This is the real question: what do the tables represent and how do they answer a question?
Relational means value, information, a question or query answered.
When you know what the batch or proc does or what its purpose is for the script(s), identifying silly queries becomes easier.
CONCLUSION
ON ID = ID - selects matching rows.
ON ID <> ID - returns every nonmatching row for every row in the source table. Essentially a cross join minus the actual join rows.
Good practice is to use the ON to identify unique rows that match and the WHERE clause to filter this result on the side of the source table.
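The "cross join minus the matching rows" description can be checked by counting. The sketch below uses Python's sqlite3 module with invented rows and no NULL ids (a NULL id would drop out of both the = and the <> join). The same arithmetic fits the numbers in the question: 74,700 is exactly 830 × 90, i.e. each of the 830 orders paired with 90 non-matching customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (CustomerID TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, CustomerID TEXT);
    INSERT INTO customers VALUES ('A', 'Ann'), ('B', 'Bob'), ('C', 'Cy');
    INSERT INTO orders VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'Z');
""")

def count(sql):
    """Return the single COUNT(*) value produced by the query."""
    return conn.execute(sql).fetchone()[0]

# Every order paired with every customer.
cross_count = count("SELECT COUNT(*) FROM orders O CROSS JOIN customers C")
# Only the pairs whose ids match.
equal_count = count(
    "SELECT COUNT(*) FROM orders O INNER JOIN customers C ON O.CustomerID = C.CustomerID")
# Only the pairs whose ids differ.
not_equal_count = count(
    "SELECT COUNT(*) FROM orders O INNER JOIN customers C ON O.CustomerID <> C.CustomerID")

print(cross_count, equal_count, not_equal_count)
```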
The <> symbol means "not equal to", i.e. O.CustomerID is not equal to c.CustomerID. You can also use !=, which also means "not equal to" in SQL.
The not-equal operator <> returns true when the values are NOT EQUAL.
The code on O.CustomerID <> c.CustomerID seems to join every row of the orders table with every row of the customers table that is not equal to it. Here is an example in the SQL fiddle.
http://sqlfiddle.com/#!9/e05f92/2/0
As you can see, in the top select (one where an = sign is used), it only selects the rows where the order customerID is equal to the Customer customerID
In the bottom select (where the <> is used) it joins every customer row, with every possible order row which is not equal, which is why you get so many results for the <> query.

Specifying SELECT, then joining with another table

I just hit a wall with my SQL query fetching data from my MS SQL Server.
To simplify, say I have one table for sales, and one table for customers. They each have a corresponding userId which I can use to join the tables.
I wish to first SELECT from the sales table where, say, price is equal to 10, and then join it on the userId, in order to get access to the name, address etc. from the customer table.
In which order should I structure the query? Do I need some sort of subquery, or what do I do?
I have tried something like this
SELECT *
FROM Sales
WHERE price = 10
INNER JOIN Customers
ON Sales.userId = Customers.userId;
Needless to say this is very simplified and not my database schema, yet it explains my problem simply.
Any suggestions ? I am at a loss here.
A SELECT has a certain order of its components
In the simple form this is:
What do I select: column list
From where: table name and joined tables
Are there filters: WHERE
How to sort: ORDER BY
So: most likely it was enough to change your statement to
SELECT *
FROM Sales
INNER JOIN Customers ON Sales.userId = Customers.userId
WHERE price = 10;
The WHERE clause must follow the joins:
SELECT * FROM Sales
INNER JOIN Customers
ON Sales.userId = Customers.userId
WHERE price = 10
This is simply the way SQL syntax works. You seem to be trying to put the clauses in the order that you think they should be applied, but SQL is a declarative language, not a procedural one: you are defining what you want to occur, not how it will be done.
You could also write the same thing like this:
SELECT * FROM (
SELECT * FROM Sales WHERE price = 10
) AS filteredSales
INNER JOIN Customers
ON filteredSales.userId = Customers.userId
This may seem like it indicates a different order for the operations to occur, but it is logically identical to the first query, and in either case, the database engine may determine to do the join and filtering operations in either order, as long as the result is identical.
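To see both points concretely, here is a small sketch using Python's sqlite3 module. The rows and the extra saleId and name columns are invented for illustration, and SQL Server's error message will differ from SQLite's. The join-then-WHERE form and the derived-table form return the same rows, while the original ordering with WHERE before INNER JOIN is rejected as a syntax error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (userId INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Sales (saleId INTEGER PRIMARY KEY, userId INTEGER, price REAL);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Sales VALUES (1, 1, 10), (2, 2, 25), (3, 2, 10);
""")

# WHERE written after the join.
join_then_where = sorted(conn.execute("""
    SELECT s.saleId, c.name
    FROM Sales s INNER JOIN Customers c ON s.userId = c.userId
    WHERE s.price = 10
"""))

# Filter written inside a derived table, then joined.
derived_table = sorted(conn.execute("""
    SELECT f.saleId, c.name
    FROM (SELECT * FROM Sales WHERE price = 10) AS f
    INNER JOIN Customers c ON f.userId = c.userId
"""))

# The ordering from the question: WHERE before INNER JOIN.
try:
    conn.execute("""
        SELECT * FROM Sales WHERE price = 10
        INNER JOIN Customers ON Sales.userId = Customers.userId
    """)
    misplaced_where_error = None
except sqlite3.OperationalError as exc:
    misplaced_where_error = str(exc)

print(join_then_where == derived_table, misplaced_where_error)
```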
Sounds fine to me, did you run the query and check?
SELECT s.*, c.*
FROM Sales s
INNER JOIN Customers c
ON s.userId = c.userId
WHERE s.price = 10;

SQL Cross Join better in performance than normal join?

I'm currently working with SQL and wondered about cross join.
Assuming I have the following relations:
customer(customerid, firstname, lastname)
transact(customerid, productid, date, quantity)
product(productid, description)
This query is written in Oracle SQL. It should select the last name of all customers who bought more than 1000 units of a product (rather senseless, but no matter):
SELECT c.lastname, t.date
FROM customer c, transact t
WHERE t.quantity > 1000
AND t.customerid = c.customerid
Isn't this doing a cross join?! Isn't this extremely slow when the tables consist of a huge amount of data?
Isn't it better to do something like this:
SELECT c.lastname, t.date
FROM customer c
JOIN transact t ON(c.customerid = t.customerid)
WHERE t.quantity > 1000
Which is better in performance? And how are these queries handled internally?
Thanks for your help,
Barbara
The two queries aren't equivalent, because:
SELECT lastname, date
FROM customer, transact
WHERE quantity > 1000
Doesn't actually limit to customers that bought > 1000, it's simply taking every combination of rows from those two tables, and excluding any with quantity less than or equal to 1000 (all customers will be returned).
This query is equivalent to your JOIN version:
SELECT lastname, date
FROM customer c, transact t
WHERE quantity > 1000
AND c.customerid = t.customerid
The explicit JOIN version is preferred as it's not deprecated syntax, but both should have the same execution plan and identical performance. The explicit JOIN version is easier to read in my opinion, but the fact that the comma listed/implicit method has been outdated for over a decade (two?) should be enough reason to avoid it.
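A quick way to convince yourself of the equivalence is to run all three forms on a toy data set. This sketch uses Python's sqlite3 module rather than Oracle, with invented rows, so it only demonstrates that the results match, not what Oracle's planner does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE transact (customerid INTEGER, productid INTEGER, date TEXT, quantity INTEGER);
    INSERT INTO customer VALUES (1, 'Amy', 'Adams'), (2, 'Ben', 'Baker');
    -- Only Adams bought more than 1000 units.
    INSERT INTO transact VALUES (1, 100, '2020-01-01', 1500), (2, 100, '2020-01-02', 10);
""")

# Comma syntax with the join condition in the WHERE clause.
implicit_join = sorted(conn.execute("""
    SELECT c.lastname, t.date FROM customer c, transact t
    WHERE t.quantity > 1000 AND t.customerid = c.customerid
"""))

# Explicit JOIN ... ON syntax.
explicit_join = sorted(conn.execute("""
    SELECT c.lastname, t.date FROM customer c
    JOIN transact t ON c.customerid = t.customerid
    WHERE t.quantity > 1000
"""))

# Comma syntax with the join condition left out: a filtered cross product.
no_join_condition = sorted(conn.execute("""
    SELECT c.lastname, t.date FROM customer c, transact t
    WHERE t.quantity > 1000
"""))

print(implicit_join, explicit_join, no_join_condition)
```

On this data the first two lists are identical, while the third also pairs Baker with Adams's transaction.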
This is too long for a comment.
If you want to know how they are handled then look at the query plan.
In your case, the queries are not the same. The first does a cross join with conditions on only one table. The second does a legitimate join. The second is the right way to write the query.
However, even if you included the correct where clause in the first query, then the performance should be the same. Oracle is smart enough to recognize that the two queries do the same thing (if written correctly).
Simple rule: never use commas in the from clause. Always use explicit join syntax.

How to reduce scope of subquery?

I've got SQL running on MS SQL Server similar to the following:
SELECT
Cust.CustNum,
Name,
LastOrderDate
FROM
Cust
LEFT JOIN (
SELECT
CustNum, MAX(OrderDate) as LastOrderDate
FROM
Orders
GROUP BY
CustNum) as Orders
ON Orders.CustNum = Cust.CustNum
WHERE
Region = 1
It contains a subquery to find the MAX record from a child table. The concern is that these tables have a very large number of rows. It seems like the subquery would operate on all the rows of the child table, even though only a very few of them are actually needed because of the WHERE clause on the outer query.
Is there a way to reduce the scope of the inner query? Something like adding a WHERE clause to only include the records that are included in the outer query? Something like
WHERE CustomerOrders.CustomerNumber = Customers.CustomerNumber -- Customers from the outer query.
I suspect that this is not necessary, but I am getting some push back from another developer and I wanted to be sure (my SQL is a little rusty).
You are correct about the subquery. It will have to summarize all the data. You could re-write the query like this:
SELECT Cust.CustNum, Name, max(OrderDate) as LastOrderDate
FROM Cust LEFT JOIN
Orders
ON Orders.CustNum = Cust.CustNum
WHERE Region = 1
group by Cust.CustNum, Name
This would let the SQL optimizer choose the optimal path.
If you know that there are very, very few customers matching Region = 1 and you have an index on CustNum, OrderDate in Orders, you could write the query like this:
select CustNum, Name,
(select top 1 OrderDate
from Orders o
where Cust.CustNum = o.CustNum
order by OrderDate desc
) as LastOrderDate
from Cust
Where Region = 1
I think you would get a very similar effect by using cross apply.
By the way, I'm not a fan of re-writing queries for such purposes. But, I haven't found a SQL optimizer that would do anything other than summarize all the orders rows in this case.
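The three formulations discussed here (derived table, join plus GROUP BY, correlated subquery) can be compared for correctness on a toy data set. This is a sketch with Python's sqlite3 module and invented rows. SQLite has no TOP, so the correlated version uses ORDER BY ... LIMIT 1 as a stand-in for select top 1. It only checks that the results agree, and says nothing about which plan SQL Server would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cust (CustNum INTEGER PRIMARY KEY, Name TEXT, Region INTEGER);
    CREATE TABLE Orders (CustNum INTEGER, OrderDate TEXT);
    INSERT INTO Cust VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cy', 1);
    -- Cy is in region 1 but has no orders.
    INSERT INTO Orders VALUES (1, '2020-01-05'), (1, '2020-03-01'), (2, '2020-02-01');
""")

# Original shape: aggregate all orders in a derived table, then join.
derived = sorted(conn.execute("""
    SELECT c.CustNum, c.Name, o.LastOrderDate
    FROM Cust c
    LEFT JOIN (SELECT CustNum, MAX(OrderDate) AS LastOrderDate
               FROM Orders GROUP BY CustNum) AS o ON o.CustNum = c.CustNum
    WHERE c.Region = 1
"""))

# Rewrite 1: join first, then group.
grouped = sorted(conn.execute("""
    SELECT c.CustNum, c.Name, MAX(o.OrderDate) AS LastOrderDate
    FROM Cust c LEFT JOIN Orders o ON o.CustNum = c.CustNum
    WHERE c.Region = 1
    GROUP BY c.CustNum, c.Name
"""))

# Rewrite 2: correlated scalar subquery (LIMIT 1 instead of TOP 1).
correlated = sorted(conn.execute("""
    SELECT c.CustNum, c.Name,
           (SELECT o.OrderDate FROM Orders o
            WHERE o.CustNum = c.CustNum
            ORDER BY o.OrderDate DESC LIMIT 1) AS LastOrderDate
    FROM Cust c
    WHERE c.Region = 1
"""))

print(derived == grouped == correlated, derived)
```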
No, it's generally not necessary if your statistics etc. are up to date. That's the job of the optimiser. You can try the CROSS APPLY operator if you think you're missing out on some shortcuts, but generally if you have all constraints and stats it will be fine.
Your proposed additional WHERE might make sense to you, but as it doesn't correlate to anything in the actual query you posted it will change the results (if it works at all). If you want comments on that you need to post tables & relations etc.
Best way is to check the execution plan and see if it's doing anything dumb.

How to select all attributes in sql Join query

The following sql query below produces the specified result.
select product.product_no,product_type,salesteam.rep_name,salesteam.SUPERVISOR_NAME
from product
inner join salesteam
on product.product_rep=salesteam.rep_id
ORDER BY product.Product_No;
However, my intention is to produce a more detailed result which will include all the attributes in the PRODUCT table. My approach is to list all the attributes in the first line of the query.
select product.product_no,product.product_date,product.product_colour,product.product_style,
product.product_age,product_type,salesteam.rep_name,salesteam.SUPERVISOR_NAME
from product
inner join salesteam
on product.product_rep=salesteam.rep_id
ORDER BY product.Product_No;
Is there another way it can be done instead of listing all the attributes of the Product table one by one?
You can use * to select all columns from all tables, or you can use [table/alias].* to select all columns from the specified table. In your case, you can use product.*:
select product.*,salesteam.rep_name,salesteam.SUPERVISOR_NAME
from product
inner join salesteam
on product.product_rep=salesteam.rep_id
ORDER BY product.Product_No;
It is important to note that you should only do this if you are 100% sure you need every single column, and always will. There are performance implications associated with this; if you're selecting 100 columns from a table when you really only need 4 or 5 of them, you're adding a lot of overhead to the query. The DBMS has to work harder, and you're also sending more data across the wire (if your database is not on the same machine as your executing code).
If any columns are later added to the product table, those columns will also be returned by this query in the future.
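That last point is easy to demonstrate. The sketch below uses Python's sqlite3 module with a cut-down, made-up version of the two tables, and the helper `column_names` is hypothetical. It runs the same product.* query before and after a column is added and compares the column lists that come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesteam (rep_id INTEGER PRIMARY KEY, rep_name TEXT, supervisor_name TEXT);
    CREATE TABLE product (product_no INTEGER PRIMARY KEY, product_type TEXT, product_rep INTEGER);
    INSERT INTO salesteam VALUES (1, 'Rita', 'Sam');
    INSERT INTO product VALUES (100, 'chair', 1);
""")

QUERY = """
    SELECT product.*, salesteam.rep_name, salesteam.supervisor_name
    FROM product
    INNER JOIN salesteam ON product.product_rep = salesteam.rep_id
    ORDER BY product.product_no
"""

def column_names(sql):
    """Return the names of the columns a query produces."""
    return [col[0] for col in conn.execute(sql).description]

columns_before = column_names(QUERY)

# Someone later extends the product table.
conn.execute("ALTER TABLE product ADD COLUMN product_colour TEXT")

# The unchanged query now returns one more column.
columns_after = column_names(QUERY)

print(columns_before)
print(columns_after)
```

Code that reads the result by position rather than by name is the part most likely to break when this happens.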
select
product.*,
salesteam.rep_name,
salesteam.SUPERVISOR_NAME
from product inner join salesteam on
product.product_rep=salesteam.rep_id
ORDER BY
product.Product_No;
This should do.
You can write it like this:
select P.* --- all Product columns
,S.* --- all salesteam columns
from product P
inner join salesteam S
on P.product_rep=S.rep_id
ORDER BY P.Product_No;