What's this symbol (<>) doing to this select statement? - sql

The code below retrieved 74,700 rows from the database.
select * from Orders O
inner join customers C
on O.CustomerID <> c.CustomerID
The same code with = retrieves 830 records.
select * from Orders O
inner join customers C
on O.CustomerID = c.CustomerID
What's this not equal to symbol doing to my search query? The same difference is there in outer join too.
Thank you.

<> is the "not-equals" operator in SQL.
The query is getting all pairs of orders and customers where the CustomerID columns are different. What you really want is probably orders that don't have a valid customer id:
select o.*
from orders o left join
customers c
on o.CustomerID = c.CustomerID
where c.CustomerId is null;
(Actually, this seems unlikely if you have a proper foreign key relationship set up.)
Or more likely customers that don't have an order:
select c.*
from customers c left join
orders o
on o.CustomerID = c.CustomerID
where o.CustomerId is null;

The ON clause
Logically, the clauses of a SQL query are processed in the following order:
FROM, ON, JOIN, WHERE, GROUP BY, HAVING, SELECT, ORDER BY
You can read more about this in the official MSDN documentation: SELECT (Transact-SQL).
This means the on predicate relates to the cardinal matches between tables, while the WHERE clause filters the results.
Cardinal means 1:N, or the number of matches to a side. In your example, ON A.CUSTOMER_ID = B.CUSTOMER_ID will return a row for every matching set from the source table.
LEFT and RIGHT refer to which side is the source table. By default, the left is considered the source table.
So if table A has 3 rows where ID = 3, then even if table B has only one ID of 3, you will return 3 rows; each row in Table A is treated separately.
A good join only uses the number of columns required to return a unique join, so that unwanted repeating values are not returned. Even if you meant to use a CROSS JOIN, you still need to make sure to use unique matching sets for your purpose.
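The "3 rows in A, one row in B" case can be checked with a small sketch. It uses Python's built-in sqlite3 module only as a convenient way to run the SQL; the tables a and b and their label/name columns are made up for illustration.

```python
import sqlite3

# In-memory database with hypothetical tables a and b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, label TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (3, 'a1'), (3, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (3, 'b1');
""")

# Each of the three rows in a matches the single row in b separately.
rows = conn.execute(
    "SELECT a.label, b.name FROM a JOIN b ON a.id = b.id ORDER BY a.label"
).fetchall()
print(rows)  # [('a1', 'b1'), ('a2', 'b1'), ('a3', 'b1')]
```

The single row in b is repeated once per matching row in a, which is why the join returns three rows.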
Relationally, what do the joins mean?
This is the real question: what do the tables represent and how do they answer a question?
Relational means value, information, a question or query answered.
When you know what the batch or procedure does, or what the purpose of the script(s) is, identifying silly queries becomes easier.
CONCLUSION
ON ID = ID - selects matching rows.
ON ID <> ID - returns every nonmatching row for every row in the source table. Essentially a cross join minus the actual join rows.
Good practice is to use the ON to identify unique rows that match and the WHERE clause to filter this result on the side of the source table.
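The "cross join minus the actual join rows" claim fits the numbers in the question: 74,700 + 830 = 75,530 = 830 × 91, which is what you would get if each of 830 orders matched exactly one of 91 customers (the figure of 91 is inferred from the arithmetic, it is not stated in the question). Here is a minimal sketch of the same effect on toy data, run through Python's sqlite3 module; the rows are invented for illustration.

```python
import sqlite3

# Toy data: 3 customers, 4 orders, every order has exactly one customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (CustomerID TEXT PRIMARY KEY);
    CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, CustomerID TEXT);
    INSERT INTO customers VALUES ('A'), ('B'), ('C');
    INSERT INTO orders VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'C');
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

equal = count(
    "SELECT COUNT(*) FROM orders o JOIN customers c ON o.CustomerID = c.CustomerID"
)
not_equal = count(
    "SELECT COUNT(*) FROM orders o JOIN customers c ON o.CustomerID <> c.CustomerID"
)
cross = count("SELECT COUNT(*) FROM orders o CROSS JOIN customers c")

print(equal, not_equal, cross)  # 4 8 12
```

The <> join pairs each order with every customer except its own, so the two counts add up to the size of the cross join (as long as no CustomerID is NULL).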

The <> symbol means "not equal to", i.e. O.CustomerID is not equal to c.CustomerID. You can also use !=, which also means "not equal to" in SQL.

The not equal <> operator returns true when the values are NOT EQUAL
The code on O.CustomerID <> c.CustomerID seems to join every row of the orders table with every row of the customers table that is not equal to it. Here is an example in the SQL fiddle.
http://sqlfiddle.com/#!9/e05f92/2/0
As you can see, the top select (the one where an = sign is used) only selects the rows where the order's CustomerID is equal to the customer's CustomerID.
The bottom select (where <> is used) joins every customer row with every order row whose CustomerID is not equal, which is why you get so many results for the <> query.

Related

Does the ANY keyword in SQL give distinct records when used in a subquery?

I looked at a query of "ANY" from a tutorial site which was like:
SELECT ProductName
FROM Products
WHERE ProductID = ANY (SELECT ProductID
FROM OrderDetails
WHERE Quantity = 10);
This query is returning 31 rows and no duplicates.
After this I tried to write the same query using joins, but I could not reproduce the result of the query above.
Join query I used:
SELECT Products.ProductName
FROM Products
LEFT JOIN OrderDetails ON Products.ProductID = OrderDetails.ProductID
WHERE OrderDetails.Quantity = 10
ORDER BY Products.ProductName;
This is returning 44 rows, and has duplicates included.
After I used DISTINCT in this join query with ProductName, I got the desired result.
Hence I want to know - does "ANY" clause produce distinct records?
PS: I got the same results for both join queries (with and without DISTINCT) when using an inner join as well.
A join is a completely different operation to that of any (or similar all).
any is a logical operator and in your example is used to determine whether each row in Products should be returned.
The most rows that could be returned is equal to the number of rows in Products if the boolean result of the any operator is true for each ProductId.
Joining the tables compares the two inputs to the join operator and outputs every matching pair of rows. So if a single ProductID comes in from Products and OrderDetails has two rows with that ProductID and Quantity = 10, two rows are output, one for each matching row.
Hence I want to know - does "ANY" clause produce distinct records?
No. It is actually the opposite. The records being chosen are those in the FROM clause. So, in the first query, there are no duplicates in Products. The WHERE clause is never going to generate duplicate records. That is not a property of ANY in particular; it is also true of IN and EXISTS and any other comparison operation.
What is opposite is that the JOIN does produce duplicate records. That is what you are seeing in the second query. The table OrderDetails has multiple rows for a given product.
Note that ANY (and IN) do actually implement a type of JOIN called a semi-join. So, there is a relationship between what these operators do and JOINs in the FROM clause. However, semi-joins and anti-joins are different from inner and outer joins that are defined in the FROM clause.
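To see the difference on a small scale, here is a sketch using Python's sqlite3 module. SQLite has no = ANY (...) syntax, so the sketch uses IN, which the answer above says behaves the same way in this respect; the product names and order rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
    INSERT INTO Products VALUES (1, 'Chai'), (2, 'Chang'), (3, 'Tofu');
    -- Product 1 appears twice with Quantity = 10.
    INSERT INTO OrderDetails VALUES (100, 1, 10), (101, 1, 10), (102, 2, 5), (103, 3, 10);
""")

# Semi-join: each Products row is either kept or dropped, never multiplied.
semi = conn.execute("""
    SELECT ProductName
    FROM Products
    WHERE ProductID IN (SELECT ProductID FROM OrderDetails WHERE Quantity = 10)
    ORDER BY ProductName
""").fetchall()

# Join: one output row per matching OrderDetails row.
joined = conn.execute("""
    SELECT p.ProductName
    FROM Products p
    JOIN OrderDetails od ON p.ProductID = od.ProductID
    WHERE od.Quantity = 10
    ORDER BY p.ProductName
""").fetchall()

print(semi)    # [('Chai',), ('Tofu',)]
print(joined)  # [('Chai',), ('Chai',), ('Tofu',)]
```

The duplicate 'Chai' in the join result comes from the two matching OrderDetails rows, not from anything special that IN or ANY does.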

What is a correlated subquery? How is it different from a non-correlated subquery? And why do we need an alias?

My Problem
I have a question about subqueries in SQL.
Among other things, I am trying to check whether I have understood the principle of a correlated subquery, and also to understand the purpose of the aliases in it.
To do this, I will use an example and explain how I understand the correlated subquery.
The Example
Consider the following Query as an example:
SELECT custid, companyname
FROM Sales.Customers AS C
WHERE EXISTS
(SELECT *
FROM Sales.Orders AS O
WHERE O.custid = C.custid
AND O.orderdate = '20070212');
From this Query, I separate the Outer and Inner Query :
Outer Query :
SELECT custid, companyname
FROM Sales.Customers AS C
WHERE EXISTS
Inner Query :
SELECT *
FROM Sales.Orders AS O
WHERE O.custid = C.custid
AND O.orderdate = '20070212'
My Understanding
From what I understood, this is why we can talk about a correlated subquery in this case:
In SQL, a query is processed row by row: it selects for row 1, then row 2, then row 3, and so on.
In my inner query, I use C.custid (a column whose value is read row by row by my outer query) and compare it to O.custid (a whole column that also has to be read row by row).
So the query needs to examine all the rows of table O before moving to the next row of table C. For this reason, it is a correlated subquery.
In other words, the Query will execute as follows:
1. The outer query reads the values of custid and companyname from the FIRST ROW of table C.
2. The inner query looks at the first row of table O.
3. The inner query compares the value of C.custid found in the FIRST ROW of table C to the value of O.custid found in the FIRST ROW of table O.
4. The inner query compares the value of O.orderdate found in the FIRST ROW of table O to '20070212'.
5. The inner query moves to the NEXT ROW of table O.
6. The inner query repeats steps 2 to 5 with the NEXT ROW of table O instead of the FIRST ROW, until it reaches the end of table O.
7. The outer query moves to the NEXT ROW of table C.
8. Steps 2 to 7 repeat, this time comparing O.custid to the value of C.custid in the NEXT ROW, and so on until the end of table C.
My Second Problem
Now, assuming I have correctly understood the principle of the correlated subquery,
the question I ask myself is this:
Why should we use aliases?
In the example above, we could say that it is because we use two tables that each have a column of the same name.
However, if the two columns had not both been named "custid", what would the aliases have been useful for?
Is it because the "SELECT" command modifies the table in some way?
Because if the table is not modified, I have trouble understanding why aliases are necessary in a correlated subquery.
Note: I know that a correlated subquery can also be optimized by using a join, but I really want to focus on the basics of the correlated subquery.
So, apparently, my understanding of the correlated subquery is good.
In a non-correlated subquery, the inner query is executed once in total.
In a correlated subquery, it is executed once for each row of the outer query.
However, I still don't know why an alias is needed.
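On the alias question, a small experiment may help. What matters is that the column references are qualified; the alias is just a short name for the table. Inside a subquery, an unqualified column name is resolved against the subquery's own tables first, so without qualification custid = custid compares Orders.custid with itself. The sketch below runs this in Python's sqlite3 module; it drops the Sales schema prefix and uses invented rows, so treat it as an illustration rather than the original database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (custid INTEGER PRIMARY KEY, companyname TEXT);
    CREATE TABLE Orders (orderid INTEGER PRIMARY KEY, custid INTEGER, orderdate TEXT);
    INSERT INTO Customers VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO Orders VALUES (10, 1, '20070212'), (11, 2, '20070301');
""")

# Qualified: O.custid is compared with the current outer row's C.custid.
qualified = conn.execute("""
    SELECT custid, companyname
    FROM Customers AS C
    WHERE EXISTS (SELECT *
                  FROM Orders AS O
                  WHERE O.custid = C.custid
                    AND O.orderdate = '20070212')
    ORDER BY custid
""").fetchall()

# Unqualified: both sides of "custid = custid" resolve to Orders.custid,
# so the subquery no longer depends on the outer row at all.
unqualified = conn.execute("""
    SELECT custid, companyname
    FROM Customers
    WHERE EXISTS (SELECT *
                  FROM Orders
                  WHERE custid = custid
                    AND orderdate = '20070212')
    ORDER BY custid
""").fetchall()

print(qualified)    # [(1, 'Alpha')]
print(unqualified)  # [(1, 'Alpha'), (2, 'Beta'), (3, 'Gamma')]
```

If the two columns had different names, the unqualified version would resolve correctly, but qualifying them anyway protects the query if a column with the same name is later added to the inner table. Writing Orders.custid = Customers.custid with the full table names would work as well; the aliases only make it shorter.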

Using an sql join's "ON" statement with an "AND" instead of a "WHERE"

So I'm wondering which one is best performance-wise, and also whether using an AND here is simply bad practice.
Compare the endings of the following two queries:
Using a "WHERE" at the end :
select c.cust_last_name,
o.order_total,
oi.quantity
from customers c
join orders o on (c.customer_id = o.customer_id)
join order_items oi on (o.order_id = oi.order_id)
where c.GENDER='M';
Using an "AND" at the end :
select c.cust_last_name,
o.order_total,
oi.quantity
from customers c
join orders o on (c.customer_id = o.customer_id)
join order_items oi on (o.order_id = oi.order_id and c.GENDER='M');
The and is riding on the last ON's conditions to retrieve the exact same dataset as the first query. Is this OK?
In this instance, I doubt it would make much difference to Oracle which version of the query you used. You could check by looking at the explain plan for each query.
However, it is only "safe" to move the c.gender = 'M' predicate into the join condition here because you're doing an inner join. If you were doing an outer join, you'd see different results depending on whether that predicate was in the where or join clause.
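For the inner join case, the equivalence is easy to check on toy data. The sketch below uses Python's sqlite3 module instead of Oracle and leaves out order_items to keep it short; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, cust_last_name TEXT, gender TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_total REAL);
    INSERT INTO customers VALUES (1, 'Smith', 'M'), (2, 'Jones', 'F');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 2, 20.0), (12, 1, 5.0);
""")

# Filter in the WHERE clause.
with_where = conn.execute("""
    SELECT c.cust_last_name, o.order_total
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.gender = 'M'
    ORDER BY o.order_id
""").fetchall()

# Same filter moved into the ON clause of the inner join.
with_on = conn.execute("""
    SELECT c.cust_last_name, o.order_total
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id AND c.gender = 'M'
    ORDER BY o.order_id
""").fetchall()

print(with_where)  # [('Smith', 99.5), ('Smith', 5.0)]
print(with_on)     # [('Smith', 99.5), ('Smith', 5.0)]
```

This only shows that the results match for an inner join; it says nothing about the execution plan on Oracle, which you would still check with the explain plan.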
For an inner join, it makes no difference to the result whether you put the condition in the WHERE clause or an ON clause.
But yes, what you show is bad practise, because and c.GENDER='M' has nothing to do with which records to join from table order_items. The criteria in an ON clause should always belong with its table.
An example with additional criteria on the order items table would be
join order_items oi on (o.order_id = oi.order_id and oi.price > 50)
Here it is more or less a matter of personal preference if you want to see this in the ON clause or WHERE clause. You could argue that you join the tables on their order IDs and then only keep results with a price higher than 50, so the join is on the IDs only. Or you could argue that you join order items with a price > 50. Both statements are semantically correct.
However it is a good habit to always have all criteria on a table in its ON clause. When you change
inner join order_items oi on (o.order_id = oi.order_id)
where oi.price > 50
to
left join order_items oi on (o.order_id = oi.order_id)
where oi.price > 50
this is effectively an inner join still, because the outer-joined records will have a price of NULL which doesn't meet your WHERE clause criteria so you'd remove the records right after creating them :-) So you would have to move the criteria to the ON clause because of the other join type. Wouldn't it be better to have it there already?
I think there is a difference in performance, though it is not significant unless you have a large number of records in your tables.
The first case first gets all the relevant records according to the inner join and then filters by male gender. You are loading all the records, some of which are not relevant (female), and then filtering.
In the second case, the non-relevant records will not be gathered at all; the filtering is done as part of the join operation.
And I agree with @Radu Gheorghiu, you may want to move the c.GENDER = 'M' condition one level higher.

Is it better to use where instead of adding conditions into join clause in SQL queries? [duplicate]

This question already has answers here:
INNER JOIN ON vs WHERE clause
Hello :) I've got a question on MySQL queries.
Which one's faster and why?
Is there any difference at all?
select tab1.something, tab2.smthelse
from tab1 inner join tab2 on tab1.pk=tab2.fk
WHERE tab2.somevalue = 'value'
Or this one:
select tab1.something, tab2.smthelse
from tab1 inner join tab2 on tab1.pk=tab2.fk
AND tab2.somevalue = 'value'
As Simon noted, the difference in performance should be negligible. The main concern would be ensuring your query correctly expresses your intent, and (especially) you get the expected results.
Generally, you want to add filters to the JOIN clause only if the filter is a condition of the join. In most (not all) cases, a filter should be applied to the WHERE clause, as it is a filter of the overall query, not of the join itself.
AFAIK, the only instance where this really affects the outcome of the query is when using an OUTER JOIN.
Consider the following queries:
SELECT *
FROM Customer c
LEFT JOIN Orders o ON c.CustomerId = o.CustomerId
WHERE o.OrderType = 'InternetOrder'
vs.
SELECT *
FROM Customer c
LEFT JOIN Orders o ON c.CustomerId = o.CustomerId AND o.OrderType = 'InternetOrder'
The first will return one row for each customer order that has an order type of 'InternetOrder'. In effect, your left join has become an inner join because of the filter that was applied to the whole query (i.e. customers who do not have an 'InternetOrder' will not be returned at all).
The second will return at least one row for each customer. If the customer has no orders of order type 'InternetOrder', it will return null values for all order table fields. Otherwise it will return one row for each customer order of type 'InternetOrder'.
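The two behaviours can be reproduced with a small sketch. It runs in Python's sqlite3 module rather than MySQL, and the customer names and order rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER, OrderType TEXT);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders VALUES (10, 1, 'InternetOrder'), (11, 2, 'PhoneOrder');
""")

# Filter in WHERE: rows with a NULL or non-matching OrderType are removed
# after the join, so Bob disappears.
filter_in_where = conn.execute("""
    SELECT c.Name, o.OrderId
    FROM Customer c
    LEFT JOIN Orders o ON c.CustomerId = o.CustomerId
    WHERE o.OrderType = 'InternetOrder'
    ORDER BY c.CustomerId
""").fetchall()

# Filter in ON: Bob has no matching order, so he is kept with NULLs.
filter_in_on = conn.execute("""
    SELECT c.Name, o.OrderId
    FROM Customer c
    LEFT JOIN Orders o ON c.CustomerId = o.CustomerId
                      AND o.OrderType = 'InternetOrder'
    ORDER BY c.CustomerId
""").fetchall()

print(filter_in_where)  # [('Ann', 10)]
print(filter_in_on)     # [('Ann', 10), ('Bob', None)]
```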
If the constraint is based off the joined table (as yours is) then it makes sense to specify the constraint when you join.
This way MySQL is able to reduce the rows from the joined table at the time it joins, as otherwise it will need to be able to select all data that fulfills the basic JOIN criteria prior to applying the WHERE logic.
In reality you'll see little difference in performance until you get to more complex queries or larger datasets. However, limiting the data at each JOIN will be more efficient overall if done well, especially if there are good indexes on the joined table.

Use of IN and EXISTS in SQL

Assuming that one has three Tables in a Relational Database as :
Customer(Id, Name, City),
Product(Id, Name, Price),
Orders(Cust_Id, Prod_Id, Date)
My first question is what is the best way to execute the query: "Get all the Customers who ordered a Product".
Some people propose the query with EXISTS as:
Select *
From Customer c
Where Exists (Select Cust_Id from Orders o where c.Id=o.cust_Id)
Is the above query equivalent (can it be written?) as:
Select *
From Customer
Where Exists (select Cust_id from Orders o Join Customer c on c.Id=o.cust_Id)
What is the problem when we use IN instead of EXISTS apart from the performance as:
Select *
From Customer
Where Customer.Id IN (Select o.cust_Id from Orders o )
Do the three above queries return exactly the same records?
Update: How does the EXISTS evaluation really work in the second query (or the first), considering that it only checks whether the subquery returns true or false? What is the "interpretation" of the query, i.e.:
Select *
From Customer c
Where Exists (True)
The first two queries are different.
The first has a correlated subquery and will return what you want -- information about customers who have an order.
The second has an uncorrelated subquery. It will return either all customers or no customers, depending on whether or not any customers have placed an order.
The third query is an alternative way of expressing what you want.
The only possible issue that I can think of would arise when cust_id might have NULL values. In such a case, the first and third queries may not return the same results.
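Here is a minimal sketch of the three queries on toy data, run through Python's sqlite3 module. The rows are invented for illustration, and customer 3 deliberately has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT, City TEXT);
    CREATE TABLE Orders (Cust_Id INTEGER, Prod_Id INTEGER, Date TEXT);
    INSERT INTO Customer VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome'), (3, 'Cy', 'Lima');
    INSERT INTO Orders VALUES (1, 7, '2020-01-01'), (1, 8, '2020-01-02'), (2, 7, '2020-01-03');
""")

def ids(sql):
    """Return the customer Ids selected by a query, as a plain list."""
    return [row[0] for row in conn.execute(sql)]

# 1. Correlated EXISTS: the subquery refers to the outer row (c.Id).
q1 = ids("""
    SELECT Id FROM Customer c
    WHERE EXISTS (SELECT Cust_Id FROM Orders o WHERE c.Id = o.Cust_Id)
    ORDER BY Id
""")

# 2. Uncorrelated EXISTS: the inner c is a separate copy of Customer,
#    so the test is the same for every outer row.
q2 = ids("""
    SELECT Id FROM Customer
    WHERE EXISTS (SELECT Cust_Id FROM Orders o JOIN Customer c ON c.Id = o.Cust_Id)
    ORDER BY Id
""")

# 3. IN with an uncorrelated subquery.
q3 = ids("""
    SELECT Id FROM Customer
    WHERE Customer.Id IN (SELECT o.Cust_Id FROM Orders o)
    ORDER BY Id
""")

print(q1)  # [1, 2]
print(q2)  # [1, 2, 3]
print(q3)  # [1, 2]
```

With these rows the second query returns every customer, including the one without orders, because its subquery returns at least one row regardless of which outer row is being tested.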
The first and third queries should return identical result sets.
Your second query is incorrect, as @ypercube points out in the comments: you're checking whether an uncorrelated subquery EXISTS.
Of the two that work (1, 3), I'd expect #3 to be the fastest depending on your tables because it only executes the subquery one time.
However your most effective result is probably none of them but this:
SELECT DISTINCT
c.*
FROM
Customer c
JOIN
Orders o
ON o.[cust_id] = c.[Id]
because it should just be an index scan and a hash.
You should check the query plans and/or benchmark each one.
The best way to execute that query is to add orders to the from clause and join to it.
select distinct c.*
from customers c,
orders o
where c.id = o.cust_id
Your other queries may be less efficient (depending on the shape of the data) but they should all return the same result set.