Joins and Subqueries - SQL

I am aware that correlated subqueries use a WHERE clause rather than joins.
But if a WHERE clause and an inner join can produce the same outcome, why can't we write these queries with joins?
For example,
SELECT FirstName, LastName, (SELECT COUNT(O.Id) FROM [Order] O WHERE O.CustomerId = C.Id) As OrderCount
FROM Customer C
Now, why can't we write it like this instead?
SELECT FirstName, LastName, (SELECT COUNT(O.Id) FROM [Order] O Inner Join
C On O.CustomerId = C.Id) As OrderCount
FROM Customer C
I know SQL quite well and have worked with it a lot, but I am just looking for a clear technical explanation.
Thanks.

This is your query:
SELECT
FirstName,
LastName,
(
SELECT COUNT(O.Id)
FROM [Order] O
INNER JOIN C On O.CustomerId = C.Id
) AS OrderCount
FROM Customer C;
It is invalid, because in the sub query you are selecting from C.
This is a bit complicated to explain. In a query, we deal with tables and table rows. E.g.:
select person.name from person;
FROM person means "from the table person". person.name means "a person's name", so it is referring to a row. It would be great if we could write:
select person.name from persons;
but SQL doesn't know about singular and plural in your language, so this is not possible.
In your query FROM Customer C means "from the customer table, which I'm going to call C for short". But in the rest of the query, including the sub query, C refers to one customer row. So you cannot say INNER JOIN C, because you can only join to a table, not to a table row.
One might try to make this distinction clear by using plural names for tables and singular names for table aliases. If you made that a habit, you'd have FROM Customers Customer in your main query and INNER JOIN Customer in your inner query, and the habit would remind you that a singular (a row) doesn't belong in a FROM clause. But one quickly gets accustomed to the double meaning (row and table) of a table name in a query, so this would be rather over-defensive; instead we use alias names to keep queries short and readable, just as you do when abbreviating Customer to C.
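To see where the line is, note that the sub query may well contain a join to the Customer table again, as long as it joins the table under a new alias and still correlates to the outer row C. A sketch (C2 is just a made-up alias):
SELECT FirstName, LastName,
(
SELECT COUNT(O.Id)
FROM [Order] O
INNER JOIN Customer C2 ON O.CustomerId = C2.Id -- C2 names the table again
WHERE C2.Id = C.Id -- C still refers to the outer row
) AS OrderCount
FROM Customer C;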
But yes, you can use joins instead of sub queries in the SELECT clause. Either move the sub query to the FROM clause:
SELECT
c.firstname,
c.lastname,
COALESCE(o.ordercount, 0) AS ordercount
FROM customer c
LEFT JOIN
(
SELECT customerid, COUNT(*) AS ordercount
FROM [order]
GROUP BY customerid
) o ON o.customerid = c.id;
Or join without a sub query:
SELECT
c.firstname,
c.lastname,
COUNT(o.customerid) AS ordercount
FROM customer c
LEFT JOIN [order] o ON o.customerid = c.id
GROUP BY c.id, c.firstname, c.lastname;

The two queries are functionally equivalent. SQL (in the context of queries) is a declarative language, which means it works by DEFINING WHAT you want to achieve, not HOW to achieve it. So, at the abstract algebraic level, there is absolutely no difference between the two queries. (*)
However, SQL does not operate in the metaphysical realm of algebra but in the real world, where the declarative query has to be translated into a procedural sequence of operations. It is much easier for me to decide that the two queries are equivalent than it is for the RDBMS of your choice. Working out all the consequences of a declarative SQL query can be computationally very hard. This is the job of what is usually called the "query optimizer", which not only has to "understand" the relational algebra but also has to find the statistically best way to implement it procedurally. Therefore, depending on the quality of the optimizer, the intricacy of your schema and query, and the amount of computational resources the optimizer spends on building and optimizing the execution plan, the actual execution plans for the two otherwise equivalent queries can differ. You will still get the same results (as long as you stay in the declarative realm and don't use NOW(), RAND() or other volatile state semantics), but one plan may be faster and another slower. The order of the results may also differ where ORDER BY is missing or ambiguous.
Note: your query can be rewritten this way because it involves an aggregate over the joined side. Not every join can be transposed into a subquery, but there are plenty of other situations where queries are equivalent even though they are expressed differently. My answer applies generically to any pair of mathematically equivalent queries. See also the explanation below.
(*) Query equivalence also depends on the schema. One usual enemy of common sense is NULL values: while a join will filter out NULLs if there is any condition on them, aggregates treat them in various other ways: SUM over only NULLs is NULL, MAX/MIN ignore NULLs, COUNT(*) counts every row while COUNT(column) skips NULLs, COUNT(DISTINCT column) skips them and collapses duplicates on top of that, and so on.
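A small sketch of how NULLs make supposedly equivalent counts diverge, using the customer and order tables from above and assuming a customer with no orders:
SELECT
c.id,
COUNT(*) AS count_star, -- counts the NULL-extended row: 1 even when a customer has no orders
COUNT(o.customerid) AS count_orders -- ignores NULLs: 0 when a customer has no orders
FROM customer c
LEFT JOIN [order] o ON o.customerid = c.id
GROUP BY c.id;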


Join after Group by performance

Join tables and then group by multiple columns (like title) or group rows in sub-query and then join other tables?
Is the second method slow because of the lack of indexes after grouping? Should I order the rows manually for the second method to trigger a merge join instead of a nested loop?
How to do it properly?
This is the first method. It became quite a mess because contragent_title and product_title are required to be in the GROUP BY in strict mode, and I work with strict GROUP BY mode only.
SELECT
s.contragent_id,
s.contragent_title,
s.product_id AS sort_id,
s.product_title AS sort_title,
COALESCE(SUM(s.amount), 0) AS amount,
COALESCE(SUM(s.price), 0) AS price,
COALESCE(SUM(s.discount), 0) AS discount,
COUNT(DISTINCT s.product_id) AS sorts_count,
COUNT(DISTINCT s.contragent_id) AS contragents_count,
dd.date,
~grouping(dd.date, s.contragent_id, s.product_id) :: bit(3) AS mask
FROM date_dimension dd
LEFT JOIN (
SELECT
s.id,
s.created_at,
s.contragent_id,
ca.title AS contragent_title,
p.id AS product_id,
p.title AS product_title,
sp.amount,
sp.price,
sp.discount
FROM sales s
LEFT JOIN sold_products sp
ON s.id = sp.sale_id
LEFT JOIN products p
ON sp.product_id = p.id
LEFT JOIN contragents ca
ON s.contragent_id = ca.id
WHERE s.created_at BETWEEN :caf AND :cat
AND s.plant_id = :plant_id
AND (s.is_cache = :is_cache OR :is_cache IS NULL)
AND (sp.product_id = :sort_id OR :sort_id IS NULL)
) s ON dd.date = date(s.created_at)
WHERE (dd.date BETWEEN :caf AND :cat)
GROUP BY GROUPING SETS (
(dd.date, s.contragent_id, s.contragent_title, s.product_id, s.product_title),
(dd.date, s.contragent_id, s.contragent_title),
(dd.date)
)
This is an example of what you are talking about:
Join, then aggregate:
select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = d.department_id and e.first_name = 'John'
group by d.department_id;
Aggregate then join:
select d.name, coalesce(number_of_johns, 0) as number_of_johns
from departments d
left join
(
select department_id, count(*) as number_of_johns
from employees
where first_name = 'John'
group by department_id
) e on e.department_id = d.department_id;
Question
You want to know whether one is faster than the other, assuming the latter may be slower because it loses the direct table links via IDs. (While every query result is a table, and hence so is the subquery result, it is not a physical table stored in the database and therefore has no indexes.)
Thinking and guessing
Let's see what the queries do:
The first query is supposed to join all departments and employees and only keep the Johns. How will it do that? It will probably find all Johns first. If there is an index on employees(first_name), it will probably use that; otherwise it will read the full table. Then it finds the counts by department_id. If the index I mentioned even contained the department (an index on employees(first_name, department_id)), the DBMS would already have the Johns presorted and could just count. If it doesn't, the DBMS may sort the employee rows now and then count, or use some other method of counting. And if we were looking for two names instead of just one, the compound index would be of little or no benefit compared to the mere index on first_name. Finally the DBMS will read all departments and join the counts it found. But our count result rows are not a table, so there is no index we can use. Anyway, the DBMS will either just loop over the results or have them sorted anyway, so the join is easy peasy. So much for what I think the DBMS will do. There are a lot of ifs in my assumptions, and the DBMS may still have other methods to choose from, or may not use an index at all because the tables are so small anyway, or whatever.
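If you wanted to hand the DBMS the compound index mentioned above, it would look roughly like this (the index name is made up, and whether the planner actually uses it depends on data volume and statistics):
create index employees_first_name_department_id on employees (first_name, department_id);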
The second query, well, same same.
Answer
You see, we can only guess how a DBMS will approach joins with aggregations. It may or may not come up with the same execution plan for the two queries. A perfect DBMS would create the same plan, as the two queries do the same thing. A not so perfect DBMS may create different plans, but which is better we can hardly guess. Let's just rely on the DBMS to do a good job concerning this.
I mainly use Oracle and just tried roughly the same thing as shown here with two of my tables. It shows exactly the same execution plan for both queries. PostgreSQL is also a great DBMS. Nothing to worry about, I'd say :-)
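If you want to check on your own system, compare the plans; a sketch in PostgreSQL syntax, run once for each variant of the query:
explain analyze
select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = d.department_id and e.first_name = 'John'
group by d.department_id; -- assuming department_id is the primary key of departments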
Better focus on writing readable, maintainable queries. With these small queries there is no big difference; the first one is a tad more compact and easier to grasp, the second a tad more sophisticated.
I, personally, prefer the second query. It is good style to aggregate before joining, and such queries can easily be extended with further aggregations, which can be much more difficult with the first one. Only if I ran into performance issues would I try a different approach.
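For example, extending the aggregate-before-join version with a second, differently filtered count is just one more derived table (the name 'Mary' is of course made up):
select d.name,
coalesce(j.number_of_johns, 0) as number_of_johns,
coalesce(m.number_of_marys, 0) as number_of_marys
from departments d
left join
(
select department_id, count(*) as number_of_johns
from employees
where first_name = 'John'
group by department_id
) j on j.department_id = d.department_id
left join
(
select department_id, count(*) as number_of_marys
from employees
where first_name = 'Mary'
group by department_id
) m on m.department_id = d.department_id;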

SQL Question: Does the order of the WHERE/INNER JOIN clauses when interlinking tables matter?

Exam Question (AQA A-level Computer Science):
[Primary keys shown by asterisks]
Athlete(*AthleteID*, Surname, Forename, DateOfBirth, Gender, TeamName)
EventType(*EventTypeID*, Gender, Distance, AgeGroup)
Fixture(*FixtureID*, FixtureDate, LocationName)
EventAtFixture(*FixtureID*, *EventTypeID*)
EventEntry(*FixtureID*, *EventTypeID*, *AthleteID*)
A list is to be produced of the names of all athletes who are competing in the fixture
that is taking place on 17/09/18. The list must include the Surname, Forename and
DateOfBirth of these athletes and no other details. The list should be presented in
alphabetical order by Surname.
Write an SQL query to produce the list.
I understand that you could do this two ways, one using a WHERE clause and the other using the INNER JOIN clause. However, I am wondering if the order matters when linking the tables.
First exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete, EventEntry, Fixture
WHERE FixtureDate = "17/09/2018"
AND Athlete.AthleteID = EventEntry.AthleteID
AND EventEntry.FixtureID = Fixture.FixtureID
ORDER BY Surname
Regarding the first exemplar solution: would it still be correct if I were to switch the order of the tables in the WHERE clause comparisons, for example:
WHERE FixtureDate = "17/09/2018"
AND EventEntry.AthleteID = Athlete.AthleteID
AND Fixture.FixtureID = EventEntry.FixtureID
I have the same question for the INNER JOIN clause too; here is the second exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete
INNER JOIN EventEntry ON Athlete.AthleteID = EventEntry.AthleteID
INNER JOIN Fixture ON EventEntry.FixtureID = Fixture.FixtureID
WHERE FixtureDate = "17/09/2018"
ORDER BY Surname
Again, would it be correct if I used this order instead:
INNER JOIN EventEntry ON Fixture.FixtureID = EventEntry.FixtureID
If the order does matter, could somebody explain to me why it is in the order shown in the examples?
Some advice:
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Use table aliases that are abbreviations for the table names.
Use standard date formats!
Qualify all column names.
Then, the order of the comparisons doesn't matter for equality. I would recommend using a canonical ordering.
So, the query should look more like:
SELECT a.Surname, a.Forename, a.DateOfBirth
FROM Athlete a INNER JOIN
EventEntry ee
ON a.AthleteID = ee.AthleteID INNER JOIN
Fixture f
ON ee.FixtureID = f.FixtureID
WHERE f.FixtureDate = '2018-09-17'
ORDER BY a.Surname;
I am guessing that all the columns in the SELECT come from Athlete. If that is not true, then adjust the table aliases.
There are lots of stylistic conventions for SQL, and Gordon Linoff's answer mentions some of the perennial ones.
There are a few answers to your question.
The most important is that (notionally) SQL is a declarative language - you tell it what you want it to do, not how to do it. In a procedural language (like C, or Java, or PHP), the order of execution really matters - the sequence of instructions is part of the procedure. In a declarative language, the order doesn't really matter.
This wasn't always totally true - older query optimizers seemed to like the more selective where clauses earlier in the statement, for performance reasons. I haven't seen that for a couple of decades now, so I assume that's not really a thing anymore.
Because order doesn't matter, but correctly understanding the intent of a query does, many SQL developers emphasize readability. That's why we like explicit join syntax, and meaningful aliases. And for readability, the sequence of instructions can help. I favour starting with the "most important" table, usually the one from which you're selecting most columns, and then follow a logical chain of joins from one table to the next. This makes it easier to follow the logic.
When you use inner joins, the order does not matter as long as each prerequisite table appears before the table that depends on it. In your example both joins start from the Athlete table, so the order doesn't matter. If, however, the same query were written starting from EventEntry (for whatever reason), you would have to join to Athlete in the first inner join, otherwise you could not join to Fixture. As recommended, it is best to use standard join syntax, and preferably to place all inner joins before all left joins. If you can't, you should review the query, because a left join that has to sit inside the group of inner joins will probably behave like an inner join. That is because an inner join below it uses the left-joined table (otherwise you could have placed the left join after the inner-join block), so when the left side is NULL the left join itself is fine, but the inner join below it will cut the row.
When the above cases do not apply and all inner joins can be placed in any order, only performance matters. Usually putting the tables with high cardinality first performs better, though there are cases where the opposite works better. So if the order is free, you may try ordering the tables from higher to lower cardinality, or the other way round - whatever works faster.
Clarifying: by "prerequisite table" I mean the table that a joined table needs in its join condition: ... join B on [whatever] join C on c.id = b.cid - here table B is a prerequisite for table C.
I mention left joins because, while the question is about the order of inner joins, when inner and left joins are mixed the order of the joins matters by itself (keeping the inner joins all on top), as it may affect the query logic:
... join B on [whatever] left join C on c.id = b.cid join D on d.id = c.did
In the above example the left join sneaks into the sequence of inner joins. We cannot move it after D, because it is a prerequisite for D. But for rows where the condition c.id = b.cid is not satisfied, the whole C row is NULL-extended, and then the entire result row (B+C+D) drops out of the results because of the d.id = c.did condition of the following inner join. Such an example needs review, because the purpose of the left join evaporates under the following inner join. In conclusion, when inner joins are mixed with left joins, it is better to keep all the inner joins on top, with no left joins interfering.
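A small sketch with made-up tables a, b, c and d to illustrate the point; the first form silently loses the outer semantics, the second keeps them:
select *
from a
join b on b.aid = a.id
left join c on c.id = b.cid
join d on d.id = c.did; -- rows where c is NULL cannot satisfy this, so the left join effectively becomes an inner join

select *
from a
join b on b.aid = a.id
left join c on c.id = b.cid
left join d on d.id = c.did; -- one way to keep the outer semantics: make the dependent join a left join as well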

Sub query select statement vs inner join

I'm confused about these two statements: which is faster, which is more common to use, and which is best for memory?
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)
vs
select p.id, p.name, (select id from work where id=p.wid) , (select name from work where id=p.wid)
from person p
where p.id in (somenumbers)
The whole idea of this is that if I have a huge database and I do an inner join, it will take memory and perform worse to join the work table and the person table, whereas the subquery select statements only select one statement at a time. So which is best here?
First, the two queries are not the same. The first filters out any rows that have no matching rows in work.
The equivalent first query uses a left join:
select p.id, p.name, w.id, w.name
from person p left join
work w
on w.id = p.wid
where p.id in (somenumbers);
Then, the second query can be simplified to:
select p.id, p.name, p.wid,
(select w.name from work w where w.id = p.wid)
from person p
where p.id in (somenumbers);
There is no reason to look up the id in work when it is already present in person.
If you want optimized queries, then you want indexes on person(id, wid, name) and work(id, name).
With these indexes, the two queries should have basically the same performance. The subquery will use the index on work for fetching the rows from work and the where clause will use the index on person. Either query should be fast and scalable.
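The index definitions would look roughly like this (the index names are made up, and covering-index details vary by DBMS):
create index idx_person_id_wid_name on person (id, wid, name);
create index idx_work_id_name on work (id, name);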
The subqueries in your second example will execute once for every row, which will perform badly. That said, some optimizers may be able to convert it to a join for you - YMMV.
A good rule to follow in general is: much prefer joins to subqueries.
Joins give better performance in comparison with subqueries. A join on an int column, or an index on the join column, gives the best performance.
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)
It really depends on how you want to optimize the query (including, but not limited to, adding/removing/reordering indexes).
I have found that a setup which makes the join soar might make the subquery suffer, and the opposite can also be true. Thus there is not much point in comparing them under the same setup.
I choose to use, and optimize for, joins. In my experience a join, under its best setup, rarely loses to a subquery, and it is a lot easier to read.
When the vendor stuffs the system with an extreme load of queries full of subqueries, then thanks to the query optimization from my other work it simply isn't worth the effort to change them unless performance starts to crawl.

SQL Cross Join better in performance than normal join?

I'm currently working with SQL and wondered about cross join.
Assuming I have the following relations:
customer(customerid, firstname, lastname)
transact(customerid, productid, date, quantity)
product(productid, description)
This query is written in Oracle SQL. It should select the last name of all customers who bought more than 1000 units of a product (rather pointless, but no matter):
SELECT c.lastname, t.date
FROM customer c, transact t
WHERE t.quantity > 1000
AND t.customerid = c.customerid
Isn't this doing a cross join?! Isn't this extremely slow when the tables consist of a huge amount of data?
Isn't it better to do something like this:
SELECT c.lastname, t.date
FROM customer c
JOIN transact t ON(c.customerid = t.customerid)
WHERE t.quantity > 1000
Which is better in performance? And how are these queries handled internally?
Thanks for your help,
Barbara
The two queries aren't equivalent, because:
SELECT lastname, date
FROM customer, transact
WHERE quantity > 1000
It doesn't actually limit the result to customers that bought more than 1000; it simply takes every combination of rows from the two tables and excludes any with a quantity less than or equal to 1000 (all customers will be returned).
This query is equivalent to your JOIN version:
SELECT lastname, date
FROM customer c, transact t
WHERE quantity > 1000
AND c.customerid = t.customerid
The explicit JOIN version is preferred as it's not deprecated syntax, but both should have the same execution plan and identical performance. The explicit JOIN version is easier to read in my opinion, but the fact that the comma-separated/implicit method has been outdated for over a decade (two?) should be enough reason to avoid it.
This is too long for a comment.
If you want to know how they are handled then look at the query plan.
In your case, the queries are not the same. The first does a cross join with conditions on only one table. The second does a legitimate join. The second is the right way to write the query.
However, even if you included the correct where clause in the first query, the performance should be the same. Oracle is smart enough to recognize that the two queries do the same thing (if written correctly).
Simple rule: never use commas in the from clause. Always use explicit join syntax.
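If you want to see how the queries are handled internally, you can compare the execution plans yourself; a sketch for Oracle, run once for each version of the query:
EXPLAIN PLAN FOR
SELECT c.lastname
FROM customer c
JOIN transact t ON t.customerid = c.customerid
WHERE t.quantity > 1000;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);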

Why does changing the ordering of the tables in the FROM clause make the SQL execution time different?

Edit: I don't know why this question is getting so much hate, but maybe it's because of confusion about what I'm asking. I purposely used /*+ ORDERED */ to control the order of execution and changed the ordering of the tables in the FROM clause. I was wondering WHY the execution time can change. Is it because of the join order? Is it because of the table size? Hope this clears up the confusion.
So I was just playing around with SQL queries and realized the following: if I change the ordering of the tables in the FROM clause, the execution time can be very different. The following query runs in about 0.966 sec, but if I move OrderDetails d to the end of the FROM clause, the execution takes only 0.573 sec! Any reason behind this? I was using Oracle SQL Developer.
SELECT /*+ ORDERED */
su.CompanyName, CategoryName, ProductName, c.CompanyName, c.country,
FirstName, LastName, Quantity, d.UnitPrice, sh.CompanyName
FROM
OrderDetails d, Suppliers su, Shippers sh, Categories t, Products p,
Employees e, Customers c, orders o
WHERE
t.CategoryID = p.CategoryID
AND c.CustomerID = o.CustomerID
AND e.EmployeeID = o.EmployeeID
AND o.OrderID = d.OrderID
AND p.ProductID = d.ProductID
AND sh.ShipperID = o.ShipVia
AND su.SupplierID = p.SupplierID
AND LOWER(ProductName) Like '%lager%'
AND LOWER(c.city) IN ('vancouver', 'london', 'charleroi', 'cunewalde')
AND d.Quantity BETWEEN 5 AND 100
AND (RequiredDate-ShippedDate > 10)
ORDER BY
c.CompanyName;
Uh, you are specifying the ordered hint. As described in the documentation:
The ORDERED hint causes Oracle to join tables in the order in which
they appear in the FROM clause.
Usually, Oracle (or any other optimizer) finds an optimal ordering for the joins, so the ordering in the FROM clause does not matter. But with the ORDERED hint, you are specifying the order of the joins. Hence, changing the order of tables in the FROM clause can have a big impact on execution.
By the way, you should learn to use modern, explicit join syntax.
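For reference, a sketch of the same query in explicit join syntax; the column-to-table assignments are my guesses based on the usual Northwind-style schema, and the ORDERED hint is kept only to match the question (with it, the join order follows the JOIN sequence):
SELECT /*+ ORDERED */
su.CompanyName, t.CategoryName, p.ProductName, c.CompanyName, c.country,
e.FirstName, e.LastName, d.Quantity, d.UnitPrice, sh.CompanyName
FROM OrderDetails d
JOIN Orders o ON o.OrderID = d.OrderID
JOIN Products p ON p.ProductID = d.ProductID
JOIN Categories t ON t.CategoryID = p.CategoryID
JOIN Suppliers su ON su.SupplierID = p.SupplierID
JOIN Shippers sh ON sh.ShipperID = o.ShipVia
JOIN Employees e ON e.EmployeeID = o.EmployeeID
JOIN Customers c ON c.CustomerID = o.CustomerID
WHERE LOWER(p.ProductName) LIKE '%lager%'
AND LOWER(c.city) IN ('vancouver', 'london', 'charleroi', 'cunewalde')
AND d.Quantity BETWEEN 5 AND 100
AND o.RequiredDate - o.ShippedDate > 10
ORDER BY c.CompanyName;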
You have an /*+ ORDERED */ optimizer hint.
http://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#5555
The ORDERED hint causes Oracle to join tables in the order in which they appear in the FROM clause.
To fully understand the matter I would recommend reading a database book, especially the chapters on search and join algorithms and on query optimization.
For example, in the nested loop join algorithm we put the larger table in the outer loop and the smaller one in the inner loop; that way we get fewer disk accesses.
The outer loop reads its data only once, while the inner loop reads the same data multiple times. That's why we iterate over the larger table in the outer loop.