Nested Query or Joins - sql

I have heard that joins should be preferred over nested queries. Is that true in general, or are there scenarios where one is faster than the other?
For example, which is the more efficient way to write this query?
Select emp.salary
from employee emp
where emp.id = (select s.id from sap s where s.id = 111)
OR
Select emp.salary
from employee emp
INNER JOIN sap s ON emp.id = s.id
WHERE s.id = 111

I have heard joins should be preferred over nested queries. Is it true in general?
It depends on the requirements, and the data.
Using a JOIN risks duplicating the parent table's rows in the result set if there is more than one child record related to a parent, because a JOIN returns every pair of rows that match. That means if you want unique values from the parent table while using JOINs, you need to use either DISTINCT or a GROUP BY clause. None of this is a concern if a subquery is used.
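The duplication risk can be reproduced with a small sketch. It uses Python's built-in sqlite3 module purely as a convenient engine; the table contents are made up, and sap.id is deliberately left non-unique so that employee 111 has two matching child rows.

```python
import sqlite3

# Hypothetical data: employee 111 has two matching rows in sap.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, salary INTEGER);
    CREATE TABLE sap (id INTEGER);          -- assumption: id is not unique here
    INSERT INTO employee VALUES (111, 5000), (222, 6000);
    INSERT INTO sap VALUES (111), (111);
""")

# JOIN: the parent row is returned once per matching child row.
join_rows = conn.execute("""
    SELECT emp.salary
    FROM employee emp
    INNER JOIN sap s ON emp.id = s.id
    WHERE s.id = 111
""").fetchall()

# Subquery with IN: the parent row is returned once.
subquery_rows = conn.execute("""
    SELECT emp.salary
    FROM employee emp
    WHERE emp.id IN (SELECT s.id FROM sap s WHERE s.id = 111)
""").fetchall()

# JOIN plus DISTINCT: the duplicates are removed again.
distinct_rows = conn.execute("""
    SELECT DISTINCT emp.salary
    FROM employee emp
    INNER JOIN sap s ON emp.id = s.id
    WHERE s.id = 111
""").fetchall()

print(len(join_rows), len(subquery_rows), len(distinct_rows))
```

If sap.id were declared unique, the plain JOIN would return a single row as well, so whether the duplicates appear depends on the data.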
Also, subqueries are not all the same. There's the straight evaluation, like your example:
where emp.id = (select s.id from sap s where s.id = 111)
...and the IN clause:
where emp.id IN (select s.id from sap s where s.id = 111)
...which will match any of the values returned by the subquery, whereas the straight evaluation will throw an error if the subquery returns more than one value. But there's also the EXISTS clause...
WHERE EXISTS(SELECT NULL
FROM SAP s
WHERE emp.id = s.id
AND s.id = 111)
The EXISTS is different in that:
the SELECT clause doesn't get evaluated - you can change it to SELECT 1/0, which should trigger a divide-by-zero error but won't
it returns true/false; it is true as soon as the first row satisfies the criteria, so it can be faster when dealing with duplicates.
unlike the IN clause, EXISTS supports comparing two or more columns at the same time, although some databases do support tuple comparison with IN.
it's more readable

If the queries are logically equivalent, then the query optimizer should be able to make the same (best) execution plan from each one. In that case, query style should support what can be understood the best (that's subqueries for me).

It is much faster (and easier to write) to join two tables on an index than to run two separate queries (even a subquery).

Related

Join after Group by performance

Should I join tables and then group by multiple columns (like title), or group rows in a sub-query and then join other tables?
Is the second method slow because of the lack of indexes after grouping? Should I order rows manually for the second method to trigger a merge join instead of a nested loop?
How do I do it properly?
This is the first method. It became quite a mess because contragent_title and product_title are required to be in the GROUP BY for strict mode, and I work with strict GROUP BY mode only.
SELECT
s.contragent_id,
s.contragent_title,
s.product_id AS sort_id,
s.product_title AS sort_title,
COALESCE(SUM(s.amount), 0) AS amount,
COALESCE(SUM(s.price), 0) AS price,
COALESCE(SUM(s.discount), 0) AS discount,
COUNT(DISTINCT s.product_id) AS sorts_count,
COUNT(DISTINCT s.contragent_id) AS contragents_count,
dd.date,
~grouping(dd.date, s.contragent_id, s.product_id) :: bit(3) AS mask
FROM date_dimension dd
LEFT JOIN (
SELECT
s.id,
s.created_at,
s.contragent_id,
ca.title AS contragent_title,
p.id AS product_id,
p.title AS product_title,
sp.amount,
sp.price,
sp.discount
FROM sales s
LEFT JOIN sold_products sp
ON s.id = sp.sale_id
LEFT JOIN products p
ON sp.product_id = p.id
LEFT JOIN contragents ca
ON s.contragent_id = ca.id
WHERE s.created_at BETWEEN :caf AND :cat
AND s.plant_id = :plant_id
AND (s.is_cache = :is_cache OR :is_cache IS NULL)
AND (sp.product_id = :sort_id OR :sort_id IS NULL)
) s ON dd.date = date(s.created_at)
WHERE (dd.date BETWEEN :caf AND :cat)
GROUP BY GROUPING SETS (
(dd.date, s.contragent_id, s.contragent_title, s.product_id, s.product_title),
(dd.date, s.contragent_id, s.contragent_title),
(dd.date)
)
This is an example of what you are talking about:
Join, then aggregate:
select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = d.department_id
and e.first_name = 'John'
group by d.department_id;
Aggregate then join:
select d.name, coalesce(number_of_johns, 0) as number_of_johns
from departments d
left join
(
select department_id, count(*) as number_of_johns
from employees
where first_name = 'John'
group by department_id
) e on e.department_id = d.department_id;
Question
You want to know whether one is faster than the other, assuming the latter may be slower for losing the direct table links via IDs. (While every query result is a table, and hence the subquery result is too, it is not a physical table stored in the database and hence has no indexes.)
Thinking and guessing
Let's see what the queries do:
The first query is supposed to join all departments and employees and only keep the Johns. How will it do that? It will probably find all Johns first. If there is an index on employees(first_name), it will probably use that; otherwise it will read the full table. Then it finds the counts by department_id. If the index even contained the department (an index on employees(first_name, department_id)), the DBMS would now have the Johns presorted and could just count. If it doesn't, the DBMS may order the employee rows now and count then, or use some other method for counting. And if we were looking for two names instead of just one, the compound index would be of little or no benefit compared to the mere index on first_name.
Finally, the DBMS will read all departments and join the found counts. But our count result rows are not a table, so there is no index we can use. Anyway, the DBMS will either just loop over the results or have them sorted anyway, so the join is easy peasy.
So much for what I think the DBMS will do. There are a lot of ifs in my assumptions, and the DBMS may still have other methods to choose from, or won't use an index at all because the tables are so small anyway, or whatever.
The second query, well, same same.
Answer
You see, we can only guess how a DBMS will approach joins with aggregations. It may or may not come up with the same execution plan for the two queries. A perfect DBMS would create the same plan, as the two queries do the same thing. A not so perfect DBMS may create different plans, but which is better we can hardly guess. Let's just rely on the DBMS to do a good job concerning this.
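That the two styles describe the same result can at least be checked on sample data. The sketch below uses Python's sqlite3 module with made-up rows; it writes the join condition as e.department_id = d.department_id and keeps the 'John' filter in the ON clause of the first query, so that departments without any John still show up with a count of 0 in both versions. It says nothing about which execution plan a given DBMS picks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        first_name    TEXT,
        department_id INTEGER REFERENCES departments(department_id)
    );
    -- Hypothetical sample rows.
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'IT'), (3, 'HR');
    INSERT INTO employees VALUES
        (1, 'John', 1), (2, 'John', 1), (3, 'Mary', 1),
        (4, 'John', 2), (5, 'Anna', 3);
""")

# Join, then aggregate.
join_then_aggregate = conn.execute("""
    SELECT d.name, COUNT(e.employee_id) AS number_of_johns
    FROM departments d
    LEFT JOIN employees e
           ON e.department_id = d.department_id
          AND e.first_name = 'John'
    GROUP BY d.department_id, d.name
    ORDER BY d.department_id
""").fetchall()

# Aggregate, then join.
aggregate_then_join = conn.execute("""
    SELECT d.name, COALESCE(e.number_of_johns, 0) AS number_of_johns
    FROM departments d
    LEFT JOIN (
        SELECT department_id, COUNT(*) AS number_of_johns
        FROM employees
        WHERE first_name = 'John'
        GROUP BY department_id
    ) e ON e.department_id = d.department_id
    ORDER BY d.department_id
""").fetchall()

print(join_then_aggregate)
print(aggregate_then_join)
```

To compare plans rather than results, the usual route is the DBMS's own EXPLAIN facility, whose output differs between products.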
I am using Oracle mainly and just tried about the same thing as shown with two of my tables. It shows exactly the same execution plan for both queries. PostgreSQL is also a great DBMS. Nothing to worry about, I'd say :-)
Better to focus on writing readable, maintainable queries. With these small queries there is no big difference; the first one is a tad more compact and easy to grasp, the second a tad more sophisticated.
I, personally, prefer the second query. It is good style to aggregate before joining, and such queries can be easily extended with further aggregations, which can be much more difficult with the first one. Only if I ran into performance issues would I try a different approach.

Can a correlated subquery be replaced by an inner join?

Given that I'm asking about a small subset of possible uses of a correlated subquery...
I'm working with a vendor product that uses a lot of correlated subqueries. My task is to modify the queries to make them meet our business needs. The vendor's queries appear to be overly complicated and difficult to maintain.
One simple example...
select e.EmployeeLastName
, e.EmployeeFirstName
, e.EmployeeRecordEffectiveDate
, e.EmployeeRecordEndDate
from Employee e
where e.EmployeeLastName like 'a%'
and exists (
select 1
from Position p
where p.PositionId = e.EmployeePositionID
)
order by e.EmployeeRecordEffectiveDate
;
...can be replaced by a simpler query using an inner join...
select e.EmployeeLastName
, e.EmployeeFirstName
, e.EmployeeRecordEffectiveDate
, e.EmployeeRecordEndDate
from Employee e
inner join Position p on p.PositionId = e.EmployeePositionID
where e.EmployeeLastName like 'a%'
order by e.EmployeeRecordEffectiveDate
;
Of course, most of the queries are more complicated. For example, replace the Position table with a subquery - but keep the main structure.
Are there risks involved with making the queries simpler? (maybe involving NULLs...)
Is there an advantage to using a correlated subquery?
I would compare the execution plans between the queries that have correlated subqueries and those that have them replaced with a join. While in many cases correlated subqueries can lead to performance issues if not written properly, there are others where they can actually perform better. For example, a NOT EXISTS subquery usually performs better than a LEFT JOIN ... WHERE [column] IS NULL, while EXISTS subqueries typically perform the same as simple INNER joins.
With that being said, you should compare the performance differences on an individual basis by replacing the correlated subqueries with joins one-by-one.
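As a starting point for such a comparison, here is a minimal sketch of the two anti-join spellings mentioned above, run through Python's sqlite3 module. The rows and the EmployeeId column are hypothetical. It only confirms that both forms return the same employees, including the one whose EmployeePositionID is NULL; timing and execution plans have to be measured on the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Position (PositionId INTEGER PRIMARY KEY);
    CREATE TABLE Employee (
        EmployeeId         INTEGER PRIMARY KEY,   -- hypothetical column
        EmployeeLastName   TEXT,
        EmployeePositionID INTEGER
    );
    INSERT INTO Position VALUES (10), (20);
    INSERT INTO Employee VALUES
        (1, 'adams', 10),    -- has a position
        (2, 'allen', 99),    -- points at a missing position
        (3, 'avery', NULL),  -- no position at all
        (4, 'baker', 20);
""")

# Employees without a matching position, written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT e.EmployeeLastName
    FROM Employee e
    WHERE NOT EXISTS (
        SELECT 1
        FROM Position p
        WHERE p.PositionId = e.EmployeePositionID
    )
    ORDER BY e.EmployeeId
""").fetchall()

# The same question written as LEFT JOIN ... WHERE [column] IS NULL.
left_join_rows = conn.execute("""
    SELECT e.EmployeeLastName
    FROM Employee e
    LEFT JOIN Position p ON p.PositionId = e.EmployeePositionID
    WHERE p.PositionId IS NULL
    ORDER BY e.EmployeeId
""").fetchall()

print(not_exists_rows)
print(left_join_rows)
```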
Assuming that Position(PositionId) is unique, then your two queries are equivalent.
I'm not sure that you will get any performance benefit. EXISTS is usually fine from that perspective.

Semi-join vs Subqueries

What is the difference between semi-joins and a subquery? I am currently taking a course on this on DataCamp and I'm having a hard time making a distinction between the two.
Thanks in advance.
A join or a semi-join is required whenever you want to combine records from two or more entities based on some common attributes.
A subquery, by contrast, is required whenever you want a lookup or a reference on the same table or on other tables.
In short, when your requirement is to add reference columns to the existing table's attributes, go for a join; when you want a lookup on records from the same table or other tables while keeping the same existing columns as output, go for a subquery.
Also, a semi-join can be written as a subquery, because most of the time we don't actually join the right table; instead we maintain a check via a subquery to limit the records in the existing table. Hence it is a semi-join, but it isn't a subquery by itself.
I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select * -- this is often called the "outer" query
from (
select columnA -- this is the subquery inside the parentheses
from mytable
where columnB = 'Y'
)
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join class c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from class
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is typically preferred:
select FirstName, LastName, DOB
from people p
where exists (select pID
from class c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.
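The three spellings above can be compared side by side. This is a sketch using Python's sqlite3 module with invented rows; the class table is assumed to allow the same person to appear twice for the same class, which is what makes a plain join return duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (pID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, DOB TEXT);
    CREATE TABLE class (pID INTEGER, ClassName TEXT);
    -- Hypothetical rows; person 1 is listed twice for 'SQL 101'.
    INSERT INTO people VALUES
        (1, 'Ann', 'Lee', '1990-01-01'),
        (2, 'Bob', 'Ray', '1991-02-02'),
        (3, 'Cy',  'Fox', '1992-03-03');
    INSERT INTO class VALUES
        (1, 'SQL 101'), (1, 'SQL 101'), (2, 'SQL 101'), (3, 'Python 101');
""")

def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# A true join: one output row per matching pair, so Ann shows up twice.
plain_join = run("""
    SELECT p.FirstName FROM people p
    INNER JOIN class c ON c.pID = p.pID
    WHERE c.ClassName = 'SQL 101'
    ORDER BY p.pID
""")

# Semi-join via join + GROUP BY.
group_by = run("""
    SELECT p.FirstName FROM people p
    INNER JOIN class c ON c.pID = p.pID
    WHERE c.ClassName = 'SQL 101'
    GROUP BY p.pID
    ORDER BY p.pID
""")

# Semi-join via IN.
in_subquery = run("""
    SELECT FirstName FROM people
    WHERE pID IN (SELECT pID FROM class WHERE ClassName = 'SQL 101')
    ORDER BY pID
""")

# Semi-join via EXISTS.
exists_subquery = run("""
    SELECT p.FirstName FROM people p
    WHERE EXISTS (SELECT 1 FROM class c
                  WHERE c.pID = p.pID AND c.ClassName = 'SQL 101')
    ORDER BY p.pID
""")

print(plain_join, group_by, in_subquery, exists_subquery)
```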

SQL Question: Does the order of the WHERE/INNER JOIN clause when interlinking table matter?

Exam Question (AQA A-level Computer Science):
[Primary keys shown by asterisks]
Athlete(*AthleteID*, Surname, Forename, DateOfBirth, Gender, TeamName)
EventType(*EventTypeID*, Gender, Distance, AgeGroup)
Fixture(*FixtureID*, FixtureDate, LocationName)
EventAtFixture(*FixtureID*, *EventTypeID*)
EventEntry(*FixtureID*, *EventTypeID*, *AthleteID*)
A list is to be produced of the names of all athletes who are competing in the fixture
that is taking place on 17/09/18. The list must include the Surname, Forename and
DateOfBirth of these athletes and no other details. The list should be presented in
alphabetical order by Surname.
Write an SQL query to produce the list.
I understand that you could do this two ways, one using a WHERE clause and the other using the INNER JOIN clause. However, I am wondering if the order matters when linking the tables.
First exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete, EventEntry, Fixture
WHERE FixtureDate = "17/09/2018"
AND Athlete.AthleteID = EventEntry.AthleteID
AND EventEntry.FixtureID = Fixture.FixtureID
ORDER BY Surname
Would the first exemplar solution still be correct if I were to switch the order of the tables in the WHERE clause, for example:
WHERE FixtureDate = "17/09/2018"
AND EventEntry.AthleteID = Athlete.AthleteID
AND Fixture.FixtureID = EventEntry.FixtureID
I have the same question for the INNER JOIN clause too. Here is the second exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete
INNER JOIN EventEntry ON Athlete.AthleteID = EventEntry.AthleteID
INNER JOIN Fixture ON EventEntry.FixtureID = Fixture.FixtureID
WHERE FixtureDate = "17/09/2018"
ORDER BY Surname
Again, would it be correct if I used this order instead:
INNER JOIN EventEntry ON Fixture.FixtureID = EventEntry.FixtureID
If the order does matter, could somebody explain to me why it is in the order shown in the examples?
Some advice:
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Use table aliases that are abbreviations for the table names.
Use standard date formats!
Qualify all column names.
Then, the order of the comparisons doesn't matter for equality. I would recommend using a canonical ordering.
So, the query should look more like:
SELECT a.Surname, a.Forename, a.DateOfBirth
FROM Athlete a INNER JOIN
EventEntry ee
ON a.AthleteID = ee.AthleteID INNER JOIN
Fixture f
ON ee.FixtureID = f.FixtureID
WHERE f.FixtureDate = '2018-09-17'
ORDER BY a.Surname;
I am guessing that all the columns in the SELECT come from Athlete. If that is not true, then adjust the table aliases.
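A quick way to convince yourself that the operand order of the equality comparisons does not change the result is to run the variants against the same data. The sketch below uses Python's sqlite3 module; the tables are cut down to the columns the query needs and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced versions of the exam tables; the rows are hypothetical.
    CREATE TABLE Athlete (AthleteID INTEGER PRIMARY KEY, Surname TEXT, Forename TEXT, DateOfBirth TEXT);
    CREATE TABLE Fixture (FixtureID INTEGER PRIMARY KEY, FixtureDate TEXT);
    CREATE TABLE EventEntry (FixtureID INTEGER, EventTypeID INTEGER, AthleteID INTEGER,
                             PRIMARY KEY (FixtureID, EventTypeID, AthleteID));
    INSERT INTO Athlete VALUES
        (1, 'Smith', 'Al',  '2000-01-01'),
        (2, 'Jones', 'Bea', '2001-02-02'),
        (3, 'Brown', 'Cat', '2002-03-03');
    INSERT INTO Fixture VALUES (1, '2018-09-17'), (2, '2018-10-01');
    INSERT INTO EventEntry VALUES (1, 1, 1), (1, 1, 3), (2, 1, 2);
""")

def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Comma-style join, comparisons in the order of the first exemplar.
where_original = run("""
    SELECT Surname, Forename, DateOfBirth
    FROM Athlete, EventEntry, Fixture
    WHERE FixtureDate = '2018-09-17'
      AND Athlete.AthleteID = EventEntry.AthleteID
      AND EventEntry.FixtureID = Fixture.FixtureID
    ORDER BY Surname
""")

# Same query with both sides of each equality swapped.
where_swapped = run("""
    SELECT Surname, Forename, DateOfBirth
    FROM Athlete, EventEntry, Fixture
    WHERE FixtureDate = '2018-09-17'
      AND EventEntry.AthleteID = Athlete.AthleteID
      AND Fixture.FixtureID = EventEntry.FixtureID
    ORDER BY Surname
""")

# Explicit JOIN syntax with aliases and qualified columns.
explicit_join = run("""
    SELECT a.Surname, a.Forename, a.DateOfBirth
    FROM Athlete a
    INNER JOIN EventEntry ee ON ee.AthleteID = a.AthleteID
    INNER JOIN Fixture f ON f.FixtureID = ee.FixtureID
    WHERE f.FixtureDate = '2018-09-17'
    ORDER BY a.Surname
""")

print(where_original)
```

Note that an athlete entered in two events at the same fixture would be listed twice by all three variants; the sample rows avoid that case.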
There are lots of stylistic conventions for SQL and @gordonlinoff's answer mentions some of the perennial ones.
There are a few answers to your question.
The most important is that (notionally) SQL is a declarative language - you tell it what you want it to do, not how to do it. In a procedural language (like C, or Java, or PHP), the order of execution really matters - the sequence of instructions is part of the procedure. In a declarative language, the order doesn't really matter.
This wasn't always totally true - older query optimizers seemed to like the more selective where clauses earlier on in the statement for performance reasons. I haven't seen that for a couple of decades now, so assume that's not really a thing.
Because order doesn't matter, but correctly understanding the intent of a query does, many SQL developers emphasize readability. That's why we like explicit join syntax, and meaningful aliases. And for readability, the sequence of instructions can help. I favour starting with the "most important" table, usually the one from which you're selecting most columns, and then follow a logical chain of joins from one table to the next. This makes it easier to follow the logic.
When you use inner joins, order does not matter as long as the prerequisite table comes before the table that depends on it. In your example, EventEntry is the prerequisite for Fixture, so EventEntry has to be joined before Fixture; if the query started from EventEntry instead (for any reason), Athlete and Fixture could then be joined in either order. As recommended, it is best to use standard join syntax, and preferably to place all inner joins before all left joins. If you can't, you need to review the query, because a left join placed inside the group of inner joins will probably behave like an inner join. That is because an inner join below it uses the left-joined table (otherwise you could have placed the left join below the inner block). So when the left join produces NULLs, the left join itself is fine, but the inner join below it will cut the record.
When, however, none of the above cases apply and the inner joins can be placed in any order, only performance matters. Usually putting the table with high cardinality on top performs better, while there are cases where the opposite works better. So if the order is free, you may try ordering tables from higher to lower cardinality or the opposite - whatever works faster.
Clarifying: by prerequisite table I mean the table that the joined table needs in its join condition: ... join B on [whatever] join C on c.id=b.cid - here table B is a prerequisite for table C.
I mention left joins because, while the question is about the order of inner joins, when joins are mixed (inner and left), the order of the joins alone is important (inner joins should all come first), as it may affect query logic:
... join B on [whatever] left join C on c.id=b.cid join D on D.id = C.did
In the above example the left join sneaks into the inner joins' order. We cannot place it after D because it is a prerequisite for D. However, for records where the condition c.id=b.cid is not true, the entire C table row turns NULL, and then the entire result row (B+C+D) drops out of the results because of the D.id = C.did condition of the following inner join. This example needs review, as the purpose of the left join evaporates because of the following (next in order) inner join. In conclusion, when inner joins are mixed with left joins, it is better to put the inner joins on top without any left joins interfering.
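The effect can be seen with three tiny tables. This is a sketch run through Python's sqlite3 module; the query starts directly from B (the example above has an unnamed table before it), and the rows are made up so that one B row has no matching C row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE B (id INTEGER PRIMARY KEY, cid INTEGER);
    CREATE TABLE C (id INTEGER PRIMARY KEY, did INTEGER);
    CREATE TABLE D (id INTEGER PRIMARY KEY);
    -- Hypothetical rows: B row 2 points at a C row that does not exist.
    INSERT INTO B VALUES (1, 10), (2, 20);
    INSERT INTO C VALUES (10, 100);
    INSERT INTO D VALUES (100);
""")

def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# LEFT JOIN on its own keeps B row 2, with NULL for the C columns.
left_only = run("""
    SELECT B.id, C.id
    FROM B
    LEFT JOIN C ON C.id = B.cid
    ORDER BY B.id
""")

# A following INNER JOIN on a C column discards that row again,
# so the LEFT JOIN effectively behaves like an INNER JOIN.
left_then_inner = run("""
    SELECT B.id, C.id, D.id
    FROM B
    LEFT JOIN C ON C.id = B.cid
    JOIN D ON D.id = C.did
    ORDER BY B.id
""")

# One way to keep the unmatched row: make the dependent join a LEFT JOIN too.
left_then_left = run("""
    SELECT B.id, C.id, D.id
    FROM B
    LEFT JOIN C ON C.id = B.cid
    LEFT JOIN D ON D.id = C.did
    ORDER BY B.id
""")

print(left_only)
print(left_then_inner)
print(left_then_left)
```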

Whether Inner Queries Are Okay?

I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have a query in a query? Is it an "inner query"? A "sub-query"? Does it count as three queries (my example)? If it's bad to do so... how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but as you have it performance can vary, especially depending on your RDBMS.
It varies from database to database, especially depending on whether the columns compared are indexed or not and nullable or not, but generally, if your query is not using columns from the table joined to, you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id ))
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
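To check that the EXISTS form asks the same question as the original pair of IN subqueries, both can be run on the same sample data. This is a sketch using Python's sqlite3 module; the rows are invented, and the user id is written as the integer 1 rather than the string '1' to match the INTEGER columns declared here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, begin_on TEXT, name TEXT);
    CREATE TABLE contacts (user_id INTEGER, contact_id INTEGER);
    -- Hypothetical rows: user 1 lists user 2 as a contact, and user 3 lists user 1.
    INSERT INTO contacts VALUES (1, 2), (3, 1), (4, 5);
    INSERT INTO events VALUES
        (1, 2, '2024-01-01', 'Picnic'),
        (2, 3, '2024-01-02', 'Concert'),
        (3, 5, '2024-01-03', 'Lecture'),
        (4, 1, '2024-01-04', 'Own event');
""")

# Original form: two uncorrelated IN subqueries combined with OR.
in_rows = conn.execute("""
    SELECT events.id, events.name
    FROM events
    WHERE events.user_id IN (SELECT contacts.user_id
                             FROM contacts
                             WHERE contacts.contact_id = 1)
       OR events.user_id IN (SELECT contacts.contact_id
                             FROM contacts
                             WHERE contacts.user_id = 1)
    ORDER BY events.id
""").fetchall()

# Rewritten form: a single EXISTS over contacts.
exists_rows = conn.execute("""
    SELECT e.id, e.name
    FROM events e
    WHERE EXISTS (SELECT NULL
                  FROM contacts c
                  WHERE (c.contact_id = 1 AND c.user_id = e.user_id)
                     OR (c.user_id = 1 AND c.contact_id = e.user_id))
    ORDER BY e.id
""").fetchall()

print(in_rows)
print(exists_rows)
```

Both return the events that belong to users connected to user 1, and neither returns user 1's own event.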
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong in using them, and they can make your life easier. However, performance can often be improved by rewriting queries with subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the mysql docs on the topic.