Join after Group by performance - sql

Join tables and then group by multiple columns (like title) or group rows in sub-query and then join other tables?
Is the second method slow because of lack of indexes after grouping? Should I order rows manually for second method to trigger merge join instead of nested loop?
How to do it properly?
This is the first method. It became quite a mess because contragent_title and product_title are required to be in GROUP BY in strict mode, and I work with strict GROUP BY mode only.
SELECT
    s.contragent_id,
    s.contragent_title,
    s.product_id AS sort_id,
    s.product_title AS sort_title,
    COALESCE(SUM(s.amount), 0) AS amount,
    COALESCE(SUM(s.price), 0) AS price,
    COALESCE(SUM(s.discount), 0) AS discount,
    COUNT(DISTINCT s.product_id) AS sorts_count,
    COUNT(DISTINCT s.contragent_id) AS contragents_count,
    dd.date,
    ~grouping(dd.date, s.contragent_id, s.product_id) :: bit(3) AS mask
FROM date_dimension dd
LEFT JOIN (
    SELECT
        s.id,
        s.created_at,
        s.contragent_id,
        ca.title AS contragent_title,
        p.id AS product_id,
        p.title AS product_title,
        sp.amount,
        sp.price,
        sp.discount
    FROM sales s
    LEFT JOIN sold_products sp
        ON s.id = sp.sale_id
    LEFT JOIN products p
        ON sp.product_id = p.id
    LEFT JOIN contragents ca
        ON s.contragent_id = ca.id
    WHERE s.created_at BETWEEN :caf AND :cat
        AND s.plant_id = :plant_id
        AND (s.is_cache = :is_cache OR :is_cache IS NULL)
        AND (sp.product_id = :sort_id OR :sort_id IS NULL)
) s ON dd.date = date(s.created_at)
WHERE (dd.date BETWEEN :caf AND :cat)
GROUP BY GROUPING SETS (
    (dd.date, s.contragent_id, s.contragent_title, s.product_id, s.product_title),
    (dd.date, s.contragent_id, s.contragent_title),
    (dd.date)
)
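For comparison, here is a minimal sketch of the second method (untested, assuming the same schema and parameters): the GROUPING SETS move into a subquery over the raw tables, and the title tables are joined afterwards, so the titles no longer have to appear in GROUP BY.

SELECT
    g.date,
    g.contragent_id,
    ca.title AS contragent_title,
    g.sort_id,
    p.title AS sort_title,
    g.amount,
    g.price,
    g.discount,
    g.sorts_count,
    g.contragents_count,
    g.mask
FROM (
    SELECT
        dd.date,
        s.contragent_id,
        sp.product_id AS sort_id,
        COALESCE(SUM(sp.amount), 0) AS amount,
        COALESCE(SUM(sp.price), 0) AS price,
        COALESCE(SUM(sp.discount), 0) AS discount,
        COUNT(DISTINCT sp.product_id) AS sorts_count,
        COUNT(DISTINCT s.contragent_id) AS contragents_count,
        ~grouping(dd.date, s.contragent_id, sp.product_id) :: bit(3) AS mask
    FROM date_dimension dd
    LEFT JOIN sales s
        ON dd.date = date(s.created_at)
        AND s.plant_id = :plant_id
        AND (s.is_cache = :is_cache OR :is_cache IS NULL)
    LEFT JOIN sold_products sp
        ON s.id = sp.sale_id
        AND (sp.product_id = :sort_id OR :sort_id IS NULL)
    WHERE dd.date BETWEEN :caf AND :cat
    GROUP BY GROUPING SETS (
        (dd.date, s.contragent_id, sp.product_id),
        (dd.date, s.contragent_id),
        (dd.date)
    )
) g
LEFT JOIN contragents ca ON g.contragent_id = ca.id
LEFT JOIN products p ON g.sort_id = p.id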

This is an example of what you are talking about:
Join, then aggregate:
select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = d.department_id
where e.first_name = 'John'
group by d.department_id;
Aggregate then join:
select d.name, coalesce(number_of_johns, 0) as number_of_johns
from departments d
left join
(
select department_id, count(*) as number_of_johns
from employees
where first_name = 'John'
group by department_id
) e on e.department_id = d.department_id;
Question
You want to know whether one is faster than the other, assuming the latter may be slower for losing the direct table links via IDs. (While every query result is a table, and hence the subquery result is one too, it is no physical table stored in the database and hence has no indexes.)
Thinking and guessing
Let's see what the queries do:
The first query is supposed to join all departments and employees and only keep the Johns. How will it do that? It will probably find all Johns first. If there is an index on employees(first_name), it will probably use that; otherwise it will read the full table. Then it finds the counts by department_id. If the index I mentioned even contained the department (an index on employees(first_name, department_id)), the DBMS would now have the Johns presorted and could just count. If it doesn't, the DBMS may sort the employee rows now and then count, or use some other method for counting. And if we were looking for two names instead of just one, the compound index would be of little or no benefit compared to the mere index on first_name.
At last the DBMS will read all departments and join the found counts. But our count result rows are not a table, so there is no index we can use. Anyway, the DBMS will either just loop over the results or have them sorted anyway, so the join is easy peasy.
So far for what I think the DBMS will do. There are a lot of ifs in my assumptions, and the DBMS may still have other methods to choose from, or it won't use an index at all because the tables are so small anyway, or whatever.
The second query, well, same same.
Answer
You see, we can only guess how a DBMS will approach joins with aggregations. It may or may not come up with the same execution plan for the two queries. A perfect DBMS would create the same plan, as the two queries do the same thing. A not so perfect DBMS may create different plans, but which is better we can hardly guess. Let's just rely on the DBMS to do a good job concerning this.
I am using Oracle mainly and just tried about the same thing as shown with two of my tables. It shows exactly the same execution plan for both queries. PostgreSQL is also a great DBMS. Nothing to worry about, I'd say :-)
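If in doubt, you can check for your own tables and data. In PostgreSQL, for example, EXPLAIN prints the chosen plan, and EXPLAIN ANALYZE also runs the query and adds actual timings - run it once per variant and compare:

explain analyze
select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = d.department_id
where e.first_name = 'John'
group by d.department_id;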
Better focus on writing readable, maintainable queries. With these small queries there is no big difference; the first one is a tad more compact and easier to grasp, the second a tad more sophisticated.
I, personally, prefer the second query. It is good style to aggregate before joining, and such queries can easily be extended with further aggregations, which can be much more difficult with the first one. Only if I ran into performance issues would I try a different approach.
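To illustrate the extensibility point: counting a second name (a made-up extension) is just one more pre-aggregated subquery in the second style, whereas in the first style the two joins would multiply each other's rows and inflate the counts unless you switch to COUNT(DISTINCT ...):

select d.name,
       coalesce(j.number_of_johns, 0) as number_of_johns,
       coalesce(m.number_of_marys, 0) as number_of_marys
from departments d
left join (
    select department_id, count(*) as number_of_johns
    from employees
    where first_name = 'John'
    group by department_id
) j on j.department_id = d.department_id
left join (
    select department_id, count(*) as number_of_marys
    from employees
    where first_name = 'Mary'
    group by department_id
) m on m.department_id = d.department_id;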

Related

How to re-write SQL query to be more efficient?

I've got a query that's decently sized on its own, but there's one section of it that turns it into something ridiculously large (billions of rows returned type thing).
There must be a better way to write it than what I have done.
To simplify the section of the query in question, it takes the client details from one table and tries to find the most recent transaction dates in their savings and spending accounts (not the actual situation, but close enough).
I've joined it with left joins because if someone (for example) doesn't have a savings account, I still want the client details to pop up. But when there's hundreds of thousands of clients with tens of thousands of transactions, it's a little slow to run.
select c.client_id, max(e.transaction_date), max(s.transaction_date)
from client_table c
left join everyday_account e
    on c.client_id = e.client_id
left join savings_account s
    on c.client_id = s.client_id
group by c.client_id
I'm still new to this so I'm not great at knowing how to optimise things, so is there anything I should be looking at? Perhaps different joins, or something other than max()?
I've probably missed some key details while trying to simplify it, let me know if so!
Sometimes aggregating first, then joining to the aggregated result is faster. But this depends on the actual DBMS being used and several other factors.
select c.client_id, e.max_everyday_transaction_date, s.max_savings_transaction_date
from client_table c
left join (
    select client_id, max(transaction_date) as max_everyday_transaction_date
    from everyday_account
    group by client_id
) e on c.client_id = e.client_id
left join (
    select client_id, max(transaction_date) as max_savings_transaction_date
    from savings_account
    group by client_id
) s on c.client_id = s.client_id
The indexes suggested by Tim Biegeleisen should help in this case as well.
But as the query has to process all rows from all tables, there is no good way to speed up this query other than throwing more hardware at it. If your database supports it, make sure parallel query is enabled; this distributes the total work over multiple threads in the backend, which can substantially improve query performance if the I/O system can keep up.
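In PostgreSQL, for example, per-query parallelism is capped by the max_parallel_workers_per_gather setting (a session-level sketch; sensible values depend on your hardware):

set max_parallel_workers_per_gather = 4;   -- allow up to 4 workers per Gather node

explain
select client_id, max(transaction_date)
from everyday_account
group by client_id;                        -- check whether a parallel scan appears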
There are no WHERE or HAVING clauses, which basically means there is no explicit filtering in your SQL query. However, we can still try to optimize the joins using appropriate indices. Consider:
CREATE INDEX idx1 ON everyday_account (client_id, transaction_date);
CREATE INDEX idx2 ON savings_account (client_id, transaction_date);
These two indices, if chosen for use, should speed up the two left joins in your query. I also cover the transaction_date in both cases, in case that might help.
Side note: You might want to also consider just having a single table containing all customer accounts. Include a separate column which distinguishes between everyday and savings accounts.
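A rough sketch of that layout (all names made up for illustration):

CREATE TABLE account_transactions (
    client_id        INT  NOT NULL,
    account_type     TEXT NOT NULL,   -- e.g. 'everyday' or 'savings'
    transaction_date DATE NOT NULL
);

CREATE INDEX idx3 ON account_transactions (client_id, account_type, transaction_date);

The original query then collapses into a single aggregation over one table:

SELECT client_id, account_type, MAX(transaction_date)
FROM account_transactions
GROUP BY client_id, account_type;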
I would suggest correlated subqueries:
select client_id,
(select max(e.transaction_date)
from everyday_account e
where c.client_id = e.client_id
),
(select max(s.transaction_date)
from savings_account s
where c.client_id = s.client_id
)
from client_table c;
Along with indexes on everyday_account(client_id, transaction_date desc) and savings_account(client_id, transaction_date desc).
The subqueries should basically be index lookups (or very limited index scans), with no additional joining needed.
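Spelled out as DDL (the index names are placeholders):

create index everyday_account_client_date on everyday_account (client_id, transaction_date desc);
create index savings_account_client_date on savings_account (client_id, transaction_date desc);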

SQL Question: Does the order of the WHERE/INNER JOIN clauses matter when interlinking tables?

Exam Question (AQA A-level Computer Science):
[Primary keys shown by asterisks]
Athlete(*AthleteID*, Surname, Forename, DateOfBirth, Gender, TeamName)
EventType(*EventTypeID*, Gender, Distance, AgeGroup)
Fixture(*FixtureID*, FixtureDate, LocationName)
EventAtFixture(*FixtureID*, *EventTypeID*)
EventEntry(*FixtureID*, *EventTypeID*, *AthleteID*)
A list is to be produced of the names of all athletes who are competing in the fixture
that is taking place on 17/09/18. The list must include the Surname, Forename and
DateOfBirth of these athletes and no other details. The list should be presented in
alphabetical order by Surname.
Write an SQL query to produce the list.
I understand that you could do this two ways, one using a WHERE clause and the other using the INNER JOIN clause. However, I am wondering if the order matters when linking the tables.
First exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete, EventEntry, Fixture
WHERE FixtureDate = "17/09/2018"
AND Athlete.AthleteID = EventEntry.AthleteID
AND EventEntry.FixtureID = Fixture.FixtureID
ORDER BY Surname
Here is the first exemplar solution, would it still be correct if I was to switch the order of the tables in the WHERE clause, for example:
WHERE FixtureDate = "17/09/2018"
AND EventEntry.AthleteID = Athlete.AthleteID
AND Fixture.FixtureID = EventEntry.FixtureID
I have the same question for the INNER JOIN clause too; here is the second exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete
INNER JOIN EventEntry ON Athlete.AthleteID = EventEntry.AthleteID
INNER JOIN Fixture ON EventEntry.FixtureID = Fixture.FixtureID
WHERE FixtureDate = "17/09/2018"
ORDER BY Surname
Again, would it be correct if I used this order instead:
INNER JOIN EventEntry ON Fixture.FixtureID = EventEntry.FixtureID
If the order does matter, could somebody explain to me why it is in the order shown in the examples?
Some advice:
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Use table aliases that are abbreviations for the table names.
Use standard date formats!
Qualify all column names.
Then, the order of the comparisons doesn't matter for equality. I would recommend using a canonical ordering.
So, the query should look more like:
SELECT a.Surname, a.Forename, a.DateOfBirth
FROM Athlete a INNER JOIN
EventEntry ee
ON a.AthleteID = ee.AthleteID INNER JOIN
Fixture f
ON ee.FixtureID = f.FixtureID
WHERE f.FixtureDate = '2018-09-17'
ORDER BY a.Surname;
I am guessing that all the columns in the SELECT come from Athlete. If that is not true, then adjust the table aliases.
There are lots of stylistic conventions for SQL, and Gordon Linoff's answer mentions some of the perennial ones.
There are a few answers to your question.
The most important is that (notionally) SQL is a declarative language - you tell it what you want it to do, not how to do it. In a procedural language (like C, or Java, or PHP), the order of execution really matters - the sequence of instructions is part of the procedure. In a declarative language, the order doesn't really matter.
This wasn't always totally true - older query optimizers seemed to like the more selective where clauses earlier on in the statement for performance reasons. I haven't seen that for a couple of decades now, so assume that's not really a thing.
Because order doesn't matter, but correctly understanding the intent of a query does, many SQL developers emphasize readability. That's why we like explicit join syntax, and meaningful aliases. And for readability, the sequence of instructions can help. I favour starting with the "most important" table, usually the one from which you're selecting most columns, and then follow a logical chain of joins from one table to the next. This makes it easier to follow the logic.
When you use inner joins, order does not matter as long as the prerequisite table comes above/before. In your example both joins start from table Athlete, so order doesn't matter. If however this very query had to start from EventEntry (for any reason), then you would have to join Athlete in the first inner join, or else you could not join to Fixture. As recommended, it is best to use standard join syntax, and preferably to place all inner joins before all left joins. If you can't, you need to review the query, because a left join that has to sit inside the group of inner joins will probably behave like an inner join: an inner join below it uses the left-joined table (otherwise you could place it below the inner block), so where the left side comes up null, the left join itself is fine, but the inner join below cuts the record.
When, however, the above cases do not exist/affect the order, and all inner joins can be placed in any order, only performance matters. Usually placing tables with high cardinality on top performs better, although there are cases where the opposite works better. So if the order is free, you may try ordering the tables from higher to lower cardinality, or the opposite - whatever works faster.
Clarifying: by prerequisite table I mean the table needed by the joined table's condition: ... join B on [whatever] join C on c.id=b.cid - here table B is a prerequisite for table C.
I mention left joins because, while the question is about inner join order, when joins are mixed (inners and lefts) the order of the joins alone becomes important (all inners are best kept above), as it may affect the query logic:
... join B on [whatever] left join C on c.id=b.cid join D on D.id = C.did
In the above example the left join sneaks into the inner join order. We cannot place it after D, because it is a prerequisite for D. For records where the condition c.id=b.cid is not true, however, the entire C table row turns null, and then the entire result row (B+C+D) drops out of the results because of the D.id = C.did condition of the following inner join. This example needs review, as the purpose of the left join evaporates through the following (next in order) inner join. Concluding: the inner joins, when mixed with lefts, are better kept on top, without any left joins interfering.
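If the left join really is needed, one way to keep it optional while still requiring D to match C is to nest the inner join inside the left join, so that C and D stand or fall together (a sketch reusing the letters above; the column names are made up):

select *
from a
join b on b.aid = a.id        -- inner join, as before
left join (
    c
    join d on d.id = c.did    -- C and D are first joined to each other
) on c.id = b.cid             -- then the whole pair is left-joined as one unit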

Joins and Subqueries

I am aware that correlated subqueries use a "where" clause and not joins.
But I wonder: if a "where" clause and an inner join can have the same outcome, then why can't we use these queries with joins?
For example,
SELECT FirstName, LastName, (SELECT COUNT(O.Id) FROM [Order] O WHERE O.CustomerId = C.Id) As OrderCount
FROM Customer C
Now, why can't we write down this like below?
SELECT FirstName, LastName, (SELECT COUNT(O.Id) FROM [Order] O Inner Join
C On O.CustomerId = C.Id) As OrderCount
FROM Customer C
I know SQL quite well and have worked with it quite a bit, but I am just looking for a clear technical explanation.
Thanks.
This is your query:
SELECT
FirstName,
LastName,
(
SELECT COUNT(O.Id)
FROM [Order] O
INNER JOIN C On O.CustomerId = C.Id
) AS OrderCount
FROM Customer C;
It is invalid, because in the sub query you are selecting from C.
This is a bit complicated to explain. In a query, we deal with tables and table rows. E.g.:
select person.name from person;
FROM person means "from the table person". person.name means "a person's name", so it is referring to a row. It would be great if we could write:
select person.name from persons;
but SQL doesn't know about singular and plural in your language, so this is not possible.
In your query FROM Customer C means "from the customer table, which I'm going to call C for short". But in the rest of the query including the sub query it is one customer row the C refers to. So you cannot say INNER JOIN C, because you can only join to a table, not a table row.
One might try to make this clear by using plural names for tables and singular names as table aliases. If you made it a habit, you'd have FROM Customers Customer in your main query and INNER JOIN Customer in your inner query, and you'd notice from your habits that you cannot have a singular in the FROM clause. But well, one quickly gets accustomed to that double meaning (row and table) of a table name in a query, so this would just be kind of over-defensive, and we rather use alias names to make queries shorter and more readable, just as you are doing by abbreviating customer to c.
But yes, you can use joins instead of sub queries in the SELECT clause. Either move the sub query to the FROM clause:
SELECT
c.firstname,
c.lastname,
COALESCE(o.ordercount, 0) AS ordercount
FROM customer c
LEFT JOIN
(
SELECT customerid, COUNT(*) AS ordercount
FROM [order]
GROUP BY customerid
) o ON o.customerid = c.id;
Or join without a sub query:
SELECT
c.firstname,
c.lastname,
COUNT(o.customerid) AS ordercount
FROM customer c
LEFT JOIN [order] o ON o.customerid = c.id
GROUP BY c.id, c.firstname, c.lastname;
The two queries are functionally equivalent. SQL (in the context of queries) is a declarative language, which means it works by DEFINING WHAT you want to achieve, not HOW to achieve it. So, at the abstract algebraic level, there is absolutely no difference between the two queries. (*)
However, SQL does not work in the metaphysical realm of algebra but in the real world, where the declarative language needs to be transposed into a procedural sequence of operations: it is much easier for me to decide that the two queries are equivalent than it is for the RDBMS of your choice. Computing the closure of a declarative SQL query can be incredibly computationally difficult. This is done by what is usually called the "query optimizer", whose function is not only to "understand" the relational algebra but also to find the probabilistically best way to implement it procedurally. Therefore, depending on the accuracy of the optimizer, the intricacy of your schema and query, and the amount of computational resources the optimizer allocates to closing and optimizing the execution plan, the actual execution plans for the two otherwise equivalent queries can be different. You will still get the same results (as long as you stay in the declarative realm and don't use NOW(), RAND() or other volatile state semantics), but one plan may be faster and another slower. Also, the order of results may differ where ORDER BY is missing or ambiguous.
Note: your join can be rewritten this way because it involves an aggregate on one side of the join. Not all joins can be transposed into subqueries, but there are plenty of other queries that are equivalent although expressed differently. My answer is absolutely generic for any mathematically equivalent queries. See also the explanation below.
(*) Query equivalence also depends on the schema. One usual enemy of common sense is NULL values: while a join will filter out null values if there is any condition on them, aggregates behave in various other ways: SUM will be null if all its inputs are, MAX/MIN will ignore nulls, COUNT will count anything, COUNT(DISTINCT) nobody knows what it will do, etc.
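A quick illustration of those NULL pitfalls (values made up):

WITH t(x) AS (VALUES (1), (NULL), (2), (NULL))
SELECT COUNT(*)          AS all_rows,    -- 4: counts rows, NULLs included
       COUNT(x)          AS non_nulls,   -- 2: NULLs are ignored
       COUNT(DISTINCT x) AS distinct_x,  -- 2: NULLs are ignored here too
       SUM(x)            AS sum_x,       -- 3: would be NULL if every x were NULL
       MAX(x)            AS max_x        -- 2: NULLs are ignored
FROM t;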

Sub query select statement vs inner join

I'm confused about these two statements: which is faster and more common to use, and which is best for memory?
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)
vs
select p.id, p.name, (select id from work where id=p.wid) , (select name from work where id=p.wid)
from person p
where p.id in (somenumbers)
The whole idea of this is that if I have a huge database and I want to make an inner join, it will take memory and perform worse to join the work table and the person table, while the subquery select statements only select one value at a time. So which is the best here?
First, the two queries are not the same. The first filters out any rows that have no matching rows in work.
The equivalent first query uses a left join:
select p.id, p.name, w.id, w.name
from person p left join
work w
on w.id = p.wid
where p.id in (somenumbers);
Then, the second query can be simplified to:
select p.id, p.name, p.wid,
(select w.name from work w where w.id = p.wid)
from person p
where p.id in (somenumbers);
There is no reason to look up the id in work when it is already present in person.
If you want optimized queries, then you want indexes on person(id, wid, name) and work(id, name).
With these indexes, the two queries should have basically the same performance. The subquery will use the index on work for fetching the rows from work and the where clause will use the index on person. Either query should be fast and scalable.
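As DDL, that would be something like (the index names are placeholders):

create index person_id_wid_name on person (id, wid, name);
create index work_id_name on work (id, name);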
The subqueries in your second example will execute once for every row, which will perform badly. That said, some optimizers may be able to convert it to a join for you - YMMV.
A good rule to follow in general is: much prefer joins to subqueries.
Joins give better performance compared with subqueries. A join on an INT column, or an index on the join column, gives the best performance.
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)
It really depends on how you want to optimize the query (including, but not limited to, adding/removing/reordering indexes).
I found that a setup which makes the join soar might make the subquery suffer, and the opposite may also be true. Thus there is not much point in comparing them under the same setup.
I choose to use and optimize with joins. In my experience a join, in its best-condition setup, rarely loses to a subquery, and it is a lot easier to read.
When a vendor stuffs an extreme load of queries with subqueries into the system, unless performance starts to crawl (thanks to my other query-optimization work), it simply isn't worth the effort to change them.

Nested Query or Joins

I have heard that joins should be preferred over nested queries. Is that true in general? Or might there be scenarios where one would be faster than the other?
For example, which is the more efficient way to write this query?
Select emp.salary
from employee emp
where emp.id = (select s.id from sap s where s.id = 111)
OR
Select emp.salary
from employee emp
INNER JOIN sap s ON emp.id = s.id
WHERE s.id = 111
I have heard joins should be preferred over nested queries. Is it true in general?
It depends on the requirements, and the data.
Using a JOIN risks duplicating the information in the result set for the parent table if there is more than one child record related to it, because a JOIN returns the rows that match. This means that if you want unique values from the parent table while using JOINs, you need to look at using either DISTINCT or a GROUP BY clause. None of this is a concern if a subquery is used.
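To make the duplication risk concrete, here is a sketch borrowing the departments/employees tables from the first question on this page: a department with three Johns comes back three times through the join, so DISTINCT (or GROUP BY) is needed, while the subquery form returns each department at most once by construction.

-- join: one output row per matching employee, hence the DISTINCT
SELECT DISTINCT d.name
FROM departments d
INNER JOIN employees e ON e.department_id = d.department_id
WHERE e.first_name = 'John';

-- subquery: at most one output row per department, no deduplication needed
SELECT d.name
FROM departments d
WHERE d.department_id IN (SELECT e.department_id
                          FROM employees e
                          WHERE e.first_name = 'John');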
Also, subqueries are not all the same. There's the straight evaluation, like your example:
where emp.id = (select s.id from sap s where s.id = 111)
...and the IN clause:
where emp.id IN (select s.id from sap s where s.id = 111)
...which will match any of the value(s) returned by the subquery, whereas the straight evaluation will throw an error if s.id returns more than one value. But there's also the EXISTS clause...
WHERE EXISTS (SELECT NULL
              FROM sap s
              WHERE emp.id = s.id
                AND s.id = 111)
The EXISTS is different in that:
the SELECT clause doesn't get evaluated - you can change it to SELECT 1/0, which should trigger a divide-by-zero error but won't
it returns true/false: true as soon as the first row satisfies the criteria, so it's faster when dealing with duplicates.
unlike the IN clause, EXISTS supports comparing two or more columns at the same time, though some databases do support tuple comparison with IN (see the sketch after this list)
it's more readable
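A sketch of that multi-column comparison, with a made-up dept column added to both tables:

-- tuple comparison with IN (supported by e.g. PostgreSQL, MySQL and Oracle, but not SQL Server)
SELECT emp.salary
FROM employee emp
WHERE (emp.id, emp.dept) IN (SELECT s.id, s.dept FROM sap s);

-- the EXISTS form of the same check works everywhere
SELECT emp.salary
FROM employee emp
WHERE EXISTS (SELECT NULL
              FROM sap s
              WHERE s.id = emp.id
                AND s.dept = emp.dept);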
If the queries are logically equivalent, then the query optimizer should be able to make the same (best) execution plan from each one. In that case, query style should support what can be understood the best (that's subqueries for me).
It is much faster (and easier to write) to join two tables on an index than to run two separate queries (even a subquery).