Subquery SELECT statement vs inner join - SQL

I'm confused about these two statements: which one is faster, more commonly used, and best for memory?
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)
vs
select p.id, p.name, (select id from work where id = p.wid), (select name from work where id = p.wid)
from person p
where p.id in (somenumbers)
The whole idea behind this is: if I have a huge database and I inner join the work table to the person table, the join will take memory and cost performance, whereas the subquery version only selects one statement at a time. So which is best here?

First, the two queries are not the same. The first filters out any rows that have no matching rows in work.
A version of the first query that is equivalent to the second uses a left join:
select p.id, p.name, w.id, w.name
from person p left join
work w
on w.id = p.wid
where p.id in (somenumbers);
Then, the second query can be simplified to:
select p.id, p.name, p.wid,
(select name from work where work.id = p.wid)
from person p
where p.id in (somenumbers);
There is no reason to look up the id in work when it is already present in person.
If you want optimized queries, then you want indexes on person(id, wid, name) and work(id, name).
With these indexes, the two queries should have basically the same performance. The subquery will use the index on work for fetching the rows from work and the where clause will use the index on person. Either query should be fast and scalable.
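For reference, a minimal sketch of those indexes (the index names here are made up; adjust the syntax to your DBMS):
-- Composite indexes matching the column lists suggested above.
CREATE INDEX idx_person_id_wid_name ON person (id, wid, name);
CREATE INDEX idx_work_id_name ON work (id, name);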

The subqueries in your second example will execute once for every row, which will perform badly. That said, some optimizers may be able to convert it to a join for you - YMMV.
A good rule to follow in general is: much prefer joins to subqueries.
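If you want to see what your own optimizer actually does with the correlated subqueries, comparing the plans with EXPLAIN (available in MySQL, PostgreSQL, and others, with slightly different output) is the usual check. This is only a sketch; the IDs 1, 2, 3 stand in for "somenumbers":
EXPLAIN
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (1, 2, 3);

EXPLAIN
select p.id, p.name,
       (select id from work where work.id = p.wid),
       (select name from work where work.id = p.wid)
from person p
where p.id in (1, 2, 3);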

Joins generally give better performance compared with subqueries. A join on an integer column, or on a join column that has an index, gives the best performance.
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)

It really depends on how you want to optimize the query (including, but not limited to, adding/removing/reordering indexes).
I have found that a setup which makes the join soar might make the subquery suffer, and the opposite may also be true. So there is not much point in comparing them under the same setup.
I choose to use and optimize with joins. In my experience a join, under its best setup, rarely loses to a subquery, and it is a lot easier to read.
Our vendor stuffs the system with an extreme load of queries that use subqueries. Unless performance starts to crawl (and thanks to the optimization of my other queries, it doesn't), it simply isn't worth the effort to change them.

Join after Group by performance

Join tables and then group by multiple columns (like title) or group rows in sub-query and then join other tables?
Is the second method slow because of the lack of indexes after grouping? Should I order rows manually in the second method to trigger a merge join instead of a nested loop?
How to do it properly?
This is the first method. It became quite a mess because contragent_title and product_title are required to be in the GROUP BY in strict mode, and I work with strict GROUP BY mode only.
SELECT
    s.contragent_id,
    s.contragent_title,
    s.product_id AS sort_id,
    s.product_title AS sort_title,
    COALESCE(SUM(s.amount), 0) AS amount,
    COALESCE(SUM(s.price), 0) AS price,
    COALESCE(SUM(s.discount), 0) AS discount,
    COUNT(DISTINCT s.product_id) AS sorts_count,
    COUNT(DISTINCT s.contragent_id) AS contragents_count,
    dd.date,
    ~grouping(dd.date, s.contragent_id, s.product_id) :: bit(3) AS mask
FROM date_dimension dd
LEFT JOIN (
    SELECT
        s.id,
        s.created_at,
        s.contragent_id,
        ca.title AS contragent_title,
        p.id AS product_id,
        p.title AS product_title,
        sp.amount,
        sp.price,
        sp.discount
    FROM sales s
    LEFT JOIN sold_products sp
        ON s.id = sp.sale_id
    LEFT JOIN products p
        ON sp.product_id = p.id
    LEFT JOIN contragents ca
        ON s.contragent_id = ca.id
    WHERE s.created_at BETWEEN :caf AND :cat
      AND s.plant_id = :plant_id
      AND (s.is_cache = :is_cache OR :is_cache IS NULL)
      AND (sp.product_id = :sort_id OR :sort_id IS NULL)
) s ON dd.date = date(s.created_at)
WHERE (dd.date BETWEEN :caf AND :cat)
GROUP BY GROUPING SETS (
    (dd.date, s.contragent_id, s.contragent_title, s.product_id, s.product_title),
    (dd.date, s.contragent_id, s.contragent_title),
    (dd.date)
)
This is an example of what you are talking about:
Join, then aggregate:
select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = d.department_id
where e.first_name = 'John'
group by d.department_id;
Aggregate then join:
select d.name, coalesce(number_of_johns, 0) as number_of_johns
from departments d
left join
(
select department_id, count(*) as number_of_johns
from employees
where first_name = 'John'
group by department_id
) e on e.department_id = d.department_id;
Question
You want to know whether one is faster than the other, assuming the latter may be slower for losing the direct table links via IDs. (While every query result is a table, and hence the subquery result is one too, it is not a physical table stored in the database and therefore has no indexes.)
Thinking and guessing
Let's see what the queries do:
The first query is supposed to join all departments and employees and only keep the Johns. How will it do that? It will probably find all Johns first. If there is an index on employees(first_name), it will probably use that; otherwise it will read the full table. Then it finds the counts by department_id. If the index I just mentioned even contained the department (an index on employees(first_name, department_id)), the DBMS would now have the Johns presorted and could just count. If it doesn't, the DBMS may sort the employee rows now and count then, or use some other method for counting. And if we were looking for two names instead of just one, the compound index would be of little or no benefit compared to the mere index on first_name.
At last the DBMS will read all departments and join in the found counts. But our count result rows are not a table, so there is no index we can use. Anyway, the DBMS will either just loop over the results or have them sorted already, so the join is easy peasy.
So much for what I think the DBMS will do. There are a lot of ifs in my assumptions, and the DBMS may still have other methods to choose from, or it won't use an index at all because the tables are so small anyway, or whatever.
The second query, well, same same.
Answer
You see, we can only guess how a DBMS will approach joins with aggregations. It may or may not come up with the same execution plan for the two queries. A perfect DBMS would create the same plan, as the two queries do the same thing. A not so perfect DBMS may create different plans, but which is better we can hardly guess. Let's just rely on the DBMS to do a good job concerning this.
I am using Oracle mainly and just tried about the same thing as shown with two of my tables. It shows exactly the same execution plan for both queries. PostgreSQL is also a great DBMS. Nothing to worry about, I'd say :-)
Better focus on writing readable, maintainable queries. With these small queries there is no big difference; the first one is a tad more compact and easier to grasp, the second a tad more sophisticated.
I, personally, prefer the second query. It is good style to aggregate before joining, and such queries can easily be extended with further aggregations, which can be much more difficult with the first one. Only if I ran into performance issues would I try a different approach.
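For instance, here is a rough sketch of how the aggregate-then-join form might be extended with a second aggregation. This assumes PostgreSQL's FILTER clause, and 'Jane' is simply an invented second name for illustration:
select d.name,
       coalesce(e.number_of_johns, 0) as number_of_johns,
       coalesce(e.number_of_janes, 0) as number_of_janes
from departments d
left join
(
    select department_id,
           count(*) filter (where first_name = 'John') as number_of_johns,
           count(*) filter (where first_name = 'Jane') as number_of_janes
    from employees
    where first_name in ('John', 'Jane')
    group by department_id
) e on e.department_id = d.department_id;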

SQL Question: Does the order of the WHERE/INNER JOIN clause when interlinking table matter?

Exam Question (AQA A-level Computer Science):
[Primary keys shown by asterisks]
Athlete(*AthleteID*, Surname, Forename, DateOfBirth, Gender, TeamName)
EventType(*EventTypeID*, Gender, Distance, AgeGroup)
Fixture(*FixtureID*, FixtureDate, LocationName)
EventAtFixture(*FixtureID*, *EventTypeID*)
EventEntry(*FixtureID*, *EventTypeID*, *AthleteID*)
A list is to be produced of the names of all athletes who are competing in the fixture
that is taking place on 17/09/18. The list must include the Surname, Forename and
DateOfBirth of these athletes and no other details. The list should be presented in
alphabetical order by Surname.
Write an SQL query to produce the list.
I understand that you could do this two ways, one using a WHERE clause and the other using the INNER JOIN clause. However, I am wondering if the order matters when linking the tables.
First exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete, EventEntry, Fixture
WHERE FixtureDate = "17/09/2018"
AND Athlete.AthleteID = EventEntry.AthleteID
AND EventEntry.FixtureID = Fixture.FixtureID
ORDER BY Surname
Here is the first exemplar solution; would it still be correct if I were to switch the order of the tables in the WHERE clause? For example:
WHERE FixtureDate = "17/09/2018"
AND EventEntry.AthleteID = Athlete.AthleteID
AND Fixture.FixtureID = EventEntry.FixtureID
I have the same question for the INNER JOIN clause too; here is the second exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete
INNER JOIN EventEntry ON Athlete.AthleteID = EventEntry.AthleteID
INNER JOIN Fixture ON EventEntry.FixtureID = Fixture.FixtureID
WHERE FixtureDate = "17/09/2018"
ORDER BY Surname
Again, would it be correct if I used this order instead:
INNER JOIN EventEntry ON Fixture.FixtureID = EventEntry.FixtureID
If the order does matter, could somebody explain to me why it is in the order shown in the examples?
Some advice:
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Use table aliases that are abbreviations for the table names.
Use standard date formats!
Qualify all column names.
Then, the order of the comparisons doesn't matter for equality. I would recommend using a canonical ordering.
So, the query should look more like:
SELECT a.Surname, a.Forename, a.DateOfBirth
FROM Athlete a INNER JOIN
EventEntry ee
ON a.AthleteID = ee.AthleteID INNER JOIN
Fixture f
ON ee.FixtureID = f.FixtureID
WHERE f.FixtureDate = '2018-09-17'
ORDER BY a.Surname;
I am guessing that all the columns in the SELECT come from Athlete. If that is not true, then adjust the table aliases.
There are lots of stylistic conventions for SQL and #gordonlinoff's answer mentions some of the perennial ones.
There are a few answers to your question.
The most important is that (notionally) SQL is a declarative language - you tell it what you want it to do, not how to do it. In a procedural language (like C, or Java, or PHP), the order of execution really matters - the sequence of instructions is part of the procedure. In a declarative language, the order doesn't really matter.
This wasn't always totally true - older query optimizers seemed to like the more selective where clauses earlier on in the statement for performance reasons. I haven't seen that for a couple of decades now, so assume that's not really a thing.
Because order doesn't matter, but correctly understanding the intent of a query does, many SQL developers emphasize readability. That's why we like explicit join syntax, and meaningful aliases. And for readability, the sequence of instructions can help. I favour starting with the "most important" table, usually the one from which you're selecting most columns, and then follow a logical chain of joins from one table to the next. This makes it easier to follow the logic.
When you use inner joins, the order does not matter as long as any prerequisite table appears above/before the join that needs it. In your example both joins start from the Athlete table, so the order doesn't matter. If, however, this very query started from EventEntry (for whatever reason), then you would have to join Athlete in the first inner join, otherwise you could not join to Fixture. As recommended, it is best to use standard join syntax, and preferably to place all inner joins before all left joins. If you can't, you should review the query, because a left join that has to sit inside the group of inner joins will probably behave like an inner join: an inner join placed below it uses the left-joined table (otherwise you could have placed the left join below the inner block), so where the left join produces NULLs, the left join itself is fine but the inner join below it will cut the record.
When, however, the above cases do not apply and all inner joins can be placed in any order, only performance matters. Usually, tables with high cardinality on top perform better, although there are cases where the opposite works better. So if the order is free, you may try ordering the tables from higher to lower cardinality, or the opposite - whatever works faster.
Clarifying: by prerequisite table I mean the table needed by the joined table's condition: ... join B on [whatever] join C on c.id = b.cid - here table B is a prerequisite for table C.
I mention left joins because, while the question is about inner join order, when joins are mixed (inners and lefts) the order of the joins alone becomes important (all inner joins above), as it may affect the query logic:
... join B on [whatever] left join C on c.id=b.cid join D on D.id = C.did
In the example above, the left join sneaks into the inner join order. We cannot move it after D because C is a prerequisite for D. For rows where the condition c.id = b.cid is not true, however, the entire C row becomes NULL, and then the whole result row (B+C+D) drops out of the results because of the D.id = C.did condition of the following inner join. Such an example needs review, as the purpose of the left join evaporates because of the following (next in order) inner join. Concluding: when inner joins are mixed with lefts, it is better to keep the inner joins on top, without any left joins interfering.
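To make that last point concrete, here is a small hypothetical sketch (tables A, B, C, D and their columns are made up) of a left join being cancelled by a later inner join:
select *
from A
join B on b.id = a.bid
left join C on c.id = b.cid
join D on d.id = c.did;
-- For B rows with no match in C, the C columns become NULL,
-- and the following inner join on c.did then drops those rows entirely,
-- so the left join no longer serves its purpose.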

Using COUNT (DISTINCT..) when also using INNER JOIN to join 3 tables but Postgres keeps erroring

I need to use INNER JOINs to get a series of information and then I need to COUNT this info. I need to be able to "View all courses and the instructor taking them, the capacity of the course, and the number of members currently booked on the course."
To get all the info I have done the following query:
SELECT
C.coursename, Instructors.fname, Instructors.lname,C.maxNo, membercourse.memno
FROM Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
INNER JOIN Membercourse ON C.courseID = Membercourse.courseID;
but no matter where I put the COUNT, Postgres always tells me that whatever is outside the COUNT should be in the GROUP BY.
I have worked out how to COUNT/GROUP BY the necessary info e.g.:
SELECT courseID, COUNT (DISTINCT MC.memno)
FROM Membercourse AS MC
GROUP BY MC.courseID;
but I don't know how to combine the two!
I think what you're looking for is a subquery. I'm a SQL-Server guy (not postgresql) but the concept looks to be almost identical after some crash-course postgresql googling.
Anyway, basically, when you write a SELECT statement, you can use a subquery instead of an actual table. So your SQL would look something like:
select count(*)
from
(
    select stuff from table
    inner join someOtherTable on ...
) sub
... hopefully that makes sense. Instead of trying to write one big query where you're doing both the inner join and count, you're writing two: an inner one that gets your inner-join'ed data, and then an outer one to actually count the rows.
EDIT: To help explain a bit more on the thought process behind subqueries.
Subqueries are a way of logically breaking down the steps/processes on the data. Instead of trying to do everything in one big step, you do it in steps.
In this case, what's step one? It's to get a combined data source for your combined, inner-join'ed data.
Step 1: Write the Inner Join query
SELECT
C.coursename, Instructors.fname, Instructors.lname,C.maxNo,
membercourse.memno
FROM Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
INNER JOIN Membercourse ON C.courseID = Membercourse.courseID;
Okay, now, what next?
Well, let's say we want to get a count of how many entries there are for each 'memno' in that result above.
Instead of trying to figure out how to modify that query above, we instead use it as a data source, like it was a table itself.
Step 2 - Make it A Subquery
select * from
(
SELECT
C.coursename, Instructors.fname, Instructors.lname,C.maxNo,
membercourse.memno
FROM Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
INNER JOIN Membercourse ON C.courseID = Membercourse.courseID
) mySubQuery
Step 3 - Modify your outer query to get the data you want.
Well, we wanted to group by 'memno', and get the count, right? So...
select memno, count(*)
from
(
-- all that same subquery stuff
) mySubQuery
group by memno
... make sense? Once you've got your subquery written out, you don't need to worry about it any more - you just treat it like a table you're working with.
This is actually incredibly important, and makes it much easier to read more intricate queries - especially since you can name your subqueries in a way that explains what the subquery represents data-wise.
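Putting those steps together with the actual subquery from step 2, the full query would look something like this (grouped by memno, as above; to count per course instead, you would also need to select courseID in the subquery and group by that):
select memno, count(*) as rows_per_member
from
(
    SELECT
        C.coursename, Instructors.fname, Instructors.lname, C.maxNo,
        Membercourse.memno
    FROM Courses AS C
    INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
    INNER JOIN Membercourse ON C.courseID = Membercourse.courseID
) mySubQuery
group by memno;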
There are many ways to solve this, such as using window functions and so on. But you can also achieve it using a simple correlated subquery:
SELECT
C.coursename,
Instructors.fname,
Instructors.lname,
C.maxNo,
(SELECT
COUNT(*)
FROM
membercourse
WHERE
C.courseID = Membercourse.courseID) AS members
FROM
Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo;
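If you would rather avoid the correlated subquery, a single grouped query should also work. This is only a sketch using the column names from the question, with a LEFT JOIN so that courses with no bookings still show up with a count of 0:
SELECT
    C.coursename,
    Instructors.fname,
    Instructors.lname,
    C.maxNo,
    COUNT(DISTINCT Membercourse.memno) AS members
FROM Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
LEFT JOIN Membercourse ON C.courseID = Membercourse.courseID
GROUP BY C.courseID, C.coursename, Instructors.fname, Instructors.lname, C.maxNo;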

Excluding rows from result set, LEFT JOIN and EXCEPT

When you have two tables, and want to exclude rows from the second one, there are a multitude of options including EXISTS, NOT IN, LEFT JOIN and EXCEPT.
I've always used left join:
select N.ProductID from NewProducts N
left join Products P on P.ProductID = N.ProductID
where P.ProductID is null
Now I'm thinking it's cleaner to use EXCEPT:
select ProductID from NewProducts
except
select ProductID from Products
Are there performance issues of using EXCEPT?
You can check the execution plan and use SQL Profiler to choose the most suitable query.
But, for me, NOT EXISTS is good.
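For reference, the NOT EXISTS form for the tables in the question would look roughly like this:
select n.ProductID
from NewProducts n
where not exists (
    select 1
    from Products p
    where p.ProductID = n.ProductID
);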
The answer to your question is really up to you, depending on how large the data is.
You can use any of those (EXISTS, NOT IN, LEFT JOIN and EXCEPT) depending on your requirement.
You said that you always use LEFT JOIN, and that is good, because joining the two tables will minimize the execution time of the query, especially when you are holding a large amount of data.
A JOIN is advisable, but it always depends on your case.
You can see the difference in execution time using the SQL execution plan.

Optimizing for an OR in a Join in MySQL

I've got a pretty complex query in MySQL that slows down drastically when one of the joins is done using an OR. How can I speed this up? The relevant join is:
LEFT OUTER JOIN publications p ON p.id = virtual_performances.publication_id
OR p.shoot_id = shoots.id
Removing either condition from the OR decreases the query time from 1.5s to 0.1s. There are already indexes on all the relevant columns I can think of. Using EXPLAIN I've discovered that once the OR comes into play, MySQL ends up not using any of the indexes. Is there a special kind of index I can create that it will use? Any ideas?
This is a common difficulty with MySQL. Using OR baffles the optimizer because it doesn't know how to use an index to find a row where either condition is true.
I'll try to explain: Suppose I ask you to search a telephone book and find every person whose last name is 'Thomas' OR whose first name is 'Thomas'. Even though the telephone book is essentially an index, you don't benefit from it -- you have to search through page by page because it's not sorted by first name.
Keep in mind that in MySQL, any instance of a table in a given query can make use of only one index, even if you have defined multiple indexes in that table. A different query on that same table may use another index if the optimizer reasons that it's more helpful.
One technique people have used to help in situations like yours is to do a UNION of two simpler queries that each make use of separate indexes:
SELECT ...
FROM virtual_performances v
JOIN shoots s ON (...)
LEFT OUTER JOIN publications p ON (p.id = v.publication_id)
UNION ALL
SELECT ...
FROM virtual_performances v
JOIN shoots s ON (...)
LEFT OUTER JOIN publications p ON p.shoot_id = s.id;
Make two joins on the same table (adding aliases to separate them) for the two conditions, and see if that is faster.
select ..., coalesce(p1.field, p2.field) as field
from ...
left join publications p1 on p1.id = virtual_performances.publication_id
left join publications p2 on p2.shoot_id = shoots.id
You can also try something like this on for size:
SELECT * FROM tablename WHERE id IN
(SELECT p.id FROM tablename LEFT OUTER JOIN publications p ON p.id = virtual_performances.publication_id)
OR id IN
(SELECT p.id FROM tablename LEFT OUTER JOIN publications p ON p.shoot_id = shoots.id);
It's a bit messier, and won't be faster in every case, but MySQL is good at selecting from straight data sets, so repeating yourself isn't so bad.