SQL composite key grouping issue

I've got a very frustrating SQL issue which I can't for the life of me solve: a derived query needs to return a composite key while also performing a MIN() aggregate on another field in that table. If I were performing the MIN() on one of the composite key columns it would be easy, but since I need to return both keys to the outer query as well as perform the MIN(), I can't work out how to do this. The entire query looks like this:
SELECT
p.name as productname
,tmp.packageid
,tmp.price
,ppk2.packageoptionid
,ppk2.selcomproductid
FROM (
SELECT ppk.productid, ppk.packageid, MIN(ppk.price) as price
FROM product_package ppk
INNER JOIN package pk ON ppk.packageid = pk.id
INNER JOIN [plan] pl ON pk.planid = pl.id
WHERE pk.networkid = 1
GROUP BY ppk.productid, ppk.packageid
) tmp
INNER JOIN product_package ppk2 ON (
ppk2.productid = tmp.productid
AND ppk2.packageid = tmp.packageid
)
INNER JOIN product p ON (p.id = ppk2.productid)
WHERE p.isenabled = 1;
Current Results:
--------------------------------------
productid | packageid | price
----------+-----------+-------
1         | 500       | 0
1         | 501       | 19.95
1         | 502       | 29.95
2         | 501       | 0
3         | 500       | 15
3         | 504       | 39.95
Desired Results:
--------------------------------------
productid | packageid | price
----------+-----------+-------
1         | 500       | 0
2         | 501       | 0
3         | 500       | 15
The derived query "tmp" is where my issue lies as i need a unique rows back for each product/package combination with the lowest price, before joining onto the outer tables.
Any help would be greatly appreciated!

I've used this trick whenever I need a sub query along with the smallest of something. The idea is to combine the value and key together, with the value in the most significant bits, and take the MIN of that; then take it apart in the outer select.
The best way to combine two values depends on which RDBMS you're using. You don't mention which one, so I'm just providing pseudo code:
select ..., (tmp.c >> 32) price
from
(select productid, min((price << 32) | packageid) c
from product_package
where networkid=1
group by productid) tmp
inner join product_package ppk on ppk.productid=tmp.productid
and ppk.packageid=(tmp.c & 0xFFFFFFFF)
inner join product p on p.id=ppk.productid
where p.isenabled=1
<< 32 means shift the value 32 bits to the left, and | is the bitwise "or". So this is assuming packageid is defined as a 32-bit integer. The & 0xFFFFFFFF is the bitwise "and" with a hex mask covering the low 32 bits, used to return just packageid.
Depending on your RDBMS you may need to find its specific syntax for these things, or if they aren't supported you can use plain math: << 32 is equivalent to multiplying by 4294967296, >> 32 to dividing by 4294967296, and & 0xFFFFFFFF to taking the remainder modulo 4294967296. If you're using MSSQL you can use convert(binary,price)+convert(binary,packageid) to combine them and substring(..) to separate.
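As a concrete illustration, here is the plain-math variant applied to the tables from the question. This is only a sketch: it assumes price is a non-negative integer (it will not work as-is for decimal prices such as 19.95) and that packageid fits in 32 bits:
SELECT p.name AS productname,
       tmp.c % 4294967296 AS packageid,  -- low 32 bits: the packageid of the cheapest row
       tmp.c / 4294967296 AS price       -- high bits: the MIN price
FROM (
    SELECT ppk.productid,
           MIN(CAST(ppk.price AS BIGINT) * 4294967296 + ppk.packageid) AS c
    FROM product_package ppk
    INNER JOIN package pk ON ppk.packageid = pk.id
    WHERE pk.networkid = 1
    GROUP BY ppk.productid
) tmp
INNER JOIN product p ON p.id = tmp.productid
WHERE p.isenabled = 1;
A side effect of the packing is that price ties are broken deterministically: MIN picks the lowest packageid among tying rows.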

Well, I don't know the data you actually have in your table; I just have the data your query returns. You didn't reply to my comment asking for a sample of your table's data and the DBMS you're using.
However, assuming the current data of your table is the one that comes out of your query, the following query will give you the "Desired Result" you've specified:
select t1.* from t t1
left join t t2
on t1.productid = t2.productid and t1.details > t2.details
where t2.details is null
In table form, the query turns this:
+-----------+-----------+---------+
| PRODUCTID | PACKAGEID | DETAILS |
+-----------+-----------+---------+
| 1         | 500       | 0       |
| 1         | 501       | 20      |
| 1         | 502       | 30      |
| 2         | 501       | 0       |
| 3         | 500       | 15      |
| 3         | 504       | 40      |
+-----------+-----------+---------+
Into this:
+-----------+-----------+---------+
| PRODUCTID | PACKAGEID | DETAILS |
+-----------+-----------+---------+
| 1         | 500       | 0       |
| 2         | 501       | 0       |
| 3         | 500       | 15      |
+-----------+-----------+---------+
Let me know if it's clear or not.

Easy (read: expensive) way: build two views. One gets just the minimum ppk.price for each productid WHERE pk.networkid = 1, grouped by productid; call it Product_MinPrice_VIEW or whatever.
Then build a second view, Product_VIEW, that replaces all that sub-SELECT INNER JOIN work you're trying to get away with by INNER JOINing against the results of the Product_MinPrice_VIEW you just made.
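A minimal sketch of those two views, using the table names from the question (the [plan] join is omitted for brevity, and ties on the minimum price will still return one row per tying package):
CREATE VIEW Product_MinPrice_VIEW AS
SELECT ppk.productid, MIN(ppk.price) AS minprice
FROM product_package ppk
INNER JOIN package pk ON ppk.packageid = pk.id
WHERE pk.networkid = 1
GROUP BY ppk.productid;

CREATE VIEW Product_VIEW AS
SELECT p.name AS productname, ppk.packageid, ppk.price
FROM product_package ppk
INNER JOIN Product_MinPrice_VIEW mp
    ON mp.productid = ppk.productid AND mp.minprice = ppk.price
INNER JOIN product p ON p.id = ppk.productid
WHERE p.isenabled = 1;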
I swear, wrangling with sub-SELECTs, HAVINGs and GROUP BYs is tedious and error-prone; I can't stand it sometimes. Hopefully this will get you far enough to develop a solution that can later be optimized and made more correct.
FINAL ANSWER
I have an extremely similar problem in the application I'm working on, and in the meantime (while I hit this site up for a better answer), I just passed the buck and wrote some application-level code to deal with any duplicates, letting the program's logic find the true minimum when encountered. Not pretty, but then again I don't have all day to figure it out!
I'm sorry my answer couldn't help you. Good luck!

Related

Oracle SQL: LEFT JOIN -> GROUP BY COUNT(*) without incorrect count amounts

I have this SQL query which is causing me to struggle a little bit.
SELECT c.Threat_Type, COUNT(*) FROM BORE.NormalToilet c
LEFT JOIN BORE.EnragedToilet p on p.Toilet_ID = c.Toilet_ID
GROUP BY c.Threat_Type
Here's the output:
+-------------+-------+
| Threat Type | Count |
+-------------+-------+
| Portable    | 26    |
| Japanese    | 1     |
| Toilet      | 1     |
| Assassin    | 3     |
+-------------+-------+
Now here's the problem with this query:
There are, correctly, 26 vulnerabilities in our portable toilet firmware.
There is, incorrectly listed, 1 vulnerability in our Japanese toilet... there are no vulnerabilities, only requested features. This should say 0, not 1.
Same problem as above: Toilet features have no vulnerabilities.
If I try an INNER JOIN, those rows aren't included at all. That is the correct behavior (as there are no matches); however, I want it to say 0 instead of 1. Using a LEFT JOIN, it just says 1, which is heinously incorrect.
What am I doing wrong?
Credit to a_horse_with_no_name and vkp in the comments. If they do end up posting their own answer, feel free to vote to delete this answer.
Instead of count(*), which counts rows, you want count(p.Toilet_ID), which only counts the rows where p.Toilet_ID is not null.
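Applied to the query above:
SELECT c.Threat_Type, COUNT(p.Toilet_ID)  -- NULLs from unmatched LEFT JOIN rows are not counted
FROM BORE.NormalToilet c
LEFT JOIN BORE.EnragedToilet p ON p.Toilet_ID = c.Toilet_ID
GROUP BY c.Threat_Type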

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored on a server running PostgreSQL 9.3.0. However, I find myself blocked on the last query. The database models a reservation system for an opera house, and the query is about associating with a given spectator the other spectators who attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date         | tickets_presented   | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1      | 2015-08-05 17:45:03 |                     | 1       | 1            | 195   | 1
2      | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11      | 1            | 150   | 2
Spectators table
id_spectator | last_name | first_name | email                    | create_time         | age
-------------+-----------+------------+--------------------------+---------------------+-----
1            | gonzalez  | colin      | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2            | bequet    | camille    | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name             | kind  | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
--------+------------------+-------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+-----------
1       | madama butterfly | opera | 2015-09-05        | 19:30:00   | 21:30:00 | 2         | 315           | 630           | 945           | 195        | 150        | 100
2       | don giovanni     | opera | 2015-09-12        | 19:30:00   | 21:45:00 | 2         | 315           | 630           | 945           | 195        | 150        | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending. The query looks like this:
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand the problem better and hint me towards a solution? Thanks in advance.
So the result I'm expecting should be something like this:
id_spectator | other_id_spectators
-------------+--------------------
1            | 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: I wanted to make clear that this answer may be of limited use, as it was given in the context of SQL Server (the tag was present at the time).
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback here is that, since your ids are ints, placing a comma between values involves a workaround (they need to be cast to strings). Below is the method I can think of using that workaround.
SELECT [id_spectator], [id_show],
       STUFF((SELECT ',' + CAST(A.[id_spectator] AS NVARCHAR(10))
              FROM reservations A
              WHERE A.[id_show] = B.[id_show]
                AND A.[id_spectator] != B.[id_spectator]
              FOR XML PATH('')), 1, 1, '') AS [other_id_spectators]
FROM reservations B
GROUP BY [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in queries 2 and 3 below, also create an index with leading id_show, as sketched next.
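A sketch of both, with placeholder names:
ALTER TABLE reservations
    ADD CONSTRAINT reservations_spectator_show_uni UNIQUE (id_spectator, id_show);

-- helps queries 2 and 3, which look rows up by id_show first
CREATE INDEX reservations_show_idx ON reservations (id_show, id_spectator);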
1. Brute force
The primitive approach would be to form a sorted array of the shows the given user has seen and compare it with the same array for every other spectator:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
#> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine the relevant shows, then only consider those:
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
#> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indexes I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I got looks like this:
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1            | 1       | 1, 2, 9
1            | 14      | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

Performance difference: condition placed at INNER JOIN vs WHERE clause

Say I have a table order as
id | clientid | type | amount | itemid | date
---+----------+------+--------+--------+-----------
23 | 258      | B    | 150    | 14     | 2012-04-03
24 | 258      | S    | 69     | 14     | 2012-04-03
25 | 301      | S    | 10     | 20     | 2012-04-03
26 | 327      | B    | 54     | 156    | 2012-04-04
clientid is a foreign-key back to the client table
itemid is a foreign key back to an item table
type is only B or S
amount is an integer
and a table processed as
id | orderid | processed | date
---+---------+-----------+-----------
41 | 23      | true      | 2012-04-03
42 | 24      | true      | 2012-04-03
43 | 25      | false     | <NULL>
44 | 26      | true      | 2012-04-05
I need to get all the rows from order that for the same clientid on the same date have opposing type values. Keep in mind type can only have one of two values - B or S. In the example above this would be rows 23 and 24.
The other constraint is that the corresponding row in processed must be true for the orderid.
My query so far
SELECT c1.clientid,
c1.date,
c1.type,
c1.itemid,
c1.amount,
c2.date,
c2.type,
c2.itemid,
c2.amount
FROM order c1
INNER JOIN order c2 ON c1.itemid = c2.itemid AND
c1.date = c2.date AND
c1.clientid = c2.clientid AND
c1.type <> c2.type AND
c1.id < c2.id
INNER JOIN processed p1 ON p1.orderid = c1.id AND
p1.processed = true
INNER JOIN processed p2 ON p2.orderid = c2.id AND
p2.processed = true
QUESTION: Keeping the processed = true as part of the join clause is slowing the query down. If I move it to the WHERE clause then the performance is much better. This has piqued my interest and I'd like to know why.
The primary keys and respective foreign key columns are indexed while the value columns (value, processed etc) aren't.
Disclaimer: I have inherited this DB structure and the performance difference is roughly 6 seconds.
The reason you're seeing a difference is the execution plan the planner is putting together. This is obviously different depending on the query (arguably, it should be optimising the two queries to be the same, and this may be a bug). It means the planner thinks it has to work in a particular way to get to the result in each statement.
When you do it within the JOIN, the planner will probably have to select from the table, filter by the "true" part, then join the result sets. With a large table, that is a lot of data to look through, and the indexes can't be used as efficiently.
I suspect that if you put it in the WHERE clause, the planner chooses a route that is more efficient (i.e. either index-based, or a pre-filtered dataset).
You could probably make the join work just as fast (if not faster) by adding an index on the two columns, orderid and processed (multicolumn indexes are supported in PostgreSQL; covering "included" columns arrived later, in PostgreSQL 11).
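For example, a partial index is one possible way to make the processed = true lookups cheap in PostgreSQL (a sketch; the index name is a placeholder):
CREATE INDEX processed_orderid_true_idx
    ON processed (orderid)
    WHERE processed = true;  -- partial index: covers exactly the rows both joins filter on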
In short, the planner is the problem: it is choosing two different routes to get to the result sets, and one of them is not as efficient as the other. It's impossible for us to know the reasons without the full table information and the EXPLAIN ANALYZE output.
If you want specifics on why your particular query behaves this way, you'll need to provide more information. But the reason is the planner choosing different routes.
Additional Reading Material:
http://www.postgresql.org/docs/current/static/explicit-joins.html
Just skimmed it; it seems that the Postgres planner doesn't re-order joins to optimise them. Try changing the order of the joins in your statement and see if you then get the same performance... just a thought.

Access 2007 select first value of query results

I am running into a rather annoying thingy in Access (2007) and I am not sure if this is a feature or if I am asking for the impossible.
Although the actual database structure is more complex, my problem boils down to this:
I have a table with data about Units for specific years. This data comes from different sources and might overlap.
Unit | IYR  | X1  | Source |
----------------------------
A    | 2009 | 55  | 1      |
A    | 2010 | 80  | 1      |
A    | 2010 | 101 | 2      |
A    | 2010 | 150 | 3      |
A    | 2011 | 90  | 1      |
...
Now I would like the user to select certain sources, order them by priority and then extract one data value for each year.
For example, if the user selects source 1, 2 and 3 and orders them by (3, 1, 2), then I would like the following result:
Unit | IYR  | X1  | Source |
----------------------------
A    | 2009 | 55  | 1      |
A    | 2010 | 150 | 3      |
A    | 2011 | 90  | 1      |
I am able to order the initial table, based on a specific order. I do this with the following query
SELECT Unit, IYR, X1, Source
FROM TestTable
WHERE Source In (1,2,3)
ORDER BY Unit, IYR,
IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))
This gives me the following intermediate result:
Unit | IYR  | X1  | Source |
----------------------------
A    | 2009 | 55  | 1      |
A    | 2010 | 150 | 3      |
A    | 2010 | 80  | 1      |
A    | 2010 | 101 | 2      |
A    | 2011 | 90  | 1      |
Next step is to only get the first value of each year. I was thinking to use the following query:
SELECT X.Unit, X.IYR, first(X.X1) as FirstX1
FROM (...) AS X
GROUP BY X.Unit, X.IYR
Where (…) is the above query.
Now Access goes bananas. Whatever order I give to the intermediate results, the result of this query is:
Unit | IYR  | X1 |
------------------
A    | 2009 | 55 |
A    | 2010 | 80 |
A    | 2011 | 90 |
In other words, for year 2010 it shows the value of source 1 instead of 3. It seems that Access does not care about the ordering of the nested query when it applies the FIRST() function and sticks to the original ordering of the data.
Is this a feature of Access or is there a different way of achieving the desired results?
PS: The next step would be to use a self join to add the source column to the results again, but I first need to resolve the above problem.
Rather than use FIRST(), it may be better to determine the MIN priority and then join back, e.g.:
SELECT
t.UNIT,
t.IYR,
t.X1,
t.Source ,
t.PrioritySource
FROM
(SELECT
Unit,
IYR,
X1,
Source,
SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
) as t
INNER JOIN
(SELECT
Unit,
IYR,
MIN(SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3)) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
GROUP BY
Unit,
IYR ) as MinPriority
ON t.Unit = MinPriority.Unit and
t.IYR = MinPriority.IYR and
t.PrioritySource = MinPriority.PrioritySource
which will produce this result (Note I include Source and priority source for demonstration purposes only)
UNIT | IYR  | X1  | Source | PrioritySource
-------------------------------------------
A    | 2009 | 55  | 1      | 2
A    | 2010 | 150 | 3      | 1
A    | 2011 | 90  | 1      | 2
Note the first subquery is to handle the fact that Access won't let you join on a Switch
Yes, FIRST() does use an arbitrary ordering. From the Access Help:
These functions return the value of a specified field in the first or
last record, respectively, of the result set returned by a query. If
the query does not include an ORDER BY clause, the values returned by
these functions will be arbitrary because records are usually returned
in no particular order.
I don't know whether FROM (...) AS X means you are using an inline ORDER BY (assuming that is actually possible) or a VIEW (a stored Query object) here, but either way I assume the ORDER BY is being disregarded, because an ORDER BY should only apply to the final result.
The alternative is to use MIN() (or possibly MAX()).
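A sketch of that first MIN() step, reusing the IIf() priority expression from the question (the join back to pick up the remaining columns would then follow, as in the earlier answer):
SELECT Unit, IYR,
       MIN(IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))) AS PrioritySource
FROM TestTable
WHERE Source In (1,2,3)
GROUP BY Unit, IYR;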
This is the most concise way I have found to write such queries in Access that require pulling back all columns that correspond to the first row in a group of records that are ordered in a particular way.
First, I added a UniqueID to your table. In this case, it's just an AutoNumber field. You may already have a unique value in your table, in which case you can use that.
This will choose the row with a Source 3 first, then Source 1, then Source 2. If there is a tie, it picks the one with the higher X1 value. If there is a further tie, it is broken by the UniqueID value:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=
(SELECT TOP 1 [UniqueID] FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, UniqueID)
This yields:
Unit  IYR   X1   Source  UniqueID
A     2009  55   1       1
A     2010  150  3       4
A     2011  90   1       5
I recommend (1) you create an index on the IYR field -- this will dramatically increase your performance for this type of query, and (2) if you have a lot (>~100K) records, this isn't the best choice. I find it works quite well for tables in the 1-70K range. For larger datasets, I like to use my GroupIncrement function to partition each group (similar to SQL Server's ROW_NUMBER() OVER statement).
The Choose() function is a VBA function and may not be clear here. In your case, it sounds like there is some interactivity required. For that, you could create a second table called "Choices", like so:
Rank  Choice
1     3
2     1
3     2
Then, you could substitute the following:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=(SELECT TOP 1 [UniqueID] FROM
[TestTable] t2 INNER JOIN [Choices] c
ON t2.Source=c.Choice
WHERE t.IYR=t2.IYR ORDER BY c.[Rank], t2.X1 DESC, t2.UniqueID);
Indexing Source on TestTable and Choice on the Choices table may be helpful here, too, depending on the number of choices required.
Q:
Can you get this to work without the need for surrogate key? For
example what if the unique key is the composite of
{Unit,IYR,X1,Source}
A:
If you have a compound key, you can do it like this-- however I think that if you have a large dataset, it will totally kill the performance of the query. It may help to index all four columns, but I can't say for sure because I don't regularly use this method.
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.Unit & t.IYR & t.X1 & t.Source =
(SELECT TOP 1 Unit & IYR & X1 & Source FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, Unit, IYR)
In certain cases, you may have to explicitly convert some of the individual parts of the key, as follows (though Access generally converts values automatically when concatenating):
t.Unit & CStr(t.IYR) & CStr(t.X1) & CStr(t.Source)
You could also use a query in your FROM statements instead of the actual table. The query itself would build a composite of the four fields used in the key, and then you'd use the new key name in the WHERE clause of the top SELECT statement, and in the SELECT TOP 1 [key] of the subquery.
In general, though, I will either: (a) create a new table with an AutoNumber field, (b) add an AutoNumber field, (c) add an integer and populate it with a unique number using VBA - this is useful when you get a MaxLocks error when trying to add an AutoNumber, or (d) use an already indexed unique key.

Join vs. sub-query

I am an old-school MySQL user and have always preferred JOIN over sub-query. But nowadays everyone uses sub-query, and I hate it; I don't know why.
I lack the theoretical knowledge to judge for myself if there is any difference. Is a sub-query as good as a JOIN and therefore is there nothing to worry about?
Sub-queries are the logically correct way to solve problems of the form, "Get facts from A, conditional on facts from B". In such instances, it makes more logical sense to stick B in a sub-query than to do a join. It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B.
Practically speaking, however, the answer usually comes down to performance. Some optimisers suck lemons when given a join vs a sub-query, and some suck lemons the other way, and this is optimiser-specific, DBMS-version-specific and query-specific.
Historically, explicit joins usually win, hence the established wisdom that joins are better, but optimisers are getting better all the time, and so I prefer to write queries first in a logically coherent way, and then restructure if performance constraints warrant this.
In most cases JOINs are faster than sub-queries, and it is very rare for a sub-query to be faster.
With JOINs, the RDBMS can create an execution plan that is better for your query and can predict what data should be loaded to be processed, saving time; with a sub-query, it will run all the queries and load all their data before doing the processing.
The good thing about sub-queries is that they are more readable than JOINs: that's why most new SQL people prefer them; it is the easy way. But when it comes to performance, JOINs are better in most cases, even though they are not hard to read either.
Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.
In the year 2010 I would have joined the author of this question and would have strongly voted for JOIN, but with much more experience (especially in MySQL) I can state: yes, subqueries can be better. I've read multiple answers here; some stated subqueries are faster, but they lacked a good explanation. I hope I can provide one with this (very) late answer:
First of all, let me say the most important thing: there are different forms of sub-queries.
And the second important statement: size matters.
If you use sub-queries, you should be aware of how the DB server executes the sub-query; especially whether the sub-query is evaluated once, or once for every row!
On the other side, a modern DB server is able to optimize a lot. In some cases a subquery helps optimize a query, but a newer version of the DB server might make the optimization obsolete.
Sub-queries in Select-Fields
SELECT moo, (SELECT roger FROM wilco WHERE moo = me) AS bar FROM foo
Be aware that such a sub-query is executed for every resulting row from foo.
Avoid this if possible; it may drastically slow down your query on huge datasets. However, if the sub-query has no reference to foo, it can be optimized by the DB server as static content and be evaluated only once, as in the sketch below.
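A sketch of such an uncorrelated variant, using the same made-up tables:
SELECT moo,
       (SELECT MAX(roger) FROM wilco) AS bar  -- no reference to foo: can be evaluated once and reused
FROM foo;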
Sub-queries in the Where-statement
SELECT moo FROM foo WHERE bar = (SELECT roger FROM wilco WHERE moo = me)
If you are lucky, the DB optimizes this internally into a JOIN. If not, your query will become very, very slow on huge datasets, because it will execute the sub-query for every row in foo, not just for the results, as in the select-type.
Sub-queries in the Join-statement
SELECT moo, bar
FROM foo
LEFT JOIN (
    SELECT MIN(bar) AS bar, me FROM wilco GROUP BY me
) tmp ON moo = me
This is interesting. We combine a JOIN with a sub-query, and here we get the real strength of sub-queries. Imagine a dataset with millions of rows in wilco but only a few distinct me values. Instead of joining against a huge table, we now have a smaller temporary table to join against. This can result in much faster queries depending on database size. You can have the same effect with CREATE TEMPORARY TABLE ... and INSERT INTO ... SELECT ... (sketched below), which might provide better readability on very complex queries (but can lock datasets in a repeatable read isolation level).
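That temporary-table equivalent would look roughly like this (a sketch; wilco_min is a placeholder name):
CREATE TEMPORARY TABLE wilco_min AS
    SELECT MIN(bar) AS bar, me FROM wilco GROUP BY me;

SELECT f.moo, w.bar
FROM foo f
LEFT JOIN wilco_min w ON f.moo = w.me;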
Nested sub-queries
SELECT VARIANCE(moo)
FROM (
SELECT moo, CONCAT(roger, wilco) AS bar
FROM foo
HAVING bar LIKE 'SpaceQ%'
) AS temp_foo
GROUP BY moo
You can nest sub-queries in multiple levels. This can help on huge datasets if you have to group or change the results. Usually the DB-Server creates a temporary table for this, but sometimes you do not need some operations on the whole table, only on the resultset. This might provide a much better performance depending on the size of the table.
Conclusion
Sub-queries are no replacement for a JOIN, and you should not use them like that (although it is possible). In my humble opinion, the correct use of a sub-query is as a quick replacement for CREATE TEMPORARY TABLE .... A good sub-query reduces a dataset in a way you cannot accomplish in the ON clause of a JOIN. If a sub-query uses GROUP BY or DISTINCT, and is preferably not situated in the select fields or the WHERE clause, then it might improve performance a lot.
Use EXPLAIN to see how your database executes the query on your data. There is a huge "it depends" in this answer...
PostgreSQL can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other. It all depends on the data, indexes, correlation, amount of data, query, etc.
First of all, to compare the two, you should distinguish queries with subqueries into:
a class of subqueries that always have a corresponding equivalent query written with joins
a class of subqueries that cannot be rewritten using joins
For the first class of queries, a good RDBMS will see joins and subqueries as equivalent and will produce the same query plans.
These days even MySQL does that.
Still, sometimes it does not, but this does not mean that joins will always win; I have had cases where using subqueries in MySQL improved performance. (For example, if something prevents the MySQL planner from correctly estimating the cost, and the planner does not see the join variant and the subquery variant as the same, then subqueries can outperform joins by forcing a certain path.)
The conclusion is that you should test your queries in both join and subquery variants if you want to be sure which one will perform better.
For the second class, the comparison makes no sense, as those queries cannot be rewritten using joins; in those cases subqueries are the natural way to do the required task and you should not discriminate against them, as the example below illustrates.
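For instance, an aggregate over a grouped result belongs to the second class (a sketch with hypothetical orders and customer_id names); the inner GROUP BY result has to exist before the outer AVG() can run, so no plain join expresses it:
SELECT AVG(cnt) AS avg_orders_per_customer
FROM (
    SELECT customer_id, COUNT(*) AS cnt
    FROM orders
    GROUP BY customer_id
) per_customer;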
I think what has been under-emphasized in the cited answers is the issue of duplicates and the problematic results that may arise in specific use cases (although Marcelo Cantos does mention it).
I will cite an example from Stanford's Lagunita courses on SQL.
Student Table
+------+--------+------+--------+
| sID  | sName  | GPA  | sizeHS |
+------+--------+------+--------+
| 123  | Amy    | 3.9  | 1000   |
| 234  | Bob    | 3.6  | 1500   |
| 345  | Craig  | 3.5  | 500    |
| 456  | Doris  | 3.9  | 1000   |
| 567  | Edward | 2.9  | 2000   |
| 678  | Fay    | 3.8  | 200    |
| 789  | Gary   | 3.4  | 800    |
| 987  | Helen  | 3.7  | 800    |
| 876  | Irene  | 3.9  | 400    |
| 765  | Jay    | 2.9  | 1500   |
| 654  | Amy    | 3.9  | 1000   |
| 543  | Craig  | 3.4  | 2000   |
+------+--------+------+--------+
Apply Table
(applications made to specific universities and majors)
+------+----------+----------------+----------+
| sID  | cName    | major          | decision |
+------+----------+----------------+----------+
| 123  | Stanford | CS             | Y        |
| 123  | Stanford | EE             | N        |
| 123  | Berkeley | CS             | Y        |
| 123  | Cornell  | EE             | Y        |
| 234  | Berkeley | biology        | N        |
| 345  | MIT      | bioengineering | Y        |
| 345  | Cornell  | bioengineering | N        |
| 345  | Cornell  | CS             | Y        |
| 345  | Cornell  | EE             | N        |
| 678  | Stanford | history        | Y        |
| 987  | Stanford | CS             | Y        |
| 987  | Berkeley | CS             | Y        |
| 876  | Stanford | CS             | N        |
| 876  | MIT      | biology        | Y        |
| 876  | MIT      | marine biology | N        |
| 765  | Stanford | history        | Y        |
| 765  | Cornell  | history        | N        |
| 765  | Cornell  | psychology     | Y        |
| 543  | MIT      | CS             | N        |
+------+----------+----------------+----------+
Let's try to find the GPA scores for students that have applied to CS major (regardless of the university)
Using a subquery:
select GPA from Student where sID in (select sID from Apply where major = 'CS');
+------+
| GPA |
+------+
| 3.9 |
| 3.5 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
The average value for this resultset is:
select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS');
+--------------------+
| avg(GPA) |
+--------------------+
| 3.6800000000000006 |
+--------------------+
Using a join:
select GPA from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+------+
| GPA |
+------+
| 3.9 |
| 3.9 |
| 3.5 |
| 3.7 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
The average value for this resultset is:
select avg(GPA) from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+-------------------+
| avg(GPA) |
+-------------------+
| 3.714285714285714 |
+-------------------+
It is obvious that the second attempt yields misleading results in our use case, given that it counts duplicates in the computation of the average value.
It is also evident that using DISTINCT with the join-based statement will not eliminate the problem, given that it will erroneously keep one out of three occurrences of the 3.9 score. The correct outcome is to account for TWO (2) occurrences of the 3.9 score, given that we actually have TWO (2) students with that score who comply with our query criteria.
It seems that in some cases a sub-query is the safest way to go, besides any performance issues; that said, the join can be repaired, as sketched below.
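One way to make the join safe is to deduplicate the join side first (a sketch against the same Student and Apply tables):
SELECT AVG(GPA)
FROM Student
INNER JOIN (SELECT DISTINCT sID FROM Apply WHERE major = 'CS') cs
    ON Student.sID = cs.sID;
Each qualifying student now joins exactly once, so the average comes out to 3.68, matching the subquery version.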
MSDN Documentation for SQL Server says
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
so if you need something like
select * from t1 where exists (select * from t2 where t2.parent = t1.id)
try to use join instead. In other cases, it makes no difference.
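For instance, the join rewrite of that existence check could look like this; the DISTINCT guards against t1 rows being duplicated when several t2 rows match:
select distinct t1.*
from t1
inner join t2 on t2.parent = t1.id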
I say: creating functions for subqueries eliminates the problem of clutter and allows you to implement additional logic in subqueries. So I recommend creating functions for subqueries whenever possible.
Clutter in code is a big problem, and the industry has been working on avoiding it for decades.
As per my observation, there are two cases: if a table has fewer than 100,000 records, the join will work fast; but if a table has more than 100,000 records, a subquery gives the best result.
I have one table that has 500,000 records; on it I created the queries below, and their result times are:
SELECT *
FROM crv.workorder_details wd
inner join crv.workorder wr on wr.workorder_id = wd.workorder_id;
Result : 13.3 Seconds
select *
from crv.workorder_details
where workorder_id in (select workorder_id from crv.workorder)
Result : 1.65 Seconds
Run on a very large database from an old Mambo CMS:
SELECT id, alias
FROM
mos_categories
WHERE
id IN (
SELECT
DISTINCT catid
FROM mos_content
);
0 seconds
SELECT
DISTINCT mos_content.catid,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
~3 seconds
An EXPLAIN shows that they examine the exact same number of rows, but one takes 3 seconds and one is near instant. Moral of the story? If performance is important (when isn't it?), try it multiple ways and see which one is fastest.
And...
SELECT
DISTINCT mos_categories.id,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
0 seconds
Again, same results, same number of rows examined. My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does.
A general rule is that joins are faster in most cases (99%).
The more data the tables have, the slower subqueries become.
The less data the tables have, the closer subqueries come to the speed of joins.
Subqueries are simpler, easier to understand, and easier to read.
Most web and app frameworks, and their ORMs and Active Record implementations, generate queries with subqueries, because with subqueries it is easier to split responsibility, maintain code, etc.
For smaller web sites or apps subqueries are OK, but for larger web sites and apps you will often have to rewrite the generated queries as join queries, especially if a query uses many subqueries.
Some people say "some RDBMSes can rewrite a subquery to a join, or a join to a subquery, when they think one is faster than the other", but this applies only to simple cases, certainly not to the complicated queries with subqueries that actually cause performance problems.
Subqueries are generally used to return a single row as an atomic value, though they may be used to compare values against multiple rows with the IN keyword. They are allowed at nearly any meaningful point in a SQL statement, including the target list, the WHERE clause, and so on. A simple sub-query could be used as a search condition. For example, between a pair of tables:
SELECT title
FROM books
WHERE author_id = (
SELECT id
FROM authors
WHERE last_name = 'Bar' AND first_name = 'Foo'
);
Note that using a normal value operator on the results of a sub-query requires that only one field be returned. If you're interested in checking for the existence of a single value within a set of other values, use IN:
SELECT title
FROM books
WHERE author_id IN (
SELECT id FROM authors WHERE last_name ~ '^[A-E]'
);
This is obviously different from say a LEFT-JOIN where you just want to join stuff from table A and B even if the join-condition doesn't find any matching record in table B, etc.
If you're just worried about speed you'll have to check with your database and write a good query and see if there's any significant difference in performance.
MySQL version: 5.5.28-0ubuntu0.12.04.2-log
I was also under the impression that a JOIN is always better than a sub-query in MySQL, but EXPLAIN is a better way to make a judgment. Here is an example where sub-queries work better than JOINs.
Here is my query with 3 sub-queries:
EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=43) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=55) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL
ORDER BY vrl.moved_date DESC LIMIT 200;
EXPLAIN shows:
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| 1 | PRIMARY | vrl | index | PRIMARY | moved_date | 8 | NULL | 200 | Using where |
| 1 | PRIMARY | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | PRIMARY | vrlih | eq_ref | PRIMARY | PRIMARY | 9 | ranker.vrl.list_id,ranker.vrl.ontology_id,const | 1 | Using where |
| 1 | PRIMARY | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 4 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
The same query with JOINs is:
EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
LEFT JOIN list_tag lt1 ON lt1.list_id = vrl.list_id AND lt1.tag_id = 43
LEFT JOIN list_tag lt2 ON lt2.list_id = vrl.list_id AND lt2.tag_id = 55
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND lt1.list_id IS NULL AND lt2.tag_id IS NULL
ORDER BY vrl.moved_date DESC LIMIT 200;
and the output is:
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | lt3 | ref | list_tag_key,list_id,tag_id | tag_id | 5 | const | 2386 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.lt3.list_id | 1 | Using where |
| 1 | SIMPLE | vrlih | ref | PRIMARY | PRIMARY | 4 | ranker.lt3.list_id | 103 | Using where |
| 1 | SIMPLE | vrl | ref | PRIMARY | PRIMARY | 8 | ranker.lt3.list_id,ranker.vrlih.ontology_id | 65 | Using where |
| 1 | SIMPLE | lt1 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index; Not exists |
| 1 | SIMPLE | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | SIMPLE | lt2 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
A comparison of the rows column tells the difference, and the query with JOINs is "Using temporary; Using filesort".
Of course, when I run both queries, the first one is done in 0.02 secs, while the second one does not complete even after 1 min; so EXPLAIN explained these queries properly.
If I do not have the INNER JOIN on the list_tag table i.e. if I remove
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL
from the first query and correspondingly:
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403
from the second query, then EXPLAIN returns the same number of rows for both queries and both these queries run equally fast.
Subqueries have the ability to calculate aggregate functions on the fly.
For example: find the minimal price of a book, and get all books which are sold at this price.
1) Using Subqueries:
SELECT titles, price
FROM Books, Orders
WHERE price =
(SELECT MIN(price)
FROM Orders) AND (Books.ID=Orders.ID);
2) Using JOINs
SELECT MIN(price)
FROM Orders;
-----------------
2.99
SELECT titles, price
FROM Books b
INNER JOIN Orders o
ON b.ID = o.ID
WHERE o.price = 2.99;
The difference is only seen when the second joined table has significantly more data than the primary table. I had an experience like the one below:
We had a users table of one hundred thousand entries and their membership (friendship) data of about three hundred thousand entries. It was a join statement meant to fetch friends and their data, but with a great delay. It worked fine when there was only a small amount of data in the membership table. Once we changed it to use a sub-query, it worked fine.
In the meantime, though, the join queries work fine with other tables that have fewer entries than the primary table.
So I think the join and sub-query statements both work fine, and it depends on the data and the situation.
These days, many DBs can optimize subqueries and joins. Thus, you just have to examine your query using EXPLAIN and see which one is faster. If there is not much difference in performance, I prefer to use a subquery, as they are simple and easier to understand.
I am not a relational database expert, so take this with a grain of salt.
The general idea about subqueries vs. joins is the path the evaluation of the larger query takes.
In order to perform the larger query, every individual subquery has to be executed first, and its resultset is stored as a temporary table that the larger query interacts with.
This temporary table is unindexed, so any comparison requires scanning the whole resultset.
In contrast, when you use a join, all indexes are in use, so comparisons require traversing index trees (or hash tables), which is far less expensive in terms of speed.
Now, what I don't know is whether newer versions of the most popular relational engines perform the evaluation in reverse and just load the necessary elements into the temporary table, as an optimization method.
I was just thinking about the same problem, but I am using a subquery in the FROM part.
I need to connect and query from large tables; the "slave" table has 28 million records, but the result is only 128 rows. A small result from big data! I am using the MAX() function on it.
First I used LEFT JOIN, because I thought that was the correct way and MySQL could optimise it, etc.
The second time, just for testing, I rewrote it as a sub-select against the JOIN.
LEFT JOIN runtime: 1.12 s
SUB-SELECT runtime: 0.06 s
The subselect is 18 times faster than the join! The subselect looks terrible, but the result...
It depends on several factors, including the specific query you're running and the amount of data in your database. A subquery runs the inner queries first and then, from that result set, filters out the actual results, whereas a join produces the result in one go.
The best strategy is to test both the join solution and the subquery solution to get the optimized one.
If you want to speed up your query using a join:
For an "inner join/join", don't put the filter in the WHERE condition; use it in the "ON" condition instead.
E.g.:
select id, name from table1 a
join table2 b on a.name = b.name
where id = '123'
Try:
select id, name from table1 a
join table2 b on a.name = b.name and a.id = '123'
For a "Left/Right Join", don't put the filter in the "ON" condition, because a left/right join will return all rows of the preserved table regardless, so a filter there will not restrict them. Instead, use the "Where" condition, as in the sketch below.