Is the following the most efficient in SQL to achieve its result:
SELECT *
FROM Customers
WHERE Customer_ID NOT IN (SELECT Cust_ID FROM SUBSCRIBERS)
Could some use of joins be better and achieve the same result?
Any mature enough SQL database should be able to execute that just as effectively as the equivalent JOIN. Use whatever is more readable to you.
One reason why you might prefer to use a JOIN rather than NOT IN is that if the values in the NOT IN clause contain any NULLs, you will always get back no results. If you do use NOT IN, remember to always consider whether the subquery might bring back a NULL value!
RE: Question in Comments
'x' NOT IN (NULL,'a','b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x' <> 'b'
≡ Unknown and True and True
≡ Unknown
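The usual guard is to filter the NULLs out inside the subquery; a minimal sketch using the tables from the question:
SELECT *
FROM Customers
WHERE Customer_ID NOT IN (SELECT Cust_ID
                          FROM SUBSCRIBERS
                          WHERE Cust_ID IS NOT NULL)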
Maybe try this
Select cust.*
From dbo.Customers cust
Left Join dbo.Subscribers subs on cust.Customer_ID = subs.Cust_ID
Where subs.Cust_ID Is Null
SELECT Customers.*
FROM Customers
WHERE NOT EXISTS (
SELECT *
FROM SUBSCRIBERS AS s
WHERE s.Cust_ID = Customers.Customer_ID)
When using "NOT IN", the query may end up performing nested full table scans, whereas with "NOT EXISTS", the query can use an index within the subquery.
If you want to know which is more effective, you should try looking at the estimated query plans, or the actual query plans after execution. It'll tell you the costs of the queries (I find CPU and IO cost to be interesting). I wouldn't be surprised much if there's little to no difference, but you never know. I've seen certain queries use multiple cores on our database server, while a rewritten version of that same query would only use one core (needless to say, the query that used all 4 cores was a good 3 times faster). Never really quite put my finger on why that is, but if you're working with large result sets, such differences can occur without your knowing about it.
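For example, on SQL Server (an assumption here, given the mention of CPU and IO costs) you can have timings and reads reported alongside the results:
SET STATISTICS IO ON;    -- report logical/physical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time

SELECT c.*
FROM Customers c
LEFT JOIN Subscribers s ON c.Customer_ID = s.Cust_ID
WHERE s.Cust_ID IS NULL;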
Related
I have a Netezza query with a WHERE clause that includes several hundred potential strings. I'm surprised that it runs, but it takes time to complete and occasionally errors out ('transaction rolled back by client'). Here's a pseudo code version of my query.
SELECT
TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
X.I_SRC_NM AS CHANNEL,
X.I_CD AS CODE,
COUNT(DISTINCT CASE WHEN X.I_FLG = 1 THEN X.UID ELSE NULL END) AS WIDGETS
FROM
(SELECT
A.I_TS,
A.I_SRC_NM,
A.I_CD,
B.UID,
B.I_FLG
FROM
SCHEMA.DATABASE.TABLE_A A
LEFT JOIN SCHEMA.DATABASE.TABLE_B B ON A.UID = B.UID
WHERE
A.I_TS BETWEEN '2017-01-01' AND '2017-01-15'
AND B.TAB_CODE IN ('00AV', '00BX', '00C2', '00DJ'...
...
...
...
...
...
...
...)
) X
GROUP BY
X.I_TS,
X.I_SRC_NM,
X.I_CD
;
In my query, I'm limiting the results on B.TAB_CODE to about 1,200 values (out of more than 10k). I'm honestly surprised that it works at all, but it does most of the time.
Is there a more efficient way to handle this?
If the IN clause becomes too cumbersome, you can build your query in multiple parts. Create a temporary table containing the TAB_CODE set, then use it in a JOIN.
WITH tab_codes(tab_code) AS (
SELECT '00AV'
UNION ALL
SELECT '00BX'
--- etc ---
)
SELECT
TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
X.I_SRC_NM AS CHANNEL,
--- etc ---
INNER JOIN tab_codes Q ON B.TAB_CODE = Q.tab_code
If you want to boost performance even more, consider using a real temporary table (CTAS). We've seen situations where it's "cheaper" to CTAS the original table to another, distributed on your primary condition, and then query that table instead.
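A rough sketch of that idea with the names from the pseudocode (Netezza CTAS; the new table name is illustrative):
CREATE TABLE TABLE_B_DIST AS
SELECT UID, I_FLG, TAB_CODE
FROM SCHEMA.DATABASE.TABLE_B
DISTRIBUTE ON (TAB_CODE);  -- distribute on the column your IN clause filters on
Then point the LEFT JOIN at TABLE_B_DIST instead.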
If I'm guessing correctly, the X.I_TS is in fact a timestamp, and as such I expect it to contain many different values per day. Can you confirm that?
If I'm right, the query can possibly benefit from changing the 'GROUP BY X.I_TS, ...' to 'GROUP BY 1, ...' (that is, grouping by the first select expression, the TO_CHAR date).
Furthermore, the 'COUNT(DISTINCT CASE ...)' can never return anything other than 1 or NULL. Can you confirm that?
If I'm right on that, you can get rid of the expensive DISTINCT by changing it to 'MAX(CASE ...)'.
Can you follow me :)
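Put together, the two changes would make the outer query look something like this (a sketch that only holds if both guesses above are confirmed; the derived table is elided as in the question):
SELECT
    TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
    X.I_SRC_NM AS CHANNEL,
    X.I_CD AS CODE,
    MAX(CASE WHEN X.I_FLG = 1 THEN 1 ELSE NULL END) AS WIDGETS  -- 1 or NULL, no DISTINCT needed
FROM
    (...) X  -- the same derived table as in the question
GROUP BY
    1,  -- the TO_CHAR expression: one group per day instead of per timestamp
    X.I_SRC_NM,
    X.I_CD;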
I am trying to optimize a query that uses an inner join, and I am puzzled by the difference of performance between two very similar queries.
I hope to shed some light on this.
The tables look like this:
Aggregates:
+-recid(key)-+-avg---+
+------------+-------+
History:
+-recid(key)-+-value-+
+------------+-------+
The aim is to get, for a given key (let's assume 1234), avg and value.
I have tried two queries that seem very similar to me:
SELECT a.avg, b.value FROM aggregates a, history b
WHERE a.recid = b.recid
AND a.recid = 1234
Takes 5 seconds to run
But,
SELECT a.avg, b.value FROM aggregates a, history b
WHERE a.recid = 1234
AND b.recid = 1234
runs in less than a second.
Those two queries give the very same result. I would like to understand the huge difference in performance (the end game being a better understanding, to achieve better performance for this query!)
First, learn to use proper explicit JOIN syntax:
SELECT a.avg, h.value
FROM aggregates a JOIN
history h
ON a.recid = h.recid
WHERE a.recid = 1234;
This doesn't affect performance, but it is the correct, modern syntax.
Assuming you have indexes on aggregates(recid) and history(recid), then the two versions should have very similar execution plans in almost any database I can think of. Those two indexes would be recommended for queries like this.
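If they don't already exist, creating them is a one-liner each (index names are illustrative):
CREATE INDEX idx_aggregates_recid ON aggregates (recid);
CREATE INDEX idx_history_recid ON history (recid);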
One possibility is cold versus warm cache. The first time you run a query, the data needs to be loaded into memory. That can take longer. For proper timing, you need to take this into account.
Finally, if you really want to understand the difference, then you need to look at the execution plan. Most databases provide a simple way to "explain" how the query is run.
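For instance, in Oracle (an assumption; Oracle does come up later in this thread) you would run:
EXPLAIN PLAN FOR
SELECT a.avg, h.value
FROM aggregates a JOIN
     history h
     ON a.recid = h.recid
WHERE a.recid = 1234;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);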
I'm not absolutely sure, but it could be that your second query's execution plan is already cached, and thus the DB optimizer didn't need to build one. BTW, your first query should be changed as below to use ANSI-style JOIN syntax:
SELECT a.avg, b.value FROM aggregates a
JOIN history b ON a.recid = b.recid
WHERE a.recid = 1234
The second query might be performing a cross join and then filtering on the results, though it would have to be a very old version of Oracle to be that dumb. But you need to look at the query plan to find out. If they consistently exhibit different performance, then I guarantee the query plans will be different.
I have something like this:
SELECT CompanyId
FROM Company
WHERE CompanyId not in
(SELECT CompanyId
FROM Company
WHERE (IsPublic = 0) and CompanyId NOT IN
(SELECT ShoppingLike.WhichId
FROM Company
INNER JOIN
ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
WHERE (ShoppingLike.IsWaiting = 0) AND
(ShoppingLike.ShoppingScoreTypeId = 2) AND
(ShoppingLike.UserId = 75)
)
)
It has three SELECTs. I want to know how I could write it without three SELECTs, and which one has better speed for a million records: "select in select" or "left join"?
My experiences are from Oracle. There is never a single correct answer to optimising tricky queries; it's a collaboration between you and the optimiser. You need to check explain plans and sometimes traces, often at each stage of writing the query, to find out what the optimiser is thinking. Having said that:
You could remove the outer SELECT by putting the entire contents of its subquery's WHERE clause in a NOT(...) (see the sketch after this list). On the face of it, this will prevent that outer full scan of Company (or its index on CompanyId). Try it, check the output is the same and get timings, then remove it temporarily before trying the below. The NOT() may well cause the optimiser to stop considering an ANTI-JOIN against the ShoppingLike subquery due to an implicit OR being created.
Ensure that CompanyId and WhichId are defined as NOT NULL columns. Without this (or the likes of an explicit CompanyId IS NOT NULL) then ANTI-JOIN options are often discarded.
The innermost subquery is not correlated (it does not reference anything from its outer query), so it can be extracted and tuned separately. As a matter of style I'd swap the table names round the INNER JOIN, as you want ShoppingLike scanned first since it has all the filters against it. It won't make any difference, but it reads easier and makes it possible to use a hint to scan tables in the order specified. I would even question the need for the Company table in this subquery.
You've used NOT IN when sometimes the very similar NOT EXISTS gives the optimiser more/alternative options.
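Here is the sketch of the first suggestion, with the inner subquery kept verbatim from the question; check the output matches before timing it:
SELECT CompanyId
FROM Company
WHERE NOT (IsPublic = 0
           AND CompanyId NOT IN
               (SELECT ShoppingLike.WhichId
                FROM Company
                INNER JOIN
                ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
                WHERE (ShoppingLike.IsWaiting = 0) AND
                      (ShoppingLike.ShoppingScoreTypeId = 2) AND
                      (ShoppingLike.UserId = 75)))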
All the above is just trial and error unless you start checking the explain plan. Oracle can, with a following wind, convert between LEFT JOIN and IN SELECT. With 1M+ rows, the time invested will pay off.
I'm looking to optimize my SQL query and would like to know, in terms of optimization performance, if
SELECT A.this, B.another FROM A
JOIN B
ON A.this = B.that
WHERE B.another > 6
AND A.something < 3;
is better than:
SELECT A.this, B.another
FROM (SELECT this FROM A WHERE A.something < 3) AS A
JOIN (SELECT another, that FROM B WHERE B.another > 6) AS B
ON A.this = B.that;
The two queries will be identical when run. Try running the two queries preceded by explain analyze, and you should get the exact same query plan.
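For example, with the first version (Postgres syntax, as discussed below):
EXPLAIN ANALYZE
SELECT A.this, B.another FROM A
JOIN B
ON A.this = B.that
WHERE B.another > 6
AND A.something < 3;
Do the same with the second version and compare the two plans.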
In a nutshell, Postgres will parse the query and come up with a query tree.
It'll then rewrite the query tree when appropriate, to remove redundant things such as a 1 = 1 where clause, replace a small IN () clause with its equivalent using ANY, or, in your case, collapse your two subqueries into the parent query. The reason it'll do so is that it'll basically view them as select (select ...), with the inner select not being subjected to an aggregate, group by, or limit clause.
Only then does it spend some time analyzing the query tree itself in search for a query plan it deems sufficiently optimal. And finally execute the query plan to return the rows you've asked for.
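As a small illustration of that rewriting step (a sketch reusing table A from the question, assuming this is an integer column):
-- These two produce the same plan; the rewriter turns the IN list into = ANY:
SELECT this FROM A WHERE this IN (1, 2, 3);
SELECT this FROM A WHERE this = ANY (ARRAY[1, 2, 3]);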
For the gory details you'll want to check out the Postgres manual on performance tips, as well as the explain and explain analyze syntax:
http://www.postgresql.org/docs/current/static/sql-explain.html
http://www.postgresql.org/docs/current/static/performance-tips.html
http://www.postgresql.org/docs/current/static/planner-stats-details.html
http://www.postgresql.org/docs/current/static/geqo.html
https://wiki.postgresql.org/wiki/Performance_Optimization
Also note the Postgres performance mailing list, whose archives you'll find here:
http://www.postgresql.org/list/pgsql-performance/
Studying them is well worth the effort, in the sense that doing so will give you a wealth of insights into what is going on under the hood. Pay particular attention to messages written by Tom Lane -- he occasionally gives a first-hand account of what's going on in the query rewriter and planner.
I often find myself wanting to write an SQL query like the following:
SELECT body
FROM node_revisions
where vid = (SELECT vid
FROM node
WHERE nid = 4);
I know that there are joins and stuff you could do, but they seem to make things more complicated. Are joins a better way to do it? Is it more efficient? Easier to understand?
Joins tend to be more efficient since databases are written with set operations in mind (and joins are set operations).
However, performance will vary from database to database, how the tables are structured, the amount of data in them and how much will be returned by the query.
If the amount of data is small, I would use a subquery like yours rather than a join.
Here is what a join would look like:
SELECT body
FROM node_revisions nr
INNER JOIN node n
ON nr.vid = n.vid
WHERE n.nid = 4
I would not use the query you posted, as there is a chance of more than one node record with nid = 4, which would cause it to fail.
I would use:
SELECT body
FROM node_revisions
WHERE vid IN (SELECT vid
FROM node
WHERE nid = 4);
Is this more readable or understandable? In this case, it's a matter of personal preference.
I think joins are easier to understand and can be more efficient. Your case is pretty simple, so it is probably a toss-up. Here is how I would write it:
SELECT body
FROM node_revisions
inner join node
on (node_revisions.vid = node.vid)
WHERE node.nid = 4
The answer to any performance related questions in databases is it depends, and we're short on details in the OP. Knowing no specifics about your situation... (thus, these are general rules of thumb)
Joins are better and easier to understand
If for some reason you need multiple column keys (fishy), you can continue to use a join and simply tack on another expression to the join condition (see the sketch after this list).
If in the future you really do need to join auxiliary data, the join framework is already there.
It makes it more clear exactly what you're joining on and where indexes should be implemented.
Use of joins makes you better at joins and better at thinking about joins.
Joins are clear about what tables are in play
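For the multiple-column case mentioned above, the join condition just grows another predicate; a sketch with a hypothetical second key column:
SELECT body
FROM node_revisions nr
INNER JOIN node n
    ON nr.vid = n.vid
   AND nr.language = n.language  -- hypothetical second key column, for illustration
WHERE n.nid = 4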
Written queries have nothing to do with efficiency*
The queries you write and what actually gets run have little to do with one another. There are many ways to write a query but only a few ways to fetch the data, and it's up to the query engine to decide. This relates mostly to indexes. It's very possible to write four queries that look totally different but internally do the same thing.
(* It's possible to write a horrible query that is inefficient but it takes a special kind of crazy to do that.)
select
body
from node_revisions nr
join node n
on n.vid = nr.vid
where n.nid = 4
A join is interesting:
select body
from node_revisions nr
join node n on nr.vid = n.vid
where n.nid = 4
But you can also express a join without a join [!]:
select body
from node_revisions nr, node n
where n.nid = 4 and nr.vid = n.vid
Interestingly enough, SQL Server gives a slightly different query plan for these two queries: while the join has a clustered index scan, the "join without a join" has a clustered index seek in its place, which indicates it's better, at least in this case!
select
body
from node_revisions A
where exists (select 'x'
from Node B
Where A.Vid = B.Vid and B.NID=4)
I don't see anything wrong with what you wrote, and a good optimizer may even change it to a join if it sees fit.
SELECT body
FROM node_revisions
WHERE vid =
(
SELECT vid
FROM node
WHERE nid = 4
)
This query is logically equivalent to a join if and only if nid is a PRIMARY KEY or is covered by a UNIQUE constraint.
Otherwise, the queries are not equivalent: a join will always succeed, while the subquery will fail if there is more than one row in node with nid = 4.
If nid is a PRIMARY KEY, then the JOIN and the subquery will have same performance.
In case of a join, node will be made the leading table.
In case of a subquery, the subquery will be executed once and transformed into a constant at the parsing stage.
The latest MySQL 6.x code will automatically convert that IN expression into an INNER JOIN using a semi-join subquery optimization, making the 2 statements largely equivalent:
http://forge.mysql.com/worklog/task.php?id=3740
But actually writing it out is pretty simple to do, because INNER JOIN is the default join type, and doing this wouldn't rely on the server optimizing it away (which it might decide not to for some reason, and which wouldn't necessarily be portable). All things being equal, why not go with:
select body from node_revisions r, node n where r.vid = n.vid and n.nid = 4