Suppose I have a table with a column that takes values from 1 to 10. I need to select all rows except those where the value is 9 or 10. Will there be a difference (performance-wise) between this query:
SELECT * FROM tbl WHERE col NOT IN (9, 10)
and this one?
SELECT * FROM tbl WHERE col IN (1, 2, 3, 4, 5, 6, 7, 8)
Use "IN" as it will most likely make the DBMS use an index on the corresponding column.
"NOT IN" could in theory also be translated into an index usage, but in a more complicated way which DBMS might not "spend overhead time" using.
When it comes to performance you should always profile your code (i.e. run your queries a few thousand times and measure each loop's performance using some kind of stopwatch).
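If you'd rather measure on the server side, and assuming your DBMS is MySQL (an assumption on my part), the built-in query profiler gives per-statement timings:

SET profiling = 1;  -- deprecated in recent MySQL versions, but handy for quick comparisons
SELECT * FROM tbl WHERE col NOT IN (9, 10);
SELECT * FROM tbl WHERE col IN (1, 2, 3, 4, 5, 6, 7, 8);
SHOW PROFILES;  -- lists each statement above with its duration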
But here I highly recommend using the first query for better future maintainability. The logic is that you need all records except 9 and 10. If you later add the value 11 to your table and use the second query, your application's logic will be broken, which will of course lead to a bug.
Edit: I remember this being tagged as php, which is why I provided my sample in PHP, but I might be mistaken. I guess it won't be hard to rewrite that sample in the language you're using.
I have seen Oracle have trouble optimizing some queries with NOT IN if columns are nullable. If you can write your query either way, IN is preferred as far as I'm concerned.
For a list of constants, MySQL will internally expand your code to:
SELECT * FROM tbl WHERE ((col <> 9 and col <> 10))
The same goes for the other one, with eight = comparisons instead.
So yes, the first one will be faster: fewer comparisons to be done. The chances that this is measurable are negligible, though; the overhead of a handful of constant comparisons is nothing compared to the general overhead of parsing the SQL and retrieving the data.
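You can watch MySQL perform this rewrite yourself: run EXPLAIN on the query and then SHOW WARNINGS, which prints the statement as the optimizer rewrote it (on very old versions you need EXPLAIN EXTENDED instead):

EXPLAIN SELECT * FROM tbl WHERE col NOT IN (9, 10);
SHOW WARNINGS;
-- the Note 1003 row contains something like:
-- /* select#1 */ select ... from tbl where ((tbl.col <> 9) and (tbl.col <> 10))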
"IN" statement works internally like a serie of "OR" statements.
For example:
SELECT * FROM tbl WHERE col IN (1, 2, 3)
is equivalent to
SELECT * FROM tbl WHERE col = 1 OR col = 2 OR col = 3
"OR" statements could cause some performance issues as explained in this article:
https://bertwagner.com/2018/02/20/or-vs-union-all-is-one-better-for-performance/
When you do a NOT IN statement, it all works the same way, except the result is logically negated. BUT, you can write an equivalent query with much better performance. In your example:
SELECT * FROM tbl WHERE col NOT IN (9, 10)
is equivalent to
SELECT * FROM tbl WHERE col <> 9 AND col <> 10
With an "AND" statement, the database stop analizing when one of all conditionals its false, so, its much better in performance than "OR" used in "IN" statement.
I was always bothered by how I should approach these and which solution is better. I guess the sample code will explain it best.
Lets imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, with each calculation based on a previous one. In other words, something like this:
SELECT
*,
Value + 10 AS NewValue1,
Value / NewValue1 AS SomeOtherValue,
(Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
*,
Value + 10 AS NewValue1,
Value / (Value + 10) AS SomeOtherValue,
(Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
The other one involves constructing a multilevel query like this:
SELECT
t2.*,
(t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
SELECT
t1.*,
t1.Value / t1.NewValue1 AS SomeOtherValue
FROM
(
SELECT
*,
Value + 10 AS NewValue1
FROM
MyTable
WHERE
Name LIKE "A%"
) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.
I have tried a number of different combinations of calculations in both variants. They always produced the same execution plan, so it can be assumed that there is no difference performance-wise. From the code-usability perspective the first approach is obviously better, as the code is more readable and compact.
There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes back to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation for the individual expressions. Without knowing the order, SQL cannot guarantee that the alias is defined before it is evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one variable to the other because it is expedient. In my opinion, this is something that the writer of the query needs to decide on, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.
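For illustration, here is the question's multi-level query rewritten with CTEs (a sketch reusing the names from the question; the logic is unchanged):

WITH t1 AS (
    SELECT *, Value + 10 AS NewValue1
    FROM MyTable
    WHERE Name LIKE 'A%'
), t2 AS (
    SELECT t1.*, t1.Value / t1.NewValue1 AS SomeOtherValue
    FROM t1
)
SELECT
    t2.*,
    (t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM t2;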
I have a class (that reflects a db row) with more than 200 instances created within the code's bootstrap. Each one runs a single SELECT query whose only condition is WHERE `tblA`.`AID` = #. I was thinking of creating a single query that combines the 200 WHERE clauses with OR logic and then creates the 200 objects from the result set, so that only one query is run.
I am implementing this on a test server at the moment, but I was wondering whether this is a bad step for efficiency, and at what point it would be better to do two sets of queries, each taking care of half the clauses (or however many more need to be made)?
Additionally, I am writing a performance enhancer into it to replace something like
WHERE `tblA`.`AID` = 2 OR `tblA`.`AID` = 3 OR `tblA`.`AID` = 5 OR `tblA`.`AID` = 6 OR `tblA`.`AID` = 7
with
WHERE (`tblA`.`AID` >= 2 AND `tblA`.`AID` <= 3) OR (`tblA`.`AID` >= 5 AND `tblA`.`AID` <= 7)
or even
WHERE `tblA`.`AID` >= 2 AND `tblA`.`AID` <= 7 AND `tblA`.`AID` <> 4
If you have a discrete list, then just use in . . .
where AID in (2, 3, 5, 6, 7, . . .)
And let the SQL engine worry about the optimization.
The biggest hit is likely to be the time to parse the query and sending a large query to the engine. If your list gets really long, then consider putting the list in a temporary table, building an index on the table, and doing a join.
You don't specify what database you are using, but this advice is pretty database-agnostic.
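To make the temporary-table suggestion concrete, here is a sketch in MySQL syntax (guessing MySQL from the backticks in your question; adjust for your DBMS):

-- hypothetical staging table for the 200 ids
CREATE TEMPORARY TABLE wanted_ids (
    AID INT NOT NULL,
    PRIMARY KEY (AID)  -- the index mentioned above
);
INSERT INTO wanted_ids (AID) VALUES (2), (3), (5), (6), (7) /* , ... */;

SELECT a.*
FROM tblA a
JOIN wanted_ids w ON w.AID = a.AID;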
You still haven't specified DBMS.
For SQL Server this might be modestly worthwhile (though you may well want to consider joining on a table valued parameter or similar rather than having a lengthy IN list anyway).
SQL Server will do separate individual seeks rather than collapse them into contiguous ranges. This is covered thoroughly in the article When is a Seek not a Seek?, but here are some examples.
CREATE TABLE T (X int PRIMARY KEY)
INSERT INTO T
SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY @@SPID)
FROM master..spt_values v1, master..spt_values v2
SET STATISTICS IO ON;
SELECT *
FROM T
WHERE X IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9,10,
11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,
41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,
61,62,63,64)
Table 'T'. Scan count 64, logical reads 192
SELECT *
FROM T
WHERE X BETWEEN 1 AND 64
Table 'T'. Scan count 1, logical reads 3
As mentioned in the comments on the article, for more than 64 values you will get a slightly different plan that adds a table of constants and a nested loops join into the mix.
I have two ways to select a set of entries from the database:
SELECT ... WHERE `level` IN (1,2,4,8) LIMIT ...;
or
SELECT ... WHERE `level` & mask LIMIT ...;
There are 4 'levels' total, numbered 1, 2, 4, 8 (so that the same mask can be used elsewhere too). Both the IN() list and the mask can contain any set of one or more of the 4 levels. The column is indexed. The query is still taking far longer than is comfortable, and we're trying to optimize it for speed.
Yesterday one person said that using a naive IN() results in up to four comparisons and that I should be using a bit mask instead. Today I heard that the bit mask completely thwarts the advantage of the index on the column and will be much slower.
Can you tell me which approach will be faster?
Your question is quite old, but I'm still gonna answer it nonetheless.
The bitmask will most probably be slower, as it has to compute the bitwise AND for each row, whereas IN can use the index on level to look up each of the arguments enclosed within the parentheses (which, I believe, should be a single O(log n) operation apiece).
Now, the thing that you may be missing, is that they don't do the same thing.
Your first query will simply check if level is either 1, 2, 4 or 8.
Your second query, or actually something like:
SELECT ... WHERE (`level` & mask) = mask LIMIT ...;
has the ability to look up levels that contain the mask you want and potentially some more bits; in your case it could match any of the value combinations between 1 and 15 that include the mask. Hence the performance hit.
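To make that concrete, a quick sketch (SQLite syntax, to match the EXPLAIN examples below) of which of the sixteen possible values satisfy a mask of 3:

WITH levels(level) AS (
    VALUES (0), (1), (2), (3), (4), (5), (6), (7),
           (8), (9), (10), (11), (12), (13), (14), (15)
)
SELECT level FROM levels WHERE (level & 3) = 3;
-- returns 3, 7, 11, 15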
As for the brute-forced benchmark @AlanFoster suggested, I don't agree with him.
It's far better to prefix the query with either:
EXPLAIN, or
EXPLAIN QUERY PLAN
And inspect how many rows SQLite is matching.
Update
EXPLAIN QUERY PLAN SELECT * FROM ... WHERE level IN (2, 3);
SEARCH TABLE ... USING INDEX ..._level (level=?) (~20 rows)
EXPLAIN QUERY PLAN SELECT * FROM ... WHERE (level & 2) = 2;
SCAN TABLE ... (~500000 rows)
As you can see, the bitwise AND operator needs a full-table scan.
Ok, so I have a query:
select distinct(a)
from mytable
where
b in (0,3)
What is going to be faster, the above or
select distinct(a)
from mytable
where
b = 0
or
b = 3
Is there a general rule?
Thanks
Both IN and OR will do a query for b = 0 followed by one for b = 3, and then do a merge join on the two result sets, and finally filter out any duplicates.
With IN, duplicates don't really make sense, because b can't be both 0 and 3. But IN will be converted to b = 0 OR b = 3, and with OR duplicates do make sense: you could have b = 0 OR a = 3, and if you merged the two separate result sets, you could end up with a duplicate for each record that matched both criteria.
So duplicate filtering will always be done, regardless of whether you're using IN or OR. However, if you know from the outset that you will not have any duplicates - which is usually the case when you're using IN - then you can gain some performance by using UNION ALL, which doesn't filter out duplicates:
select distinct(a)
from mytable
where
b = 0
UNION ALL
select distinct(a)
from mytable
where
b = 3
As far as I know, IN converts to OR. So the performance is the same. Just a shorter way of writing it.
Hopefully in this simple example it won't make any difference which version you use (as the query optimiser should turn them into equivalent queries under the hood); however, there's a fair chance the outcome will depend on the indexes you have on mytable. I would suggest that you run both queries in SQL Server Management Studio with "Include Actual Execution Plan" turned on, and compare the results to determine which query has the lowest "cost" in your scenario.
To do this:
Put your query (or queries) into a new SQL Server Management Studio query window
Right click on the window in the space you've typed into
Click "Include Actual Execution Plan"
Run your query as you would usually
The bottom "results" half of the window will now have a 3rd tab showing, "Execution Plan" which should contain two "flowcharts", one for the first query and another for the second. If the two are identical, then Sql Server has treated the two queries as equivalent and therefore you should choose whichever form you and/or your colleagues prefer.
The problem: we have a very complex search query. If its result yields too few rows we expand the result by UNIONing the query with a less strict version of the same query.
We are discussing whether a different approach would be faster and/or better in quality. Instead of UNIONing, we would create a custom SQL function that returns a matching score. Then we could simply order by that score.
Regarding performance: will it be slower than a UNION?
We use PostgreSQL.
Any suggestions would be greatly appreciated.
Thank you very much
Max
A definitive answer can only be given if you measure the performance of both approaches in realistic environments. Everything else is guesswork at best.
There are so many variables at play here - the structure of the tables and the types of data in them, the distribution of the data, what kind of indices you have at your disposal, how heavy the load on the server is - it's almost impossible to predict any outcome, really.
So really - my best advice is: try both approaches, on the live system, with live data, not just with a few dozen test rows - and measure, measure, measure.
Marc
You want to order by the "return value" of your custom function? Then the database server can't use an index for that. The score has to be calculated for each record in the table (that hasn't been excluded by a WHERE clause) and stored in some temporary storage/table, and the ORDER BY is then performed on that temporary table. So this can easily end up slower than your UNION queries (depending on your UNION statements, of course).
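A minimal sketch of what that looks like, using a hypothetical match_score function and made-up items/title names:

-- hypothetical scoring function; the real one would encode your
-- strict vs. less-strict matching rules
CREATE FUNCTION match_score(title text, query text) RETURNS integer AS $$
    SELECT CASE
        WHEN $1 = $2 THEN 2                    -- strict match
        WHEN $1 ILIKE '%' || $2 || '%' THEN 1  -- relaxed match
        ELSE 0
    END;
$$ LANGUAGE sql IMMUTABLE;

SELECT *
FROM items
WHERE match_score(title, 'foo') > 0       -- computed for every candidate row
ORDER BY match_score(title, 'foo') DESC;  -- then sorted, with no index to help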
To add my little bit...
+1 to marc_s, I completely agree with what he said - I would only say that you need a test DB server with realistic data volumes to test on, as opposed to the production server.
For the function approach, the function would be executed for each record and the rows then ordered by the result - this is not an indexed column, so I'd expect to see a negative impact on performance. However, how big that impact is, and whether it actually loses to the cumulative time of the other approach, can only be known by testing.
In PostgreSQL 8.3 and below, UNION implied DISTINCT, which implied sorting. That means ORDER BY, UNION and DISTINCT were always of the same efficiency, since the latter two always used sorting.
On PostgreSQL 8.3, this query returns the sorted results:
SELECT *
FROM generate_series(1, 10) s
UNION
SELECT *
FROM generate_series(5, 15) s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Since PostgreSQL 8.4, it has been possible to use HashAggregate for UNION, which may be faster (and almost always is), but does not guarantee ordered output.
The same query returns the following on PostgreSQL 8.4:
SELECT *
FROM generate_series(1, 10) s
UNION
SELECT *
FROM generate_series(5, 15) s
10
15
8
6
7
11
12
2
13
5
4
1
3
14
9
As you can see, the results are not sorted.
PostgreSQL change list mentions this:
SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)
So in newer PostgreSQL versions, I'd advise using UNION, since it's more flexible.
In older versions, the performance will be the same.
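And on 8.4 and later, if you need sorted output from a UNION, ask for it explicitly:

SELECT *
FROM generate_series(1, 10) s
UNION
SELECT *
FROM generate_series(5, 15) s
ORDER BY 1;  -- explicit sort; UNION alone no longer guarantees order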