Why would this query in Postgres result in a 15-day lock?

Everything in my database was running normally -- reads, writes, lots of activity.
Then I wanted to add a column to the foos table. The foos table became unavailable. I quit the code executing the query and looked at locks in the system. I found the below query had a lock for 15 days. After that was my table-changing query, and after that were a bunch more queries which involved the foos table.
What would cause this query to get stuck for 15 days? This is on PostgreSQL 9.1.3.
select generate_report, b.count from
(select count(1), date_trunc('hour',f.event_happened_at) from
foos as f, bars as b
where age(f.event_happened_at) <= interval '24 hour' and f.id=b.foo_id and b.thing_type='Dog' and b.thing_id=26631
group by date_trunc('hour',f.event_happened_at)) as e
right join generate_report(date_trunc('hour',now()) - interval '24 hour',now(),interval '1 hour')
on generate_report = b.date_trunc
order by generate_report;
update: info from pg_stat_activity
         backend_start         |          xact_start           |          query_start          | waiting
-------------------------------+-------------------------------+-------------------------------+---------
 2012-11-19 18:38:40.029818+00 | 2012-11-19 18:38:40.145172+00 | 2012-11-19 18:38:40.145172+00 | f
update: output of explain:
Merge Left Join  (cost=14135.74..14138.08 rows=1000 width=16)
  Merge Cond: (generate_report.generate_report = (date_trunc('hour'::text, f.event_happened_at)))
  ->  Sort  (cost=12.97..13.47 rows=1000 width=8)
        Sort Key: generate_report.generate_report
        ->  Function Scan on generate_report  (cost=0.00..3.00 rows=1000 width=8)
  ->  Sort  (cost=14122.77..14122.81 rows=67 width=16)
        Sort Key: (date_trunc('hour'::text, f.event_happened_at))
        ->  HashAggregate  (cost=14121.93..14122.17 rows=67 width=8)
              ->  Hash Join  (cost=3237.14..14121.86 rows=67 width=8)
                    Hash Cond: (b.foo_id = f.id)
                    ->  Index Scan using index_bars_on_thing_type_and_thing_id_and_baz on bars b  (cost=0.00..10859.88 rows=10937 width=4)
                          Index Cond: (((thing_type)::text = 'Dog'::text) AND (thing_id = 26631))
                    ->  Hash  (cost=3131.42..3131.42 rows=30207 width=12)
                          ->  Seq Scan on foos f  (cost=0.00..3131.42 rows=30207 width=12)
                                Filter: (age((('now'::text)::date)::timestamp without time zone, event_happened_at) <= '24:00:00'::interval)

Per the info from pg_stat_activity you posted, it looks like this query is still executing (waiting = f). This means that the lock just has not been released yet.
You may want to start taking a look at your query to see if there are problems with its structure or the query plan it is generating. Fifteen days is definitely too long; most long-running queries should take no more than 10 minutes before they are considered a problem.
For assistance with that, you will need to post your table DDL, some sample data, and some idea of how many rows are in each table. That would probably be best posed as a new question, but you can always edit this one.
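In the meantime, here is a minimal sketch of how you might see which sessions hold or wait on locks against foos, and cancel the offending backend if needed (column names assume a 9.1-era pg_stat_activity, where the fields are procpid and current_query; the pid below is a placeholder):
-- which sessions hold or wait for locks on foos, oldest transaction first
SELECT l.locktype, l.relation::regclass, l.mode, l.granted,
       a.procpid, a.waiting, a.xact_start, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE l.relation = 'foos'::regclass
ORDER BY a.xact_start;
-- cancel (or, more drastically, terminate) a stuck backend by pid
SELECT pg_cancel_backend(12345);
-- SELECT pg_terminate_backend(12345);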

Related

How to optimize this "select count" SQL? (postgres array comparison)

There is a table with 10 million records, and it has a column whose type is an array. It looks like:
id | content | contained_special_ids
----------------------------------------
1 | abc | { 1, 2 }
2 | abd | { 1, 3 }
3 | abe | { 1, 4 }
4 | abf | { 3 }
5 | abg | { 2 }
6 | abh | { 3 }
I want to know how many records there are whose contained_special_ids includes 3, so my SQL is:
select count(*) from my_table where contained_special_ids @> array[3]
It works fine when the data is small, but it takes a long time (about 30+ seconds) when the table has 10 million records.
I have added an index to this column:
"index_my_table_on_contained_special_ids" gin (contained_special_ids)
So, how to optimize this select count query?
Thanks a lot!
UPDATE
below is the explain:
Finalize Aggregate  (cost=1049019.17..1049019.18 rows=1 width=8) (actual time=44343.230..44362.224 rows=1 loops=1)
  Output: count(*)
  ->  Gather  (cost=1049018.95..1049019.16 rows=2 width=8) (actual time=44340.332..44362.217 rows=3 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=1048018.95..1048018.96 rows=1 width=8) (actual time=44337.615..44337.615 rows=1 loops=3)
              Output: PARTIAL count(*)
              Worker 0: actual time=44336.442..44336.442 rows=1 loops=1
              Worker 1: actual time=44336.564..44336.564 rows=1 loops=1
              ->  Parallel Bitmap Heap Scan on public.my_table  (cost=9116.31..1046912.22 rows=442694 width=0) (actual time=330.602..44304.221 rows=391431 loops=3)
                    Recheck Cond: (my_table.contained_special_ids @> '{12511}'::bigint[])
                    Rows Removed by Index Recheck: 501077
                    Heap Blocks: exact=67496 lossy=109789
                    Worker 0: actual time=329.547..44301.513 rows=409272 loops=1
                    Worker 1: actual time=329.794..44304.582 rows=378538 loops=1
                    ->  Bitmap Index Scan on index_my_table_on_contained_special_ids  (cost=0.00..8850.69 rows=1062465 width=0) (actual time=278.413..278.414 rows=1176563 loops=1)
                          Index Cond: (my_table.contained_special_ids @> '{12511}'::bigint[])
Planning Time: 1.041 ms
Execution Time: 44362.262 ms
Increase work_mem until the lossy blocks go away. Also, make sure the table is well vacuumed to support index-only bitmap scans, and that you are using a new enough version (which you should tell us) to support those. Finally, you can try increasing effective_io_concurrency.
Also, post plans as text, not images; and turn on track_io_timing.
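As a rough illustration of those suggestions (the values are just starting points to experiment with, not recommendations):
-- raise work_mem for this session until "lossy" disappears from the bitmap heap scan
SET work_mem = '256MB';
-- keep the visibility map current so the heap scan has less recheck work to do
VACUUM (ANALYZE) my_table;
-- include I/O timings in the plan output (requires sufficient privileges)
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM my_table WHERE contained_special_ids @> array[3];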
There is no way to optimize such a query, due to two factors:
The use of a non-atomic value, which violates FIRST NORMAL FORM
The fact that PostgreSQL cannot perform aggregate computations quickly
On the first problem... 1st NORMAL FORM
Each value in a table's columns must be atomic, and an array containing multiple values is of course not atomic.
So no index will be truly efficient on such a column, because its type violates 1NF.
This can be mitigated by using a separate table instead of an array.
On the poor performance of PG's aggregates
PG uses an MVCC model that keeps, in the same data pages, both dead ("phantom") row versions and valid rows, so to count the valid rows it has to read every record one by one to distinguish the ones that should be counted from the ones that must not be.
Most other DBMSs do not work like PG; Oracle and SQL Server, for example, do not keep dead row versions inside the data pages, and some others keep the exact count of valid rows in the page header.
As an example, see the tests I have done comparing COUNT and other aggregate functions between PG and SQL Server; some queries run 1500 times faster on SQL Server.
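If you do want to try the normalization route described above, a minimal sketch might look like this (the side table and its columns are assumptions based on the question):
-- hypothetical side table replacing the array column
CREATE TABLE my_table_special_ids (
    my_table_id bigint NOT NULL REFERENCES my_table (id),
    special_id  bigint NOT NULL,
    PRIMARY KEY (special_id, my_table_id)  -- leading on special_id serves the count below
);
-- the original count becomes a plain index-range count
SELECT count(*) FROM my_table_special_ids WHERE special_id = 3;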

postgresql st_contains performance

SELECT
    a.geom, 'tk' category,
    ROUND(avg(tk), 1) tk
FROM
    tb_grid_4326_100m a
    LEFT OUTER JOIN (
        SELECT
            tk - 273.15 tk, geom
        FROM
            tb_points
        WHERE
            hour = '23'
    ) b ON st_contains(a.geom, b.geom)
GROUP BY
    a.geom
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate  (cost=54632324.85..54648025.25 rows=50698 width=184) (actual time=8522.042..8665.129 rows=50698 loops=1)
  Group Key: a.geom
  ->  Gather Merge  (cost=54632324.85..54646504.31 rows=101396 width=152) (actual time=8522.032..8598.567 rows=50698 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=54631324.83..54633800.68 rows=50698 width=152) (actual time=8490.577..8512.725 rows=16899 loops=3)
              Group Key: a.geom
              ->  Sort  (cost=54631324.83..54631785.36 rows=184212 width=130) (actual time=8490.557..8495.249 rows=16996 loops=3)
                    Sort Key: a.geom
                    Sort Method: external merge  Disk: 2296kB
                    Worker 0:  Sort Method: external merge  Disk: 2304kB
                    Worker 1:  Sort Method: external merge  Disk: 2296kB
                    ->  Nested Loop Left Join  (cost=0.41..54602621.56 rows=184212 width=130) (actual time=1.729..8475.942 rows=16996 loops=3)
                          ->  Parallel Seq Scan on tb_grid_4326_100m a  (cost=0.00..5866.24 rows=21124 width=120) (actual time=0.724..2.846 rows=16899 loops=3)
                          ->  Index Scan using sidx_tb_points on tb_points  (cost=0.41..2584.48 rows=10 width=42) (actual time=0.351..0.501 rows=1 loops=50698)
                                Index Cond: (((hour)::text = '23'::text) AND (geom @ a.geom))
                                Filter: st_contains(a.geom, geom)
                                Rows Removed by Filter: 0
Planning Time: 1.372 ms
Execution Time: 8667.418 ms
I want to join 100m grid table, 100,000 points table using st_contains function.
The 100m grid table has 75,769 records, and tb_points table has 2,434,536 records.
When a time condition is given, the tb_points table returns about 100,000 records.
(As a result, about 75,000 records JOIN about 100,000 records.)
(Index information)
100m grid table using gist(geom),
tb_points table using gist(hour, geom)
It took 30 seconds. How can I improve the performance?
It is hard to give a definitive answer, but here are several things you can try:
For a multicolumn gist index, it is often a good idea to put the most selectively used column first. In your case, that would have the index be on (geom, hour), not (hour, geom). On the other hand, it can also be better to put the faster column first, and testing for scalar equality should be much faster than testing for containment. You would have to do the test and see which factor is more important for you.
You could try for an index-only scan, which doesn't need to visit the table. That could save a lot of random IO. To do that you would need the index gist (hour, geom) INCLUDE (tk, geom). The geom column in a gist index is not considered to be "returnable", so it also needs to be put in the INCLUDE part in order to get the IOS.
Finally, you could partition the table tb_points on "hour". Then you wouldn't need to put "hour" into the gist index, as it is already fulfilled by the partitioning.
And these can be mixed and matched, so you could also swap the column order in the INCLUDE index, or you could try to get both partitioning and the INCLUDE index working together.
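For what it's worth, a minimal sketch of the first suggestion (the index name is made up; btree_gist is assumed to be needed for the scalar hour column, and the existing gist(hour, geom) index suggests it is already installed):
-- allow scalar columns such as hour in a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;
-- put the (presumably more selective) containment column first
CREATE INDEX idx_tb_points_geom_hour ON tb_points USING gist (geom, hour);
As the answer says, only testing against your own data will tell which column ordering actually wins.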

Get latest records per timestamp from large table - Index is not used

I have several staging tables where records are inserted/updated (not deleted) regularly.
Each table has a 'BEFORE UPDATE' trigger updating a timestamp column with the current timestamp.
There is a process running periodically fetching the latest records (delta) from each staging table based on a timestamp that is stored in a control table. This is done using a materialized view.
The control table is updated with the max(timestamp) found from the materialized views every time the above process runs.
Control table:
id | staging_table_name | input_last_update_timestamp |
---+--------------------+-----------------------------+
1 | stg_table1 | 2018-06-29 12:57:19 |
2 | stg_table2 | 2018-06-29 13:52:19 |
stg_table1
id | internal_timestamp
--------+--------------------
6875303 | 2018-06-29 14:18:17
6874765 | 2018-06-29 14:18:17
6875095 | 2018-06-29 14:18:17
6867996 | 2018-06-29 14:18:17
6873723 | 2018-06-29 14:18:17
6874594 | 2018-06-29 14:18:17
6868561 | 2018-06-29 14:18:17
6875292 | 2018-06-29 14:18:00
6874595 | 2018-06-29 14:18:00
6875300 | 2018-06-29 14:18:00
I have tried the following queries, but none of them use the index I have on the internal_timestamp column of the staging table.
Query1:
SELECT
p.id,
p.internal_timestamp
FROM
staging_scm.stg_table1 p,
control_staging_scm.control_table o
WHERE
p.internal_timestamp > o.input_last_update_timestamp
AND o.id = 21
Query2
SELECT
p.id,
p.internal_timestamp
FROM
staging_scm.stg_table1 p
JOIN
control_staging_scm.control_table o ON p.internal_timestamp > o.input_last_update_timestamp
WHERE
o.id = 21
Query3
SELECT
p.id,
p.internal_timestamp
FROM
staging_scm.stg_table1 p
WHERE
p.internal_timestamp > (SELECT o.input_last_update_timestamp
FROM control_staging_scm.control_table o
WHERE o.id = 21)
Explain plans:
Query 1 and 2
Nested Loop (cost=0.03..203273.39 rows=1539352 width=12) (actual time=2013.969..2058.475 rows=520 loops=1)
Join Filter: (p.internal_timestamp > o.input_last_update_timestamp)
Rows Removed by Join Filter: 4615088
Buffers: shared hit=173254
-> Index Scan using control_table_pkey on control_table o (cost=0.03..4.03 rows=1 width=8) (actual time=0.011..0.014 rows=1 loops=1)
Index Cond: (id = 21)
Buffers: shared hit=2
-> Seq Scan on stg_table1 p (cost=0.00..187106.17 rows=4618055 width=12) (actual time=0.003..419.628 rows=4615608 loops=1)
Buffers: shared hit=173252
Planning time: 0.110 ms
Execution time: 2058.533 ms
Query 3
Seq Scan on stg_table1 p (cost=4.03..189419.23 rows=1539352 width=12) (actual time=2020.801..2054.617 rows=675 loops=1)
Filter: (internal_timestamp > $0)
Rows Removed by Filter: 4614988
Buffers: shared hit=173254
InitPlan 1 (returns $0)
-> Index Scan using control_table_pkey on control_table o (cost=0.03..4.03 rows=1 width=8) (actual time=0.013..0.014 rows=1 loops=1)
Index Cond: (id = 21)
Buffers: shared hit=2
Planning time: 0.155 ms
Execution time: 2054.694 ms
When I set enable_seqscan = OFF the index is used and the performance is orders of magnitude better
Explain Plan (Seqscan OFF)
Nested Loop (cost=41794.55..225088.07 rows=1539618 width=12) (actual time=0.100..0.557 rows=407 loops=1)
Buffers: shared hit=97
-> Index Scan using control_table_pkey on control_table o (cost=0.03..4.03 rows=1 width=8) (actual time=0.010..0.011 rows=1 loops=1)
Index Cond: (id = 21)
Buffers: shared hit=2
-> Bitmap Heap Scan on stg_table1 p (cost=41794.52..220465.18 rows=1539618 width=12) (actual time=0.085..0.317 rows=407 loops=1)
Recheck Cond: (internal_timestamp > o.input_last_update_timestamp)
Heap Blocks: exact=90
Buffers: shared hit=95
-> Bitmap Index Scan on stg_table1_internal_timestamp_idx (cost=0.00..41717.54 rows=1539618 width=0) (actual time=0.070..0.070 rows=407 loops=1)
Index Cond: (internal_timestamp > o.input_last_update_timestamp)
Buffers: shared hit=5
Planning time: 0.131 ms
Execution time: 0.631 ms
Needless to say, I ran ANALYZE on the staging table, and I have set autovacuum/autoanalyze accordingly.
So what will it take for the planner to use the index on 'internal_timestamp' on the staging table?
UPDATE 1
Before trying what @Laurenz suggested below, I was curious whether a CTE or a scalar function would do the trick.
But unfortunately the optimizer wouldn't use the index in either case.
CTE
WITH x AS (
SELECT o.input_last_update_timestamp
FROM control_staging_scm.control_table o
WHERE o.id = 21
)
SELECT
p.id,
p.internal_timestamp
FROM
staging_scm.stg_table1 p
WHERE
p.internal_timestamp > (SELECT x.input_last_update_timestamp FROM x)
SCALAR FUNCTION
CREATE OR REPLACE FUNCTION control_staging_scm.last_update_timestamp(_table_id integer)
RETURNS timestamp without time zone
AS $function$
SELECT o.input_last_update_timestamp FROM control_staging_scm.control_table o WHERE o.id = $1;
$function$ LANGUAGE 'sql';
SELECT
p.id,
p.internal_timestamp
FROM
staging_scm.stg_table1 p
WHERE
p.internal_timestamp > (SELECT control_staging_scm.last_update_timestamp(21))
I was expecting/hoping that the value (timestamp) would be calculated and be available to the optimizer before the execution of the main query.
It would be nice if someone pointed out what the internal behaviour of the optimizer is for the above cases!
The optimizer knows quite well that there will only be one matching row from control_table, but it cannot predict what value the input_last_update_timestamp column will have (that is only known at query execution time), so it has no good way of knowing how many result rows from stg_table1 it should expect.
Lacking this knowledge, it falls back to estimating that one third of the rows will be selected, which is best done with a sequential scan.
You can improve that by splitting the query into two parts:
SELECT o.input_last_update_timestamp
FROM control_staging_scm.control_table o
WHERE o.id = 21;
SELECT p.id, p.internal_timestamp
FROM staging_scm.stg_table1 p
WHERE p.internal_timestamp > <result from first query>;
Then the actual value will be known when the second query is planned, and PostgreSQL will choose the index scan if only a few rows match the condition.
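If the two statements have to run as one script, one way to pass the value along is with psql variables (a sketch, assuming psql is the client):
-- fetch the watermark into a psql variable
SELECT o.input_last_update_timestamp AS last_ts
FROM control_staging_scm.control_table o
WHERE o.id = 21 \gset
-- the literal is now known at plan time, so the index scan can be chosen
SELECT p.id, p.internal_timestamp
FROM staging_scm.stg_table1 p
WHERE p.internal_timestamp > :'last_ts';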
SOLUTION
As suggested by @Laurenz, I tried to separate the two queries and use the result of the first one as a parameter in the second query.
I did it using a plpgsql function that returns a table:
CREATE OR REPLACE FUNCTION control_staging_scm.update_stgtable1_delta_mat_view()
RETURNS TABLE (
trade_id int4
, internal_timestamp timestamp
)
AS $function$
DECLARE
last_update_timestamp_temp_var timestamp WITHOUT time ZONE;
BEGIN
SELECT input_last_update_timestamp into last_update_timestamp_temp_var
FROM control_staging_scm.control_table
WHERE id=21;
RETURN QUERY
SELECT p.trade_id AS tr_id,
p.internal_timestamp AS intr_timestamp
FROM staging_scm.stg_table1 p
WHERE p.internal_timestamp > last_update_timestamp_temp_var;
END;
$function$ LANGUAGE plpgsql;
SELECT * FROM control_staging_scm.update_stgtable1_delta_mat_view()
Explain Plan
Function Scan on update_stgtable1_delta_mat_view (cost=0.05..3.05 rows=1000 width=640) (actual time=0.828..0.847 rows=321 loops=1)
Planning time: 0.049 ms
Execution time: 0.888 ms
Finally the optimizer chose to use the index (see the seqscan-off query plan in the question above).
So we ended up with a ~2000x faster query, not bad at all :)
Of course if you can suggest a better answer please feel free to do so!

Avoiding external disk sort for aggregate query

We have a table that contains raw analytics (like Google Analytics and similar) numbers for views on our videos. It contains numbers like raw views, downloads, loads, etc. Each video is identified by a video_id.
Data is recorded per-day, but because we need to extract on a number of metrics each day can contain multiple records for a specific video_id. Example:
date | video_id | country | source | downloads | etc...
----------------------------------------------------------------
2014-01-02 | 1 | us | facebook | 10 |
2014-01-02 | 1 | dk | facebook | 13 |
2014-01-02 | 1 | dk | admin | 20 |
I have a query where I need to get aggregate data for all videos that have new data beyond a certain date. To get the video IDs I do this query: SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id (alternatively I could do a DISTINCT(video_id) without a GROUP BY; performance is identical).
Once I have these IDs I need the total aggregate data (for all time). Combined, this turns into the following query:
SELECT
video_id,
SUM(downloads),
SUM(loads),
<more SUMs>
FROM
table
WHERE
video_id IN (SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id)
GROUP BY
video_id
There are around 10 columns we SUM (5-10 depending on the query). EXPLAIN ANALYZE gives the following:
GroupAggregate  (cost=2370840.59..2475948.90 rows=42537 width=72) (actual time=153790.362..162668.962 rows=87661 loops=1)
  ->  Sort  (cost=2370840.59..2378295.16 rows=2981826 width=72) (actual time=153790.329..155833.770 rows=3285001 loops=1)
        Sort Key: table.video_id
        Sort Method: external merge  Disk: 263528kB
        ->  Hash Join  (cost=57066.94..1683266.53 rows=2981826 width=72) (actual time=740.210..143814.921 rows=3285001 loops=1)
              Hash Cond: (table.video_id = table.video_id)
              ->  Seq Scan on table  (cost=0.00..1550549.52 rows=5963652 width=72) (actual time=1.768..47613.953 rows=5963652 loops=1)
              ->  Hash  (cost=56924.17..56924.17 rows=11422 width=8) (actual time=734.881..734.881 rows=87661 loops=1)
                    Buckets: 2048  Batches: 4 (originally 1)  Memory Usage: 1025kB
                    ->  HashAggregate  (cost=56695.73..56809.95 rows=11422 width=8) (actual time=693.769..715.665 rows=87661 loops=1)
                          ->  Index Only Scan using table_recent_ids on table  (cost=0.00..52692.41 rows=1601328 width=8) (actual time=1.279..314.249 rows=1614339 loops=1)
                                Index Cond: (date >= '2014-01-01'::date)
                                Heap Fetches: 0
Total runtime: 162693.367 ms
As you can see, it's using a (quite big) external disk merge sort and taking a long time. I am unsure of why the sorts are triggered in the first place, and I am looking for a way to avoid it or at least minimize it. I know increasing work_mem can alleviate external disk merges, but in this case it seems to be excessive and having a work_mem above 500MB seems like a bad idea.
The table has two (relevant) indexes: One on video_id alone and another on (date, video_id).
EDIT: Updated query after running ANALYZE table.
Edited to match the revised query plan.
You are getting a sort because Postgres needs to sort the result rows to group them.
This query looks like it could really benefit from an index on table(video_id, date), or even just an index on table(video_id). Having such an index would likely avoid the need to sort.
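A sketch of that suggestion (the table is literally called table in the question, so it needs quoting; the index name is made up):
-- composite index matching the GROUP BY column and the date filter
CREATE INDEX idx_video_stats_video_id_date ON "table" (video_id, date);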
Edited (#2) to suggest
You could also consider testing an alternative query such as this:
SELECT
video_id,
MAX(date) as latest_date,
<SUMs>
FROM
table
GROUP BY
video_id
HAVING
MAX(date) >= '2014-01-01'
That avoids any join or subquery, and given an index on table(video_id [, other columns]) it can be hoped that the sort will be avoided as well. It will compute the sums over the whole base table before filtering out the groups you don't want, but that operation is O(n), whereas sorting is O(m log m). Thus, if the date criterion is not very selective then checking it after the fact may be an improvement.

IN vs OR in the SQL WHERE clause

When dealing with big databases, which performs better: IN or OR in the SQL WHERE clause?
Is there any difference about the way they are executed?
I assume you want to know the performance difference between the following:
WHERE foo IN ('a', 'b', 'c')
WHERE foo = 'a' OR foo = 'b' OR foo = 'c'
According to the MySQL manual, if the values are constant, IN sorts the list and then uses a binary search. I would imagine that OR evaluates them one by one in no particular order. So IN is faster in some circumstances.
The best way to know is to profile both on your database with your specific data to see which is faster.
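For example, a quick way to compare the two forms (PostgreSQL syntax shown; the table and column names are borrowed from the MySQL test below):
EXPLAIN ANALYZE SELECT COUNT(*) FROM t_inner WHERE val IN (1000, 2000, 3000);
EXPLAIN ANALYZE SELECT COUNT(*) FROM t_inner WHERE val = 1000 OR val = 2000 OR val = 3000;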
I tried both on a MySQL database with 1,000,000 rows. When the column is indexed there is no discernible difference in performance; both are nearly instant. When the column is not indexed I got these results:
SELECT COUNT(*) FROM t_inner WHERE val IN (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);
1 row fetched in 0.0032 (1.2679 seconds)
SELECT COUNT(*) FROM t_inner WHERE val = 1000 OR val = 2000 OR val = 3000 OR val = 4000 OR val = 5000 OR val = 6000 OR val = 7000 OR val = 8000 OR val = 9000;
1 row fetched in 0.0026 (1.7385 seconds)
So in this case the method using OR is about 30% slower. Adding more terms makes the difference larger. Results may vary on other databases and on other data.
The best way to find out is looking at the Execution Plan.
I tried it with Oracle, and it was exactly the same.
CREATE TABLE performance_test AS ( SELECT * FROM dba_objects );
SELECT * FROM performance_test
WHERE object_name IN ('DBMS_STANDARD', 'DBMS_REGISTRY', 'DBMS_LOB' );
Even though the query uses IN, the Execution Plan says that it uses OR:
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 | 1416 | 163 (2)| 00:00:02 |
|* 1 | TABLE ACCESS FULL| PERFORMANCE_TEST | 8 | 1416 | 163 (2)| 00:00:02 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("OBJECT_NAME"='DBMS_LOB' OR "OBJECT_NAME"='DBMS_REGISTRY' OR
"OBJECT_NAME"='DBMS_STANDARD')
The OR operator needs a much more complex evaluation process than the IN construct, because it allows many kinds of conditions, not only equality as IN does.
Here is a list of what you can use with OR that is not compatible with IN: greater, greater or equal, less, less or equal, LIKE, and a few more such as Oracle's REGEXP_LIKE.
In addition, consider that the conditions may not always compare the same value.
For the query optimizer it is easier to manage the IN operator, because it is just a construct that defines the OR operator over multiple equality conditions on the same value. If you use the OR operator, the optimizer may not recognize that you are always using the = operator on the same value, and unless it performs a deeper and more complex analysis it may fail to prove that all the involved conditions are equality tests on the same value, thereby precluding optimized search methods like the binary search already mentioned.
[EDIT]
A given optimizer may not implement an optimized IN evaluation today, but that does not rule out that it could happen at some point (for example with a database version upgrade). If you use the OR operator, such an optimized evaluation will not be used in your case.
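To make that point concrete, here is a hypothetical WHERE clause that OR can express but IN cannot (table and column names are invented for illustration):
SELECT *
FROM products
WHERE price >= 100
   OR name LIKE 'Pro%'
   OR category = 'legacy';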
I think Oracle is smart enough to convert the less efficient one (whichever that is) into the other, so I think the answer should rather depend on the readability of each (where I think IN clearly wins).
OR makes sense (from a readability point of view) when there are fewer values to be compared.
IN is useful especially when you have a dynamic source with which you want values to be compared.
Another alternative is to use a JOIN with a temporary table.
I don't think performance should be a problem, provided you have necessary indexes.
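A minimal sketch of that JOIN alternative, using a VALUES list in place of a real temporary table (PostgreSQL syntax; t_inner and val are borrowed from the earlier test):
SELECT COUNT(*)
FROM t_inner t
JOIN (VALUES (1000), (2000), (3000)) AS v(val) ON t.val = v.val;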
I'll add info for PostgreSQL version 11.8 (released 2020-05-14).
IN may be significantly faster. E.g. table with ~23M rows.
Query with OR:
explain analyse select sum(mnozstvi_rozdil)
from product_erecept
where okres_nazev = 'Brno-město' or okres_nazev = 'Pardubice';
-- execution plan
Finalize Aggregate  (cost=725977.36..725977.37 rows=1 width=32) (actual time=4536.796..4540.748 rows=1 loops=1)
  ->  Gather  (cost=725977.14..725977.35 rows=2 width=32) (actual time=4535.010..4540.732 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=724977.14..724977.15 rows=1 width=32) (actual time=4519.338..4519.339 rows=1 loops=3)
              ->  Parallel Bitmap Heap Scan on product_erecept  (cost=15589.71..724264.41 rows=285089 width=4) (actual time=135.832..4410.525 rows=230706 loops=3)
                    Recheck Cond: (((okres_nazev)::text = 'Brno-město'::text) OR ((okres_nazev)::text = 'Pardubice'::text))
                    Rows Removed by Index Recheck: 3857398
                    Heap Blocks: exact=11840 lossy=142202
                    ->  BitmapOr  (cost=15589.71..15589.71 rows=689131 width=0) (actual time=140.985..140.986 rows=0 loops=1)
                          ->  Bitmap Index Scan on product_erecept_x_okres_nazev  (cost=0.00..8797.61 rows=397606 width=0) (actual time=99.371..99.371 rows=397949 loops=1)
                                Index Cond: ((okres_nazev)::text = 'Brno-město'::text)
                          ->  Bitmap Index Scan on product_erecept_x_okres_nazev  (cost=0.00..6450.00 rows=291525 width=0) (actual time=41.612..41.612 rows=294170 loops=1)
                                Index Cond: ((okres_nazev)::text = 'Pardubice'::text)
Planning Time: 0.162 ms
Execution Time: 4540.829 ms
Query with IN:
explain analyse select sum(mnozstvi_rozdil)
from product_erecept
where okres_nazev in ('Brno-město', 'Pardubice');
-- execution plan
Aggregate (cost=593199.90..593199.91 rows=1 width=32) (actual time=855.706..855.707 rows=1 loops=1)
-> Index Scan using product_erecept_x_okres_nazev on product_erecept (cost=0.56..591477.07 rows=689131 width=4) (actual time=1.326..645.597 rows=692119 loops=1)
Index Cond: ((okres_nazev)::text = ANY ('{Brno-město,Pardubice}'::text[]))
Planning Time: 0.136 ms
Execution Time: 855.743 ms
Even though you use the IN operator, MS SQL Server will automatically convert it to OR operators. You can see this if you analyze the execution plans. So it is better to use OR if it is a long IN list; it will at least save some nanoseconds of the operation.
I ran a SQL query with a large number of OR conditions (350). Postgres did it in 437.80 ms.
Now using IN:
23.18 ms