I have a problem optimizing a query with PostgreSQL 10.4.
For example, when I run
select * from t1 where i not in (select j from t2)
I expect pg to use the index on t2.j, but it does not. Here is the plan that I get:
Seq Scan on t1 (cost=169.99..339.99 rows=5000 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..144.99 rows=9999 width=4)
Is pg not able to use indexes for an anti join, or is there something obvious that I am missing?
The SQL that I used to create the tables :
create table t1(i integer);
insert into t1(i) select s from generate_series(1, 10000) s;
create table t2(j integer);
insert into t2(j) select s from generate_series(1, 9999) s;
create index index_j on t2(j);
I have a similar problem with tables of over 1 million rows, and using table scans just to fetch a few hundred records is very slow...
thanks,
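A rewrite that is often suggested for this pattern is NOT EXISTS. NOT IN must return no rows at all if the subquery yields a NULL, so the planner does not turn it into an anti join, whereas NOT EXISTS can be planned as one. A minimal sketch using the tables above; the two forms are only equivalent when neither t1.i nor t2.j contains NULLs, which is an assumption here because the columns are not declared NOT NULL:

```sql
-- Anti join written with NOT EXISTS. Equivalent to the NOT IN query only if
-- t1.i and t2.j contain no NULLs (assumed; the DDL above does not enforce it).
EXPLAIN
SELECT *
FROM t1
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.j = t1.i);
```

Whether the resulting anti join is a hash, merge, or nested-loop variant that uses index_j depends on table sizes and statistics.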
I have two identical tables, one with 10k rows and the other with 1M rows. I use the following script to populate them.
CREATE TABLE Table1 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
CREATE TABLE Table2 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
INSERT
INTO Table1
WITH t1 AS
(
SELECT id
FROM generate_series(1, 10000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
INSERT
INTO Table2
WITH t1 AS
(
SELECT id
FROM generate_series(1, 1000000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
I also created a secondary index on Table2:
CREATE INDEX ix_Table2_groupby_orderby ON Table2 (groupby, orderby);
Now, I have the following query
select b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding
from Table2 b
join Table1 a on b.orderby = a.id
where a.global_search = 1 and b.groupby < 10;
which leads to the following query plan using explain(analyze)
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.056..34.722 rows=100 loops=1)"
" -> Seq Scan on table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.033..1.313 rows=10 loops=1)"
" Filter: (global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.337 rows=10 loops=10)"
" Index Cond: ((groupby < 10) AND (orderby = a.id))"
"Planning time: 0.296 ms"
"Execution time: 34.775 ms"
My question is: why does the query plan not show any access to Table2 itself? It only uses ix_table2_groupby_orderby, but that index contains just the groupby and orderby columns, and maybe id. How does it get the remaining columns of Table2, and why is that step not in the query plan?
** EDIT **
I have tried explain(verbose), as suggested by #laurenzalbe. This is the result:
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.070..35.678 rows=100 loops=1)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" -> Seq Scan on public.table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.031..1.642 rows=10 loops=1)"
" Output: a.id, a.groupby, a.orderby, a.local_search, a.global_search, a.padding"
" Filter: (a.global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on public.table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.398 rows=10 loops=10)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" Index Cond: ((b.groupby < 10) AND (b.orderby = a.id))"
"Planning time: 16.201 ms"
"Execution time: 35.754 ms"
Actually, I do not fully understand why the access to the heap of table2 is not there, but I accept it as an answer.
An index scan in PostgreSQL accesses not only the index, but also the table. This is not explicitly shown in the execution plan and is necessary to find out if a row is visible to the transaction or not.
Try EXPLAIN (VERBOSE) to see what columns are returned.
See the documentation for details:
All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap.
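For contrast, one can look at a query that asks only for columns stored in the index. A sketch, assuming the Table2 and ix_Table2_groupby_orderby definitions from the question; the planner is free to pick a different plan:

```sql
-- Refresh statistics and the visibility map so an index-only scan becomes attractive.
VACUUM ANALYZE Table2;

-- Only groupby and orderby are requested, and both are stored in the index.
EXPLAIN (ANALYZE)
SELECT groupby, orderby
FROM Table2
WHERE groupby < 10 AND orderby = 1;
```

If the plan shows an Index Only Scan, its Heap Fetches line reports how often the heap still had to be visited. A plain Index Scan has no such line because it visits the heap for every matching index entry.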
I have a table where I am updating multiple rows inside a transaction.
DROP SCHEMA IF EXISTS s CASCADE;
CREATE SCHEMA s;
CREATE TABLE s.t1 (
"id1" Bigint,
"id2" Bigint,
CONSTRAINT "pk1" PRIMARY KEY (id1)
)
WITH(OIDS=FALSE);
INSERT INTO s.t1( id1, id2 )
SELECT x, x * 100
FROM generate_series( 1,10 ) x;
END TRANSACTION;
BEGIN TRANSACTION;
SELECT id1 FROM s.t1 WHERE id1 > 3 and id1 < 6 ORDER BY id1 FOR UPDATE; /* row lock */
I am assuming this will take row-level locks in id1 order.
Is my assumption correct?
If so, I will be able to run multiple transactions without ever worrying about deadlocks due to the order in which rows are locked.
END TRANSACTION;
BEGIN TRANSACTION;
SELECT id1,id2 FROM s.t1 order by id1;
DROP SCHEMA s CASCADE;
I ran EXPLAIN:
EXPLAIN SELECT id1 FROM s.t1 WHERE id1 > 3 and id1 < 6 ORDER BY id1 FOR UPDATE;
QUERY PLAN
------------------------------------------------------------------------------
LockRows (cost=15.05..15.16 rows=9 width=14)
-> Sort (cost=15.05..15.07 rows=9 width=14)
Sort Key: id1
-> Bitmap Heap Scan on t1 (cost=4.34..14.91 rows=9 width=14)
Recheck Cond: ((id1 > 3) AND (id1 < 6))
-> Bitmap Index Scan on pk1 (cost=0.00..4.34 rows=9 width=0)
Index Cond: ((id1 > 3) AND (id1 < 6))
(7 rows)
Answer: This is correct.
Thanks
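The plan above is consistent with that answer: LockRows sits on top of the Sort node, so rows reach the locking step already ordered by id1. A sketch of how this might be used, assuming every transaction that touches these rows takes its locks the same way (the UPDATE and its values are made up for illustration):

```sql
BEGIN;

-- Lock the target rows in ascending id1 order before modifying them.
SELECT id1
FROM s.t1
WHERE id1 > 3 AND id1 < 6
ORDER BY id1
FOR UPDATE;

-- Hypothetical update; the rows are already locked by this transaction.
UPDATE s.t1
SET id2 = id2 + 1
WHERE id1 > 3 AND id1 < 6;

COMMIT;
```

A second session running the same block should wait on the first locked row rather than acquire locks in a conflicting order.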
Given this partial index:
CREATE INDEX orders_id_created_at_index
ON orders(id) WHERE created_at < '2013-12-31';
Would this query use the index?
SELECT *
FROM orders
WHERE id = 123 AND created_at = '2013-10-12';
As per the documentation, "a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index".
Does that mean that the index will or will not be used?
You can check and yes, it would be used. I've created an SQL Fiddle to check it with a query like this:
create table orders(id int, created_at date);
CREATE INDEX orders_id_created_at_index ON orders(id) WHERE created_at < '2013-12-31';
insert into orders
select
(random()*500)::int, '2013-01-01'::date + ((random() * 200)::int || ' day')::interval
from generate_series(1, 10000) as g;
SELECT * FROM orders WHERE id = 123 AND created_at = '2013-10-12';
SELECT * FROM orders WHERE id = 123 AND created_at = '2014-10-12';
If you check execution plans for these queries, you'll see for first query:
Bitmap Heap Scan on orders (cost=4.39..40.06 rows=1 width=8)
Recheck Cond: ((id = 123) AND (created_at < '2013-12-31'::date))
Filter: (created_at = '2013-10-12'::date)
-> Bitmap Index Scan on orders_id_created_at_index (cost=0.00..4.39 rows=19 width=0)
Index Cond: (id = 123)
and for second query:
Seq Scan on orders (cost=0.00..195.00 rows=1 width=8)
Filter: ((id = 123) AND (created_at = '2014-10-12'::date))
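The same implication check applies to range conditions. A sketch against the orders table above (the extra dates are arbitrary examples); the planner may still prefer a sequential scan on cost grounds even when the index is usable:

```sql
-- created_at < '2013-06-01' implies created_at < '2013-12-31',
-- so the partial index qualifies.
EXPLAIN SELECT * FROM orders WHERE id = 123 AND created_at < '2013-06-01';

-- created_at < '2014-06-01' does not imply the index predicate,
-- so the partial index cannot be used.
EXPLAIN SELECT * FROM orders WHERE id = 123 AND created_at < '2014-06-01';
```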
I have to read a CSV every 20 seconds. Each CSV contains between 500 and 60000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of getting duplicate items. The field to check for uniqueness is also indexed.
So, I read the file in chunks and use the IN clause to get the items already in the database.
Is there a better way of doing it?
This should perform well:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
LEFT JOIN tbl USING (tbl_id)
WHERE tbl.tbl_id IS NULL;
DROP TABLE tmp; -- else dropped at end of session automatically
Closely related to this answer.
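On PostgreSQL 9.5 or later, INSERT ... ON CONFLICT DO NOTHING is another option. A minimal sketch, assuming tbl has a unique index or constraint on tbl_id and that a staging table tmp has been loaded from the CSV as shown earlier:

```sql
-- Assumes a unique constraint or index on tbl.tbl_id. The question says the
-- field is indexed, but not whether that index is unique.
INSERT INTO tbl
SELECT *
FROM tmp
ON CONFLICT (tbl_id) DO NOTHING;
```

Unlike the LEFT JOIN form, this also skips duplicates that occur within the file itself.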
First just for completeness I changed Erwin's code to use except
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
except
select *
from tbl;
DROP TABLE tmp;
Then I decided to test it myself. I tested it in 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the origin table 30 thousand, of which 15 thousand already exist in the target table.
create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;
I asked for the explain of the select part only. The except version:
explain
select *
from tmp
except
select *
from tbl
;
QUERY PLAN
----------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..270098.68 rows=200 width=4)
-> Append (cost=0.00..245018.94 rows=10031897 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..771.40 rows=31920 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..244247.54 rows=9999977 width=4)
-> Seq Scan on tbl (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)
The outer join version:
explain
select *
from
tmp
left join
tbl using (id)
where tbl.id is null
;
QUERY PLAN
--------------------------------------------------------------------------
Nested Loop Anti Join (cost=0.00..208142.58 rows=15960 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Index Scan using tbl_pkey on tbl (cost=0.00..7.80 rows=1 width=4)
Index Cond: (tmp.id = id)
(4 rows)
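These plans show estimated costs only. To compare actual run times, the same statements could be run under EXPLAIN ANALYZE, for example (a sketch using the tbl and tmp tables above; no timings are claimed here):

```sql
-- Gather statistics for the temp table first; autovacuum does not analyze temp tables.
ANALYZE tmp;

-- EXCEPT version
EXPLAIN ANALYZE
SELECT * FROM tmp
EXCEPT
SELECT * FROM tbl;

-- Outer join version
EXPLAIN ANALYZE
SELECT tmp.*
FROM tmp
LEFT JOIN tbl USING (id)
WHERE tbl.id IS NULL;
```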