Anti-join with PostgreSQL

I have a problem optimizing a query with PostgreSQL 10.4.
For example, when I run
select * from t1 where i not in (select j from t2)
I expect PostgreSQL to use the index on t2.j, but it does not. Here is the plan that I get:
Seq Scan on t1 (cost=169.99..339.99 rows=5000 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..144.99 rows=9999 width=4)
Is PostgreSQL unable to use indexes for an anti-join, or is there something obvious that I'm missing?
The SQL that I used to create the tables:
create table t1(i integer);
insert into t1(i) select s from generate_series(1, 10000) s;
create table t2(j integer);
insert into t2(j) select s from generate_series(1, 9999) s;
create index index_j on t2(j);
I have a similar problem with tables of over 1 million rows, where using table scans just to fetch a few hundred records is very slow.
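For what it's worth, NOT IN (subquery) cannot be planned as an anti-join because of its NULL semantics: a single NULL in t2.j makes the test yield NULL for every non-matching row. A NOT EXISTS rewrite can be planned as an anti-join, and is equivalent here assuming t2.j is never NULL:
select t1.*
from t1
where not exists (select 1 from t2 where t2.j = t1.i);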
Thanks.

Related

Separate PostgreSQL partitions join

I'm using PostgreSQL 10.6. I have several tables partitioned by day, and each day has its own data. I want to select rows from these tables within a day.
drop table IF EXISTS request;
drop table IF EXISTS request_identity;
CREATE TABLE IF NOT EXISTS request (
id bigint not null,
record_date date not null,
payload text not null
) PARTITION BY LIST (record_date);
CREATE TABLE IF NOT EXISTS request_p1 PARTITION OF request FOR VALUES IN ('2001-01-01');
CREATE TABLE IF NOT EXISTS request_p2 PARTITION OF request FOR VALUES IN ('2001-01-02');
CREATE INDEX IF NOT EXISTS i_request_p1_id ON request_p1 (id);
CREATE INDEX IF NOT EXISTS i_request_p2_id ON request_p2 (id);
do $$
begin
for i in 1..100000 loop
INSERT INTO request (id,record_date,payload) values (i, '2001-01-01', 'abc');
end loop;
for i in 100001..200000 loop
INSERT INTO request (id,record_date,payload) values (i, '2001-01-02', 'abc');
end loop;
end;
$$;
CREATE TABLE IF NOT EXISTS request_identity (
record_date date not null,
parent_id bigint NOT NULL,
identity_name varchar(32),
identity_value varchar(32)
) PARTITION BY LIST (record_date);
CREATE TABLE IF NOT EXISTS request_identity_p1 PARTITION OF request_identity FOR VALUES IN ('2001-01-01');
CREATE TABLE IF NOT EXISTS request_identity_p2 PARTITION OF request_identity FOR VALUES IN ('2001-01-02');
CREATE INDEX IF NOT EXISTS i_request_identity_p1_payload ON request_identity_p1 (identity_name, identity_value);
CREATE INDEX IF NOT EXISTS i_request_identity_p2_payload ON request_identity_p2 (identity_name, identity_value);
do $$
begin
for i in 1..100000 loop
INSERT INTO request_identity (parent_id,record_date,identity_name,identity_value) values (i, '2001-01-01', 'NAME', 'somename'||i);
end loop;
for i in 100001..200000 loop
INSERT INTO request_identity (parent_id,record_date,identity_name,identity_value) values (i, '2001-01-02', 'NAME', 'somename'||i);
end loop;
end;
$$;
analyze request;
analyze request_identity;
When I select within a single day, I get a good query plan:
explain analyze select *
from request
where record_date between '2001-01-01' and '2001-01-01'
and exists (select * from request_identity where parent_id = id and identity_name = 'NAME' and identity_value = 'somename555' and record_date between '2001-01-01' and '2001-01-01')
limit 100;
Limit (cost=8.74..16.78 rows=1 width=16)
-> Nested Loop (cost=8.74..16.78 rows=1 width=16)
-> HashAggregate (cost=8.45..8.46 rows=1 width=8)
Group Key: request_identity_p1.parent_id
-> Append (cost=0.42..8.44 rows=1 width=8)
-> Index Scan using i_request_identity_p1_payload on request_identity_p1 (cost=0.42..8.44 rows=1 width=8)
Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename555'::text))
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-01'::date))
-> Append (cost=0.29..8.32 rows=1 width=16)
-> Index Scan using i_request_p1_id on request_p1 (cost=0.29..8.32 rows=1 width=16)
Index Cond: (id = request_identity_p1.parent_id)
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-01'::date))
But if I select across 2 days or more, PostgreSQL first appends the rows of all partitions of request_identity and all partitions of request, and only then joins them.
So this is the SQL that does not work the way I want:
explain analyze select *
from request
where record_date between '2001-01-01' and '2001-01-02'
and exists (select * from request_identity where parent_id = id and identity_name = 'NAME' and identity_value = 'somename1777' and record_date between '2001-01-01' and '2001-01-02')
limit 100;
Limit (cost=17.19..50.21 rows=2 width=16)
-> Nested Loop (cost=17.19..50.21 rows=2 width=16)
-> Unique (cost=16.90..16.91 rows=2 width=8)
-> Sort (cost=16.90..16.90 rows=2 width=8)
Sort Key: request_identity_p1.parent_id
-> Append (cost=0.42..16.89 rows=2 width=8)
-> Index Scan using i_request_identity_p1_payload on request_identity_p1 (cost=0.42..8.44 rows=1 width=8)
Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename1777'::text))
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
-> Index Scan using i_request_identity_p2_payload on request_identity_p2 (cost=0.42..8.44 rows=1 width=8)
Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename1777'::text))
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
-> Append (cost=0.29..16.63 rows=2 width=16)
-> Index Scan using i_request_p1_id on request_p1 (cost=0.29..8.32 rows=1 width=16)
Index Cond: (id = request_identity_p1.parent_id)
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
-> Index Scan using i_request_p2_id on request_p2 (cost=0.29..8.32 rows=1 width=16)
Index Cond: (id = request_identity_p1.parent_id)
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
In my case it makes no sense to nested-loop join these appends, since matching rows can only exist within the same day's partition group.
The desired result for me is that PostgreSQL first joins request_p1 to request_identity_p1 and request_p2 to request_identity_p2, and only then appends the results.
The question is:
Is there a way to perform the joins between partitions separately, within each one-day partition group?
Thanks.
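For reference, PostgreSQL 11 added partitionwise joins, which do exactly this; a minimal sketch assuming an upgrade, with the caveat that the planner can only match partitions when the join condition also correlates the partition keys:
-- Partitionwise join is off by default in PostgreSQL 11+.
set enable_partitionwise_join = on;
explain analyze select *
from request r
where r.record_date between '2001-01-01' and '2001-01-02'
  and exists (select 1
              from request_identity ri
              where ri.parent_id = r.id
                and ri.record_date = r.record_date  -- correlate the partition key
                and ri.identity_name = 'NAME'
                and ri.identity_value = 'somename1777')
limit 100;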

PostgreSQL: partition into a fixed set of files by ID

Apache Spark has the option to split into multiple files with the bucketBy command. For example if I have 100 million user IDs, I can split the table into 32 different files, where some type of hashing algorithm is used to distribute and lookup the data between files.
Can Postgres split tables into a fixed number of partitions somehow? If it's not a native feature, can it still be accomplished, for example by generating a hash, turning the hash into a number, and taking modulo 32 as the partition range?
An example with modulo, starting with a short partition setup:
db=# create table p(i int);
CREATE TABLE
db=# create table p1 ( check (mod(i,3)=0) ) inherits (p);
CREATE TABLE
db=# create table p2 ( check (mod(i,3)=1) ) inherits (p);
CREATE TABLE
db=# create table p3 ( check (mod(i,3)=2) ) inherits (p);
CREATE TABLE
db=# create rule pir3 AS ON insert to p where mod(i,3) = 2 do instead insert into p3 values (new.*);
CREATE RULE
db=# create rule pir2 AS ON insert to p where mod(i,3) = 1 do instead insert into p2 values (new.*);
CREATE RULE
db=# create rule pir1 AS ON insert to p where mod(i,3) = 0 do instead insert into p1 values (new.*);
CREATE RULE
Checking:
db=# insert into p values (1),(2),(3),(4),(5);
INSERT 0 0
db=# select * from p;
i
---
3
1
4
2
5
(5 rows)
db=# select * from p1;
i
---
3
(1 row)
db=# select * from p2;
i
---
1
4
(2 rows)
db=# select * from p3;
i
---
2
5
(2 rows)
https://www.postgresql.org/docs/current/static/tutorial-inheritance.html
https://www.postgresql.org/docs/current/static/ddl-partitioning.html
And a demo of constraint exclusion at work:
db=# explain analyze select * from p where mod(i,3) = 2;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Append (cost=0.00..48.25 rows=14 width=4) (actual time=0.013..0.015 rows=2 loops=1)
-> Seq Scan on p (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (mod(i, 3) = 2)
-> Seq Scan on p3 (cost=0.00..48.25 rows=13 width=4) (actual time=0.009..0.011 rows=2 loops=1)
Filter: (mod(i, 3) = 2)
Planning time: 0.203 ms
Execution time: 0.052 ms
(7 rows)
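For completeness, PostgreSQL 11 and later support hash partitioning natively, which covers the fixed-bucket case without inheritance, rules, or CHECK constraints; a minimal sketch of the same three-way split:
-- Declarative hash partitioning (PostgreSQL 11+); rows are routed
-- automatically by hashing i, and SELECTs on p see all partitions.
create table p (i int) partition by hash (i);
create table p0 partition of p for values with (modulus 3, remainder 0);
create table p1 partition of p for values with (modulus 3, remainder 1);
create table p2 partition of p for values with (modulus 3, remainder 2);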

Acquiring row-level locks in order

I have a table where I am updating multiple rows inside a transaction.
DROP SCHEMA IF EXISTS s CASCADE;
CREATE SCHEMA s;
CREATE TABLE s.t1 (
"id1" Bigint,
"id2" Bigint,
CONSTRAINT "pk1" PRIMARY KEY (id1)
)
WITH(OIDS=FALSE);
INSERT INTO s.t1( id1, id2 )
SELECT x, x * 100
FROM generate_series( 1,10 ) x;
END TRANSACTION;
BEGIN TRANSACTION;
SELECT id1 FROM s.t1 WHERE id1 > 3 and id1 < 6 ORDER BY id1 FOR UPDATE; /* row lock */
I am assuming this will take the row-level locks in order of id1.
Is my assumption correct?
If so, I can run multiple transactions without ever worrying about deadlocks caused by the order in which row locks are taken.
END TRANSACTION;
BEGIN TRANSACTION;
SELECT id1,id2 FROM s.t1 order by id1;
DROP SCHEMA s CASCADE;
I did an EXPLAIN:
EXPLAIN SELECT id1 FROM s.t1 WHERE id1 > 3 and id1 < 6 ORDER BY id1 FOR UPDATE;
QUERY PLAN
------------------------------------------------------------------------------
LockRows (cost=15.05..15.16 rows=9 width=14)
-> Sort (cost=15.05..15.07 rows=9 width=14)
Sort Key: id1
-> Bitmap Heap Scan on t1 (cost=4.34..14.91 rows=9 width=14)
Recheck Cond: ((id1 > 3) AND (id1 < 6))
-> Bitmap Index Scan on pk1 (cost=0.00..4.34 rows=9 width=0)
Index Cond: ((id1 > 3) AND (id1 < 6))
(7 rows)
Answer: This is correct. The LockRows node sits above the Sort in the plan, so rows are locked in id1 order after sorting.
Thanks
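A sketch of the resulting pattern in two concurrent sessions, using the table from above:
-- Both session A and session B run:
BEGIN TRANSACTION;
SELECT id1 FROM s.t1 WHERE id1 > 3 AND id1 < 6 ORDER BY id1 FOR UPDATE;
-- ... update the locked rows ...
END TRANSACTION;
-- Because both sessions acquire the row locks in ascending id1 order,
-- the slower session simply blocks on the first row it cannot get
-- (id1 = 4); the lock orders never cross, so no deadlock is possible.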

Check if records exist in a Postgres table

I have to read a CSV every 20 seconds. Each CSV contains a minimum of 500 to a maximum of 60000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of duplicates. The field checked for uniqueness is also indexed.
So, I read the file in chunks and use the IN clause to get the items already in the database.
Is there a better way of doing it?
This should perform well:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
LEFT JOIN tbl USING (tbl_id)
WHERE tbl.tbl_id IS NULL;
DROP TABLE tmp; -- else dropped at end of session automatically
Closely related to this answer.
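For reference, on PostgreSQL 9.5 or later you can skip the join entirely with ON CONFLICT; a sketch assuming tbl_id has a unique constraint:
INSERT INTO tbl
SELECT * FROM tmp
ON CONFLICT (tbl_id) DO NOTHING;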
First, just for completeness, I changed Erwin's code to use EXCEPT:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
EXCEPT
SELECT *
FROM tbl;
DROP TABLE tmp;
Then I decided to test it myself. I tested it on 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the origin table 30 thousand, of which 15 thousand already exist in the target table.
create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;
I asked for the EXPLAIN of the SELECT part only. The EXCEPT version:
explain
select *
from tmp
except
select *
from tbl
;
QUERY PLAN
----------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..270098.68 rows=200 width=4)
-> Append (cost=0.00..245018.94 rows=10031897 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..771.40 rows=31920 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..244247.54 rows=9999977 width=4)
-> Seq Scan on tbl (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)
The outer join version:
explain
select *
from
tmp
left join
tbl using (id)
where tbl.id is null
;
QUERY PLAN
--------------------------------------------------------------------------
Nested Loop Anti Join (cost=0.00..208142.58 rows=15960 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Index Scan using tbl_pkey on tbl (cost=0.00..7.80 rows=1 width=4)
Index Cond: (tmp.id = id)
(4 rows)

How to efficiently select rows having a MIN date in Postgres

I need to quickly select a value (baz) from the "earliest" (MIN(save_date)) rows grouped by their foo_id. The following query returns the correct rows (well, almost: it can return multiple rows per foo_id when there are duplicate save_dates).
The foos table contains about 55k rows and the samples table contains about 25 million rows.
CREATE TABLE foos (
foo_id int,
val varchar(40),
-- ref_id is a FK, constraint omitted for brevity
ref_id int
);
CREATE TABLE samples (
sample_id int,
save_date date,
baz smallint,
-- foo_id is a FK, constraint omitted for brevity
foo_id int
);
WITH foo ( foo_id, val ) AS (
SELECT foo_id, val FROM foos
WHERE foos.ref_id = 1
ORDER BY foos.val ASC
LIMIT 25 OFFSET 0
)
SELECT foo.val, firsts.baz
FROM foo
LEFT JOIN (
SELECT A.baz, A.foo_id
FROM samples A
INNER JOIN (
SELECT foo_id, MIN( save_date ) AS save_date
FROM samples
GROUP BY foo_id
) B
USING ( foo_id, save_date )
) firsts USING ( foo_id )
This query currently takes over 100 seconds; I'd like to see this return in ~1 second (or less!).
How can I write this query to be optimal?
Update: adding EXPLAIN output.
Obviously the actual query I'm using doesn't use tables named foo, baz, etc.
The EXPLAIN for the "dumbed down" example query above:
Hash Right Join (cost=337.69..635.47 rows=3 width=100)
Hash Cond: (a.foo_id = foo.foo_id)
CTE foo
-> Limit (cost=71.52..71.53 rows=3 width=102)
-> Sort (cost=71.52..71.53 rows=3 width=102)
Sort Key: foos.val
-> Seq Scan on foos (cost=0.00..71.50 rows=3 width=102)
Filter: (ref_id = 1)
-> Hash Join (cost=265.25..562.90 rows=9 width=6)
Hash Cond: ((a.foo_id = samples.foo_id) AND (a.save_date = (min(samples.save_date))))
-> Seq Scan on samples a (cost=0.00..195.00 rows=1850 width=10)
-> Hash (cost=244.25..244.25 rows=200 width=8)
-> HashAggregate (cost=204.25..224.25 rows=200 width=8)
-> Seq Scan on samples (cost=0.00..195.00 rows=1850 width=8)
-> Hash (cost=0.60..0.60 rows=3 width=102)
-> CTE Scan on foo (cost=0.00..0.60 rows=3 width=102)
If I understand the question, you want windowing.
WITH find_first AS (
SELECT foo_id, baz,
row_number()
OVER (PARTITION BY foo_id ORDER BY foo_id, save_date) AS rnum
FROM samples
)
SELECT foo_id, baz FROM find_first WHERE rnum = 1;
Using row_number instead of rank eliminates duplicates and guarantees only one baz per foo. If you also need foos that have no samples at all, just LEFT JOIN the foos table to this query.
With an index on (foo_id, save_date), the optimizer should be smart enough to do the grouping keeping only one baz and skipping merrily along.
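Spelled out, that index would be (the name is arbitrary):
CREATE INDEX samples_foo_id_save_date_idx ON samples (foo_id, save_date);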
row_number() is a beautiful beast, but DISTINCT ON is simpler here.
WITH foo AS (
SELECT foo_id
FROM foos
WHERE ref_id = 1
ORDER BY val
LIMIT 25 OFFSET 0
)
SELECT DISTINCT ON (1) f.foo_id, s.baz
FROM foo f
LEFT JOIN samples s USING (foo_id)
ORDER BY f.foo_id, s.save_date, s.baz;
This assumes you want exactly one row per foo_id. If multiple rows in samples share the same earliest save_date, baz serves as the tie-breaker.
The case is very similar to this question from yesterday.
More advice:
Don't select val in the CTE; you only need it in ORDER BY.
To avoid expensive sequential scans on foos:
If you are always after rows from foos with ref_id = 1, create a partial multi-column index:
CREATE INDEX foos_val_part_idx ON foos (val)
WHERE ref_id = 1;
If ref_id is variable:
CREATE INDEX foos_ref_id_val_idx ON foos (ref_id, val);
The other index that would help best on samples:
CREATE INDEX samples_foo_id_save_date_baz_idx
ON samples (foo_id, save_date, baz);
These indexes become even more effective with the new "index-only scans" in version 9.2. Details and links here.
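On 9.3 or later, a LATERAL join is another way to fetch exactly one earliest row per foo; a sketch reusing the tables and the (foo_id, save_date, baz) index from above:
SELECT f.val, s.baz
FROM (
   SELECT foo_id, val
   FROM foos
   WHERE ref_id = 1
   ORDER BY val
   LIMIT 25 OFFSET 0
) f
LEFT JOIN LATERAL (
   SELECT baz
   FROM samples
   WHERE samples.foo_id = f.foo_id
   ORDER BY save_date, baz  -- one index descent per foo
   LIMIT 1
) s ON true;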