Check if records exist in a Postgres table - sql

I have to read a CSV every 20 seconds. Each CSV contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of getting duplicate items. The field to check for uniqueness is also indexed.
So, I read the file in chunks and use the IN clause to get the items already in the database.
Is there a better way of doing it?

This should perform well:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
LEFT JOIN tbl USING (tbl_id)
WHERE tbl.tbl_id IS NULL;
DROP TABLE tmp; -- else dropped at end of session automatically
Closely related to this answer.
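Since PostgreSQL 9.5, the duplicate check can be folded into the INSERT itself with ON CONFLICT DO NOTHING, assuming a unique constraint or index on tbl_id, which removes the need for the anti-join. A sketch:

```sql
-- Requires PostgreSQL 9.5+ and a unique constraint/index on tbl_id
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0;  -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT * FROM tmp
ON CONFLICT (tbl_id) DO NOTHING;  -- silently skip rows whose tbl_id already exists

DROP TABLE tmp;
```

This also skips duplicates within the CSV batch itself, which the LEFT JOIN variant does not handle.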

First, just for completeness, I changed Erwin's code to use EXCEPT:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
EXCEPT
SELECT *
FROM tbl;
DROP TABLE tmp;
Then I resolved to test it myself. I tested it on 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the origin table 30 thousand; 15 thousand already exist in the target table.
create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;
I asked for the EXPLAIN of the SELECT part only. The EXCEPT version:
explain
select *
from tmp
except
select *
from tbl
;
QUERY PLAN
----------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..270098.68 rows=200 width=4)
-> Append (cost=0.00..245018.94 rows=10031897 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..771.40 rows=31920 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..244247.54 rows=9999977 width=4)
-> Seq Scan on tbl (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)
The outer join version:
explain
select *
from
tmp
left join
tbl using (id)
where tbl.id is null
;
QUERY PLAN
--------------------------------------------------------------------------
Nested Loop Anti Join (cost=0.00..208142.58 rows=15960 width=4)
-> Seq Scan on tmp (cost=0.00..452.20 rows=31920 width=4)
-> Index Scan using tbl_pkey on tbl (cost=0.00..7.80 rows=1 width=4)
Index Cond: (tmp.id = id)
(4 rows)

Related

Antijoin with PostgreSQL

I have a problem optimizing a query with PostgreSQL 10.4.
For example, when I run
select * from t1 where i not in (select j from t2)
I expect pg to use the index on t2.j, but it does not. Here is the plan that I get:
Seq Scan on t1 (cost=169.99..339.99 rows=5000 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..144.99 rows=9999 width=4)
Is pg not able to use indexes for an anti-join, or is there something obvious that I'm missing?
The SQL that I used to create the tables :
create table t1(i integer);
insert into t1(i) select s from generate_series(1, 10000) s;
create table t2(j integer);
insert into t2(j) select s from generate_series(1, 9999) s;
create index index_j on t2(j);
I have a similar problem with tables over 1 million rows, and using table scans just to fetch a few hundred records is very slow...
thanks,
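PostgreSQL does not convert NOT IN (subquery) into an anti-join, because NOT IN has three-valued NULL semantics: if t2.j could contain a NULL, the predicate can never be true for any row. Rewriting with NOT EXISTS, which the planner does turn into an anti-join, makes the index a candidate. A sketch against the tables above:

```sql
-- NOT EXISTS is planned as an Anti Join, so index_j on t2(j) becomes an option:
SELECT *
FROM t1
WHERE NOT EXISTS (
    SELECT 1 FROM t2 WHERE t2.j = t1.i
);
```

Note that the planner may still prefer a hash anti-join over sequential scans when that is cheaper for the table sizes involved; the rewrite only makes the index usable, it does not force it.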

Missing table access in PostgreSQL query plan

I have two identical tables, one with 1k rows and the other with 1M rows. I use the following script to populate them.
CREATE TABLE Table1 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
CREATE TABLE Table2 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
INSERT
INTO Table1
WITH t1 AS
(
SELECT id
FROM generate_series(1, 10000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
INSERT
INTO Table2
WITH t1 AS
(
SELECT id
FROM generate_series(1, 1000000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
I also created a secondary index on Table2:
CREATE INDEX ix_Table2_groupby_orderby ON Table2 (groupby, orderby);
Now, I have the following query
select b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding
from Table2 b
join Table1 a on b.orderby = a.id
where a.global_search = 1 and b.groupby < 10;
which leads to the following query plan using explain(analyze)
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.056..34.722 rows=100 loops=1)"
" -> Seq Scan on table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.033..1.313 rows=10 loops=1)"
" Filter: (global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.337 rows=10 loops=10)"
" Index Cond: ((groupby < 10) AND (orderby = a.id))"
"Planning time: 0.296 ms"
"Execution time: 34.775 ms"
and my question is: how come it does not access table2 in the query plan? It uses just ix_table2_groupby_orderby, but that contains only the groupby and orderby (and maybe id) columns. How does it get the remaining columns of Table2, and why is that access not shown in the query plan?
** EDIT **
I tried explain(verbose) as suggested by @laurenzalbe. This is the result:
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.070..35.678 rows=100 loops=1)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" -> Seq Scan on public.table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.031..1.642 rows=10 loops=1)"
" Output: a.id, a.groupby, a.orderby, a.local_search, a.global_search, a.padding"
" Filter: (a.global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on public.table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.398 rows=10 loops=10)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" Index Cond: ((b.groupby < 10) AND (b.orderby = a.id))"
"Planning time: 16.201 ms"
"Execution time: 35.754 ms"
Actually, I do not fully understand why the heap access for table2 is not shown, but I accept this as an answer.
An index scan in PostgreSQL accesses not only the index, but also the table. This is not explicitly shown in the execution plan and is necessary to find out if a row is visible to the transaction or not.
Try EXPLAIN (VERBOSE) to see what columns are returned.
See the documentation for details:
All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap.
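As a footnote to the quoted paragraph: since 9.2, a query whose referenced columns are all present in one index can use an index-only scan and skip the heap for pages marked all-visible. On PostgreSQL 11+, a hypothetical covering index for the query above could carry the payload columns with INCLUDE:

```sql
-- Hypothetical covering index (PostgreSQL 11+): key columns for the search,
-- INCLUDE for the columns the query merely returns
CREATE INDEX ix_table2_covering ON Table2 (groupby, orderby)
    INCLUDE (id, local_search, global_search, padding);

VACUUM Table2;  -- refresh the visibility map so index-only scans can skip the heap
```

Without a recent VACUUM the executor still has to visit the heap for visibility checks, so the plan may show an Index Only Scan with a high "Heap Fetches" count.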

Postgresql partition into a fixed set of files by ID

Apache Spark has the option to split into multiple files with the bucketBy command. For example if I have 100 million user IDs, I can split the table into 32 different files, where some type of hashing algorithm is used to distribute and lookup the data between files.
Can Postgres split tables into a fixed number of partitions somehow? If it's not a native feature, can it still be accomplished, for example by generating a hash, turning the hash into a number, and taking modulo 32 as the partition range?
An example with modulo. First, a short partition setup:
db=# create table p(i int);
CREATE TABLE
db=# create table p1 ( check (mod(i,3)=0) ) inherits (p);
CREATE TABLE
db=# create table p2 ( check (mod(i,3)=1) ) inherits (p);
CREATE TABLE
db=# create table p3 ( check (mod(i,3)=2) ) inherits (p);
CREATE TABLE
db=# create rule pir3 AS ON insert to p where mod(i,3) = 2 do instead insert into p3 values (new.*);
CREATE RULE
db=# create rule pir2 AS ON insert to p where mod(i,3) = 1 do instead insert into p2 values (new.*);
CREATE RULE
db=# create rule pir1 AS ON insert to p where mod(i,3) = 0 do instead insert into p1 values (new.*);
CREATE RULE
checking:
db=# insert into p values (1),(2),(3),(4),(5);
INSERT 0 0
db=# select * from p;
i
---
3
1
4
2
5
(5 rows)
db=# select * from p1;
i
---
3
(1 row)
db=# select * from p2;
i
---
1
4
(2 rows)
db=# select * from p3;
i
---
2
5
(2 rows)
https://www.postgresql.org/docs/current/static/tutorial-inheritance.html
https://www.postgresql.org/docs/current/static/ddl-partitioning.html
and demo of partitions working:
db=# explain analyze select * from p where mod(i,3) = 2;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Append (cost=0.00..48.25 rows=14 width=4) (actual time=0.013..0.015 rows=2 loops=1)
-> Seq Scan on p (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (mod(i, 3) = 2)
-> Seq Scan on p3 (cost=0.00..48.25 rows=13 width=4) (actual time=0.009..0.011 rows=2 loops=1)
Filter: (mod(i, 3) = 2)
Planning time: 0.203 ms
Execution time: 0.052 ms
(7 rows)
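Since PostgreSQL 11, this is a native declarative feature: PARTITION BY HASH distributes rows across a fixed number of partitions, replacing the inheritance-and-rules setup above. A minimal sketch (table and partition names are illustrative):

```sql
CREATE TABLE users (user_id bigint NOT NULL) PARTITION BY HASH (user_id);

-- One partition per remainder; in practice generate all 32 statements
-- with a DO block or psql's \gexec
CREATE TABLE users_p0 PARTITION OF users FOR VALUES WITH (MODULUS 32, REMAINDER 0);
CREATE TABLE users_p1 PARTITION OF users FOR VALUES WITH (MODULUS 32, REMAINDER 1);
-- ... and so on through REMAINDER 31
```

Inserts into the parent are routed automatically, and queries with an equality condition on user_id are pruned to a single partition.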

Acquiring row-level locks in order

I have a table where I am updating multiple rows inside a transaction.
DROP SCHEMA IF EXISTS s CASCADE;
CREATE SCHEMA s;
CREATE TABLE s.t1 (
"id1" Bigint,
"id2" Bigint,
CONSTRAINT "pk1" PRIMARY KEY (id1)
)
WITH(OIDS=FALSE);
INSERT INTO s.t1( id1, id2 )
SELECT x, x * 100
FROM generate_series( 1,10 ) x;
END TRANSACTION;
BEGIN TRANSACTION;
SELECT id1 FROM s.t1 WHERE id1 > 3 and id1 < 6 ORDER BY id1 FOR UPDATE; /* row lock */
I am assuming this will take the row-level locks in order of id1.
Is my assumption correct?
If so, I can run multiple transactions without worrying about deadlocks caused by the order of locks on rows.
END TRANSACTION;
BEGIN TRANSACTION;
SELECT id1,id2 FROM s.t1 order by id1;
DROP SCHEMA s CASCADE;
I ran an EXPLAIN.
EXPLAIN SELECT id1 FROM s.t1 WHERE id1 > 3 and id1 < 6 ORDER BY id1 FOR UPDATE;
QUERY PLAN
------------------------------------------------------------------------------
LockRows (cost=15.05..15.16 rows=9 width=14)
-> Sort (cost=15.05..15.07 rows=9 width=14)
Sort Key: id1
-> Bitmap Heap Scan on t1 (cost=4.34..14.91 rows=9 width=14)
Recheck Cond: ((id1 > 3) AND (id1 < 6))
-> Bitmap Index Scan on pk1 (cost=0.00..4.34 rows=9 width=0)
Index Cond: ((id1 > 3) AND (id1 < 6))
(7 rows)
Answer: This is correct.
Thanks

How to efficiently select rows having a MIN date in Postgres

I need to quickly select a value (baz) from the "earliest" (MIN(save_date)) rows, grouped by their foo_id. The following query returns the correct rows (well, almost: it can return multiple rows per foo_id when there are duplicate save_dates).
The foos table contains about 55k rows and the samples table contains about 25 million rows.
CREATE TABLE foos (
foo_id int,
val varchar(40),
-- ref_id is a FK, constraint omitted for brevity
ref_id int
);
CREATE TABLE samples (
sample_id int,
save_date date,
baz smallint,
-- foo_id is a FK, constraint omitted for brevity
foo_id int
);
WITH foo ( foo_id, val ) AS (
SELECT foo_id, val FROM foos
WHERE foos.ref_id = 1
ORDER BY foos.val ASC
LIMIT 25 OFFSET 0
)
SELECT foo.val, firsts.baz
FROM foo
LEFT JOIN (
SELECT A.baz, A.foo_id
FROM samples A
INNER JOIN (
SELECT foo_id, MIN( save_date ) AS save_date
FROM samples
GROUP BY foo_id
) B
USING ( foo_id, save_date )
) firsts USING ( foo_id )
This query currently takes over 100 seconds; I'd like to see this return in ~1 second (or less!).
How can I write this query to be optimal?
Updated; adding explains:
Obviously the actual query I'm using isn't using tables foo, baz, etc.
The "dumbed down" example query's (from above) explain:
Hash Right Join (cost=337.69..635.47 rows=3 width=100)
Hash Cond: (a.foo_id = foo.foo_id)
CTE foo
-> Limit (cost=71.52..71.53 rows=3 width=102)
-> Sort (cost=71.52..71.53 rows=3 width=102)
Sort Key: foos.val
-> Seq Scan on foos (cost=0.00..71.50 rows=3 width=102)
Filter: (ref_id = 1)
-> Hash Join (cost=265.25..562.90 rows=9 width=6)
Hash Cond: ((a.foo_id = samples.foo_id) AND (a.save_date = (min(samples.save_date))))
-> Seq Scan on samples a (cost=0.00..195.00 rows=1850 width=10)
-> Hash (cost=244.25..244.25 rows=200 width=8)
-> HashAggregate (cost=204.25..224.25 rows=200 width=8)
-> Seq Scan on samples (cost=0.00..195.00 rows=1850 width=8)
-> Hash (cost=0.60..0.60 rows=3 width=102)
-> CTE Scan on foo (cost=0.00..0.60 rows=3 width=102)
If I understand the question, you want windowing.
WITH find_first AS (
SELECT foo_id, baz,
row_number()
OVER (PARTITION BY foo_id ORDER BY foo_id, save_date) AS rnum
FROM samples
)
SELECT foo_id, baz FROM find_first WHERE rnum = 1;
Using row_number instead of rank eliminates duplicates and guarantees only one baz per foo. If you also need foos that have no bazzes, just LEFT JOIN the foos table to this query.
With an index on (foo_id, save_date), the optimizer should be smart enough to do the grouping keeping only one baz and skipping merrily along.
row_number() is a beautiful beast, but DISTINCT ON is simpler here.
WITH foo AS (
SELECT foo_id
FROM foos
WHERE ref_id = 1
ORDER BY val
LIMIT 25 OFFSET 0
)
SELECT DISTINCT ON (1) f.foo_id, s.baz
FROM foo f
LEFT JOIN samples s USING (foo_id)
ORDER BY f.foo_id, s.save_date, s.baz;
This assumes you want exactly 1 row per foo_id. If multiple rows in samples share the same earliest save_date, baz serves as the tie-breaker.
The case is very similar to this question from yesterday.
More advice:
Don't select val in the CTE, you only need it in ORDER BY.
To avoid expensive sequential scans on foos:
If you are always after rows from foos with ref_id = 1, create a partial multi-column index:
CREATE INDEX foos_val_part_idx ON foos (val)
WHERE ref_id = 1;
If ref_id is variable:
CREATE INDEX foos_ref_id_val_idx ON foos (ref_id, val);
The other index that would help best on samples:
CREATE INDEX samples_foo_id_save_date_baz_idx
ON samples (foo_id, save_date, baz);
These indexes become even more effective with the new "index-only scans" in version 9.2. Details and links here.