Optimize Postgres deletion of orphaned records - sql

Take the following two tables:
Table "public.contacts"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('contacts_id_seq'::regclass) | plain | |
created_at | timestamp without time zone | not null | plain | |
updated_at | timestamp without time zone | not null | plain | |
external_id | integer | | plain | |
email_address | character varying | | extended | |
first_name | character varying | | extended | |
last_name | character varying | | extended | |
company | character varying | | extended | |
industry | character varying | | extended | |
country | character varying | | extended | |
region | character varying | | extended | |
ext_instance_id | integer | | plain | |
title | character varying | | extended | |
Indexes:
"contacts_pkey" PRIMARY KEY, btree (id)
"index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)
and
Table "public.members"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('members_id_seq'::regclass) | plain | |
step_id | integer | | plain | |
contact_id | integer | | plain | |
rule_id | integer | | plain | |
request_id | integer | | plain | |
sync_id | integer | | plain | |
status | integer | not null default 0 | plain | |
matched_targeted_rule | boolean | default false | plain | |
external_fields | jsonb | | extended | |
imported_at | timestamp without time zone | | plain | |
campaign_id | integer | | plain | |
ext_instance_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
Indexes:
"members_pkey" PRIMARY KEY, btree (id)
"index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
"index_members_on_campaign_id" btree (campaign_id)
"index_members_on_step_id" btree (step_id)
"index_members_on_sync_id" btree (sync_id)
"index_members_on_request_id" btree (request_id)
"index_members_on_status" btree (status)
Indices exist for both primary keys and members.contact_id.
I need to delete any contact which has no related members. There are roughly 3MM contact and 25MM member records.
I'm attempting the following two queries:
Query 1:
DELETE FROM "contacts"
WHERE "contacts"."id" IN (SELECT "contacts"."id"
FROM "contacts"
LEFT OUTER JOIN members
ON
members.contact_id = contacts.id
WHERE members.id IS NULL);
DELETE 0
Time: 173033.801 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
-> Nested Loop (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
-> HashAggregate (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
Group Key: contacts_1.id
-> Hash Right Join (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
Hash Cond: (members.contact_id = contacts_1.id)
Filter: (members.id IS NULL)
Rows Removed by Filter: 26725870
-> Seq Scan on members (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
-> Hash (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 34361kB
-> Seq Scan on contacts contacts_1 (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
-> Index Scan using contacts_pkey on contacts (cost=0.43..1.48 rows=1 width=10) (never executed)
Index Cond: (id = contacts_1.id)
Planning time: 0.488 ms
Execution time: 188718.862 ms
Query 2:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
DELETE 0
Time: 170871.219 ms
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
-> Hash Anti Join (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
-> Hash (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 36228kB
-> Seq Scan on members c (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
Planning time: 0.901 ms
Execution time: 177524.526 ms
As you can see, even without deleting any records, both queries show similar performance, taking ~3 minutes.
The server's disk I/O spikes to 100%, so I'm assuming data is being spilled to disk because a sequential scan is done on both contacts and members.
The server is an EC2 r3.large (15GB RAM).
Any ideas on what I can do to optimize this query?
Update #1:
After running VACUUM ANALYZE on both tables and ensuring enable_mergejoin is set to on, there is no difference in the query time:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
-> Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
-> Hash (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 37006kB
-> Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Planning time: 0.328 ms
Execution time: 209408.040 ms
Update 2:
PG Version:
PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit
Relation size:
Table | Size | External Size
-----------------------+---------+---------------
members | 23 GB | 11 GB
contacts | 944 MB | 371 MB
Settings:
work_mem
----------
64MB
random_page_cost
------------------
4
Update 3:
Experimenting with doing this in batches doesn't seem to help with I/O usage (it still spikes to 100%) and doesn't improve the run time, despite using index-based plans.
DO $do$
BEGIN
FOR i IN 57..668
LOOP
DELETE
FROM contacts
WHERE contacts.id IN
(
SELECT contacts.id
FROM contacts
left outer join members
ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND contacts.id >= (i * 10000)
AND contacts.id < ((i+1) * 10000));
END LOOP;
END $do$;
I had to kill the query after Time: 1203492.326 ms and disk I/O stayed at 100% the entire time the query ran. I also experimented with 1,000 and 5,000 chunks but did not see any increase in performance.
Note: The 57..668 range was used because it covers the existing contact IDs (i.e. derived from min(id) and max(id)).

One approach to problems like this can be to do it in smaller chunks.
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND id >= 1 AND id < 1000
);
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND id >= 1001 AND id < 2000
);
Rinse and repeat. Experiment with different chunk sizes to find the optimal one for your data set: one that uses the fewest queries while keeping each of them in memory.
Naturally, you would want to script this, possibly in plpgsql, or in whatever scripting language you prefer.
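For example, a minimal plpgsql sketch along those lines (the chunk size is illustrative; note that a single DO block runs in one transaction, so for truly independent batches you would issue separate DELETE statements from an external script):
DO $do$
DECLARE
    chunk  integer := 10000;   -- width of each id range per batch; tune this
    max_id integer;
BEGIN
    SELECT max(id) INTO max_id FROM contacts;
    FOR lower_bound IN 0..max_id BY chunk LOOP
        DELETE FROM contacts
        WHERE contacts.id IN (
            SELECT contacts.id
            FROM contacts
            LEFT OUTER JOIN members ON members.contact_id = contacts.id
            WHERE members.id IS NULL
              AND contacts.id >= lower_bound
              AND contacts.id <  lower_bound + chunk
        );
    END LOOP;
END
$do$;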

Any ideas on what I can do to optimize this query?
Your queries are perfect. I would use the NOT EXISTS variant.
Your index index_members_on_contact_id_and_step_id is also good for it:
Is a composite index also good for queries on the first field?
But see below about BRIN indexes.
You can tune your server, table and index configuration.
Since you do not actually update or delete many rows (hardly any at all, according to your comment?), you need to optimize read performance.
1. Upgrade your Postgres version
You provided:
The server is an EC2 r3.large (15GB RAM).
And:
PostgreSQL 9.4.4
Your version is seriously outdated. At least upgrade to the latest minor version. Better yet, upgrade to the current major version. Postgres 9.5 and 9.6 brought major improvements for big data - which is exactly what you need.
Consider the versioning policy of the project.
Amazon allows you to upgrade!
2. Improve table statistics
There is an unexpected 10% mismatch between expected and actual row count in the basic sequential scan:
Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Not dramatic at all, but it still should not occur in this query. It indicates that you might have to tune your autovacuum settings - possibly per table for the very big ones.
More problematic:
Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Postgres expects to find 1875003 rows to delete, while actually 0 rows are found. That's unexpected. Maybe substantially increasing the statistics target on members.contact_id and contacts.id can help to decrease the gap, which might allow better query plans. See:
Keep PostgreSQL from sometimes choosing a bad query plan
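For example (1000 is just an illustrative target; the default is 100):
ALTER TABLE members  ALTER COLUMN contact_id SET STATISTICS 1000;
ALTER TABLE contacts ALTER COLUMN id SET STATISTICS 1000;
ANALYZE members;
ANALYZE contacts;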
3. Avoid table and index bloat
Your ~ 25MM rows in members occupy 23 GB - that's almost 1kb per row, which seems excessive for the table definition you presented (even if the total size you provided should include indexes):
4 bytes item identifier
24 tuple header
8 null bitmap
36 9x integer
16 2x ts
1 1x bool
?? 1x jsonb
See:
Making sense of Postgres row sizes
That's 89 bytes per row - or less with some NULL values - and hardly any alignment padding, so 96 bytes max, plus your jsonb column.
Either that jsonb column is very big, in which case I would suggest normalizing the data into separate columns or a separate table. Consider:
How to perform update operations on columns of type JSONB in Postgres 9.4
Or your table is bloated, which can be solved with VACUUM FULL ANALYZE or, while you are at it:
CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;
But either takes an exclusive lock on the table, which you say you cannot afford. pg_repack can do it without an exclusive lock. See:
VACUUM returning disk space to operating system
Even if we factor in index sizes, your table seems too big: you have 7 small indexes, each 36 - 44 bytes per row without bloat, less with NULL values, so < 300 bytes altogether.
Either way, consider more aggressive autovacuum settings for your table members (see the example after these links). Related:
Aggressive Autovacuum on PostgreSQL
What fillfactor for caching table?
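A sketch of per-table settings (the values are illustrative starting points, not recommendations - tune them for your workload):
ALTER TABLE members SET (
    autovacuum_vacuum_scale_factor  = 0.01,  -- vacuum after ~1 % dead rows instead of the default 20 %
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay    = 10     -- let autovacuum do more work per unit of time
);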
And / or stop bloating the table to begin with. Are you updating rows a lot? Any particular column you update a lot? That jsonb column maybe? You might move that to a separate (1:1) table just to stop bloating the main table with dead tuples and preventing autovacuum from keeping up.
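A minimal sketch of such a split (the side table name is made up):
CREATE TABLE member_external_fields (
    member_id       integer PRIMARY KEY REFERENCES members (id),
    external_fields jsonb
);
INSERT INTO member_external_fields (member_id, external_fields)
SELECT id, external_fields
FROM   members
WHERE  external_fields IS NOT NULL;
-- after verifying the copy:
-- ALTER TABLE members DROP COLUMN external_fields;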
4. Try a BRIN index
Block range indexes require Postgres 9.5 or later and dramatically reduce index size. I was too optimistic in my first draft. A BRIN index is perfect for your use case if you have many rows in members for each contact.id - after physically clustering your table at least once (see ③ for the fitting CLUSTER command). In that case Postgres can rule out whole data pages quickly. But your numbers indicate only around 8 rows per contact.id, so data pages would often contain multiple values, which voids much of the effect. Depends on actual details of your data distribution ...
On the other hand, as it stands, your tuple size is around 1 kb, so only ~ 8 rows per data page (typically 8kb). If that isn't mostly bloat, a BRIN index might help after all.
But you need to upgrade your server version first. See ①.
CREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);

Update statistics used by the planner and set enable_mergejoin to on:
vacuum analyse members;
vacuum analyse contacts;
set enable_mergejoin to on;
You should get a query plan similar to this one:
explain analyse
delete from contacts
where not exists (
select 1
from members c
where c.contact_id = contacts.id);
QUERY PLAN
----------------------------------------------------------------------
Delete on contacts
-> Merge Anti Join
Merge Cond: (contacts.id = c.contact_id)
-> Index Scan using contacts_pkey on contacts
-> Index Scan using members_contact_id_idx on members c

Here is another variant to try:
DELETE FROM contacts
USING contacts c
LEFT JOIN members m ON c.id = m.contact_id
WHERE m.contact_id IS NULL
AND contacts.id = c.id;
It uses a technique for deleting from a joined query described here.
I can't vouch for this definitely being faster, but it might be, since it avoids a subquery. I would be interested in the results...
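One way to compare the variants without actually removing anything is to run EXPLAIN (ANALYZE, BUFFERS) inside a transaction and roll it back (the DELETE still executes and holds its locks until the rollback):
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM contacts
USING contacts c
LEFT JOIN members m ON c.id = m.contact_id
WHERE m.contact_id IS NULL
AND contacts.id = c.id;
ROLLBACK;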

Using a subquery in the WHERE clause takes a lot of time.
You should use WITH and USING; this will be a lot faster:
WITH c_not_member AS (
    -- Extract the ids of contacts that do not appear in members.
    SELECT c.id
    FROM contacts c
    LEFT JOIN members m ON c.id = m.contact_id
    -- To find contacts with no member, test a members column that cannot
    -- be NULL - here the primary key m.id. It is NULL only when no
    -- m.contact_id matches c.id, i.e. when c.id does not exist in m.contact_id.
    WHERE m.id IS NULL
)
DELETE FROM contacts all_c
USING c_not_member
WHERE all_c.id = c_not_member.id;

Related

Slow running PostgreSQL query

I am currently trying to migrate a system to Postgres and I unfortunately cannot understand why a specific query runs so incredibly slowly. In both Transact-SQL and Oracle the same query runs in under 200 ms. First things first, though: I have a big table with 14,000,000 entries, and it only keeps growing. Furthermore, the table has 54 columns, so we are dealing with quite a lot of data.
The table has a rather straight forward structure like this:
CREATE TABLE logtable (
key varchar(20) NOT NULL,
column1 int4 NULL,
entrytype int4 NULL,
column2 int4 NULL,
column3 int4 NULL,
column4 int4 NULL,
column5 int4 NULL,
column6 int4 NULL,
column7 varchar(128) NULL,
column8 varchar(2048) NULL,
column9 varchar(2048) NULL,
column10 varchar(2048) NULL,
...
timestampcol timestamp NULL,
column48 timestamp NULL,
column49 timestamp NULL,
column50 timestamp NULL,
column51 timestamp NULL,
column52 int4 NULL,
column53 int4 NULL,
column54 varchar(20) NULL,
CONSTRAINT key PRIMARY KEY (key)
);
We also have a few predefined indexes:
CREATE INDEX idx1 ON logtable USING btree (id);
CREATE INDEX idx2 ON logtable USING btree (archiveinterval);
CREATE INDEX idx3 ON logtable USING btree (archivestatus);
CREATE INDEX idx4 ON logtable USING btree (entrytype);
CREATE INDEX idx5 ON logtable USING btree (column34);
CREATE INDEX idx6 ON logtable USING btree (timestampcol);
Now the actual query that I perform is the following:
SELECT column1,..,column54
FROM logtable
where ((entrytype = 4000 or entrytype = 4001 or entrytype = 4002) and (archivestatus <= 1))
order by timestampcol desc;
This results in roughly 500K selected items.
When establishing the connection, I also pass defaultRowFetchSize=5000 so the driver doesn't try to fetch the full result set at once. As mentioned before, the same query takes about 200 ms in Oracle and MSSQL, which leaves me wondering what exactly is going on here. When I add a LIMIT 100, the query time drops to 100 ms.
I've already set these variables higher, since I've seen them recommended in multiple forum threads:
maintenance_work_mem 1GB
shared_buffers 2GB
I've also tried to make sense of the EXPLAIN ANALYZE output for the query. As I see it, about 49 s are spent just on the bitmap heap scan.
Gather Merge (cost=459158.89..507278.61 rows=412426 width=2532) (actual time=57323.536..59044.943 rows=514825 loops=1)
Output: key, column2 ... column54
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1411 read=292867
-> Sort (cost=458158.86..458674.40 rows=206213 width=2532) (actual time=57243.386..57458.979 rows=171608 loops=3)
Output: key, column2 ... column54
Sort Key: logtable.timestampcol DESC
Sort Method: quicksort Memory: 60266kB
Worker 0: Sort Method: quicksort Memory: 57572kB
Worker 1: Sort Method: quicksort Memory: 57878kB
Buffers: shared hit=1411 read=292867
Worker 0: actual time=57218.621..57449.331 rows=168159 loops=1
Buffers: shared hit=470 read=94622
Worker 1: actual time=57192.076..57423.333 rows=169151 loops=1
Buffers: shared hit=461 read=95862
-> Parallel Bitmap Heap Scan on logtable (cost=9332.66..439956.67 rows=206213 width=2532) (actual time=1465.971..56452.327 rows=171608 loops=3)
Output: key, column2 ... column54
Recheck Cond: ((logtable.entrytype = 4000) OR (logtable.entrytype = 4001) OR (logtable.entrytype = 4002))
Filter: ((logtable.entrytype = 4000) OR (logtable.entrytype = 4001) OR ((logtable.entrytype = 4002) AND (logtable.archivestatus <= 1)))
Heap Blocks: exact=101535
Buffers: shared hit=1397 read=292867
Worker 0: actual time=1440.278..56413.158 rows=168159 loops=1
Buffers: shared hit=463 read=94622
Worker 1: actual time=1416.245..56412.907 rows=169151 loops=1
Buffers: shared hit=454 read=95862
-> BitmapOr (cost=9332.66..9332.66 rows=500289 width=0) (actual time=1358.696..1358.697 rows=0 loops=1)
Buffers: shared hit=6 read=1322
-> Bitmap Index Scan on idx4(entrytype) (cost=0.00..1183.80 rows=66049 width=0) (actual time=219.270..219.271 rows=65970 loops=1)
Index Cond: (logtable.entrytype = 4000)
Buffers: shared hit=1 read=171
-> Bitmap Index Scan on idx4(entrytype) (cost=0.00..3792.43 rows=211733 width=0) (actual time=691.854..691.855 rows=224437 loops=1)
Index Cond: (logtable.entrytype = 4001)
Buffers: shared hit=2 read=576
-> Bitmap Index Scan on idx4(entrytype) (cost=0.00..3985.24 rows=222507 width=0) (actual time=447.558..447.558 rows=224418 loops=1)
Index Cond: (logtable.entrytype = 4002)
Buffers: shared hit=3 read=575
Planning Time: 0.562 ms
Execution Time: 59503.154 ms
When I run the same query WITHOUT the ORDER BY, it finishes in about 1.6 s, which seems reasonable enough. When I take away the WHERE clause, the query finishes in 86 ms, which is due to my idx6.
I am kind of out of ideas. I've tried multiple indexes, including composite indexes like (entrytype, archivestatus, timestampcol) in different orders and with DESC. Is there something else I could try?
UPDATE:
Since a few of you asked, here is the query execution plan for Oracle. As I said, the literally identical statement with the same indexes runs in 0.2 seconds in Oracle, whereas it needs about 30-50 s in Postgres.
------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6878 | 2491K| 2147 (1)| 00:00:01 |
| 1 | SORT ORDER BY | | 6878 | 2491K| 2147 (1)| 00:00:01 |
| 2 | CONCATENATION | | | | | |
|* 3 | TABLE ACCESS BY INDEX ROWID BATCHED | logtable | 712 | 257K| 168 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | entrytype | 712 | | 5 (0)| 00:00:01 |
| 5 | INLIST ITERATOR | | | | | |
|* 6 | TABLE ACCESS BY INDEX ROWID BATCHED| logtable | 6166 | 2233K| 1433 (1)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | idx_entrytype | 6166 | | 22 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------
As someone mentioned, I had already tried setting enable_bitmapscan to off, but it didn't quite help. It had an impact, making the query faster, but not to the point where I would consider using it.
Gather Merge (cost=543407.97..593902.72 rows=432782 width=2538) (actual time=26207.686..27543.386 rows=515559 loops=1)
Output: column1 ... column54
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=258390 read=147694 dirtied=3 written=1
-> Sort (cost=542407.94..542948.92 rows=216391 width=2538) (actual time=26135.793..26300.677 rows=171853 loops=3)
Output: column1 ... column54
Sort Key: logtable.timestampcol DESC
Sort Method: quicksort Memory: 61166kB
Worker 0: Sort Method: quicksort Memory: 56976kB
Worker 1: Sort Method: quicksort Memory: 57770kB
Buffers: shared hit=258390 read=147694 dirtied=3 written=1
Worker 0: actual time=26100.640..26257.665 rows=166629 loops=1
Buffers: shared hit=83315 read=48585 dirtied=2
Worker 1: actual time=26102.323..26290.745 rows=169509 loops=1
Buffers: shared hit=84831 read=48779
-> Parallel Seq Scan on logtable (cost=0.00..523232.15 rows=216391 width=2538) (actual time=3.752..25627.657 rows=171853 loops=3)
Output: column1 ... column54
Filter: ((logtable.entrytype = 4000) OR (logtable.entrytype = 4001) OR ((logtable.entrytype = 4002) AND (logtable.archivestatus <= 1)))
Rows Removed by Filter: 4521112
Buffers: shared hit=258294 read=147694 dirtied=3 written=1
Worker 0: actual time=1.968..25599.701 rows=166629 loops=1
Buffers: shared hit=83267 read=48585 dirtied=2
Worker 1: actual time=3.103..25604.552 rows=169509 loops=1
Buffers: shared hit=84783 read=48779
Planning Time: 0.816 ms
Execution Time: 27914.204 ms
Just to clarify, my hope is that there is some kind of configuration, index, or something else I've missed. Since we have a generic mechanism creating this query, it would be quite ugly to implement a database-specific query JUST for this one table. It's by far our biggest table, containing log entries from throughout the system. (I don't exactly like this design, but it is what it is.) There must be a reason why Postgres in particular is that much slower than the other databases when handling this much data.
As multiple users pointed out, the condition should be:
where ((entrytype = 4000 or entrytype = 4001 or entrytype = 4002) and (archivestatus <= 1))
NOT
where (entrytype = 4000 or entrytype = 4001 or entrytype = 4002 and (archivestatus <= 1))
Sorry for the confusion.
Index lookups are expensive, so the SQL engine tends to scan the table directly instead of using indexes when it has to retrieve many columns.
You can try a composite index on (entrytype, archivestatus), in that order, to extract the relevant keys and then re-join the main table to fetch all columns.
Also keep the index on timestampcol for ordering the results.
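For instance, the index could be created like this (the name is just illustrative), with the rewritten query below extracting the keys first and re-joining for the remaining columns:
CREATE INDEX idx_logtable_entrytype_archivestatus ON logtable (entrytype, archivestatus);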
SELECT column1,..,column54
FROM (
SELECT key _key
FROM logtable
where entrytype IN (4000, 4001, 4002) and (archivestatus <= 1)
) X
JOIN logtable L ON X._key = L.key
order by timestampcol desc;
Another option is to use the UNION operator to try to force index use on a reduced partition:
SELECT column1,..,column54
FROM (
SELECT key _key
FROM logtable
where (entrytype = 4000) and (archivestatus <= 1)
UNION ALL
SELECT key _key
FROM logtable
where (entrytype = 4001) and (archivestatus <= 1)
UNION ALL
SELECT key _key
FROM logtable
where (entrytype = 4002) and (archivestatus <= 1)
) X
JOIN logtable L ON X._key = L.key
order by timestampcol desc;
Results may vary depending on many factors, so you have to try the best approach for your environment.
After testing different versions of Postgres - 12.6 and 13.2 on Windows, and even 13.2 in a Linux Docker container - I could never reproduce the issue. We concluded that it must have been the environment acting up, or perhaps disk or network issues. Either way, this was not a Postgres error.

JSONB ILIKE indexing

I have a table people with body column as a jsonb type.
Table "public.people"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------------+-----------------------------+-----------+----------+--------------------+----------+--------------+-------------
id | uuid | | not null | uuid_generate_v4() | plain | |
body | jsonb | | not null | | extended | |
Indexes:
"people_pkey" PRIMARY KEY, btree (id)
"idx_name" gin ((body ->> 'name'::text) gin_trgm_ops)
My index looks as follows:
CREATE INDEX idx_name ON people USING gin ((body ->> 'name') gin_trgm_ops);
However, when I do:
EXPLAIN ANALYZE SELECT * FROM "people" WHERE ((body ->> 'name') ILIKE '%asd%') LIMIT 40 OFFSET 0;
I see:
Limit (cost=0.00..33.58 rows=40 width=104) (actual time=100.037..4066.964 rows=11 loops=1)
-> Seq Scan on people (cost=0.00..2636.90 rows=3141 width=104) (actual time=99.980..4066.782 rows=11 loops=1)
Filter: ((body ->> 'name'::text) ~~* '%asd%'::text)
Rows Removed by Filter: 78516
Planning time: 0.716 ms
Execution time: 4067.038 ms
Why is the index not used there?
Update:
To avoid confusion with the operators mentioned above, I will quote
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin
Gin comes with built-in support for one-dimensional arrays (eg.
integer[], text[]), but no support for NULL elements. The following
operations are available:
contains: value_array @> query_array
overlap: value_array && query_array
contained: value_array <@ query_array
if you want to take advantage of GIN, use @>, not the LIKE operator.
Also, please look at Erwin's much better answer to a closely related question.
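For illustration, a sketch of that containment approach (the index name and query value are made up; note that @> matches exact values, not '%asd%'-style substrings, so it is not a drop-in replacement for ILIKE):
CREATE INDEX idx_people_body ON people USING gin (body jsonb_path_ops);
EXPLAIN ANALYZE
SELECT * FROM people WHERE body @> '{"name": "asd"}' LIMIT 40;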

Can a View of multiple tables be used for Full-Text-Search?

I'm sorry to ask such a noob question, but the postgres documentation on views is sparse, and I had trouble finding a good answer.
I'm trying to implement Full-Text-Search on Postgres for three tables. Specifically, the user's search query would return matching 1) other usernames, 2) messages, 3) topics.
I'm concerned that using a view for this might not scale well as it combines three tables into one. Is this a legitimate concern? If not, how else might I approach this?
What you request can be done. To have a practical example (with just two tables), you could have:
CREATE TABLE users
(
user_id SERIAL PRIMARY KEY,
username text
) ;
-- Index to find usernames
CREATE INDEX idx_users_username_full_text
ON users
USING GIN (to_tsvector('english', username)) ;
CREATE TABLE topics
(
topic_id SERIAL PRIMARY KEY,
topic text
) ;
-- Index to find topics
CREATE INDEX idx_topics_topic_full_text
ON topics
USING GIN (to_tsvector('english', topic)) ;
See PostgreSQL docs. on Controlling Text Search for an explanation of to_tsvector.
... populate the tables
INSERT INTO users
(username)
VALUES
('Alice Cooper'),
('Boo Geldorf'),
('Carol Burnet'),
('Daniel Dafoe') ;
INSERT INTO topics
(topic)
VALUES
('Full text search'),
('Fear of void'),
('Alice in Wonderland essays') ;
... create a view that combines values from both tables
CREATE VIEW search_items AS
SELECT
text 'users' AS origin_table, user_id AS id, to_tsvector('english', username) AS searchable_element
FROM
users
UNION ALL
SELECT
text 'topics' AS origin_table, topic_id AS id, to_tsvector('english', topic) AS searchable_element
FROM
topics ;
We search that view:
SELECT
*
FROM
search_items
WHERE
plainto_tsquery('english', 'alice') @@ searchable_element
... and get the following response (you can mostly ignore searchable_element; you're mainly interested in origin_table and id):
origin_table | id | searchable_element
:----------- | -: | :--------------------------------
users | 1 | 'alic':1 'cooper':2
topics | 3 | 'alic':1 'essay':4 'wonderland':3
See Parsing Queries for an explanation of the plainto_tsquery function, and also the @@ operator.
To make sure indexes are used:
EXPLAIN ANALYZE
SELECT
*
FROM
search_items
WHERE
plainto_tsquery('english', 'alice') @@ searchable_element
| QUERY PLAN |
| :----------------------------------------------------------------------------------------------------------------------------------------- |
| Append (cost=12.05..49.04 rows=12 width=68) (actual time=0.017..0.031 rows=2 loops=1) |
| -> Bitmap Heap Scan on users (cost=12.05..24.52 rows=6 width=68) (actual time=0.017..0.018 rows=1 loops=1) |
| Recheck Cond: ('''alic'''::tsquery @@ to_tsvector('english'::regconfig, username)) |
| Heap Blocks: exact=1 |
| -> Bitmap Index Scan on idx_users_username_full_text (cost=0.00..12.05 rows=6 width=0) (actual time=0.005..0.005 rows=1 loops=1) |
| Index Cond: ('''alic'''::tsquery @@ to_tsvector('english'::regconfig, username)) |
| -> Bitmap Heap Scan on topics (cost=12.05..24.52 rows=6 width=68) (actual time=0.012..0.012 rows=1 loops=1) |
| Recheck Cond: ('''alic'''::tsquery @@ to_tsvector('english'::regconfig, topic)) |
| Heap Blocks: exact=1 |
| -> Bitmap Index Scan on idx_topics_topic_full_text (cost=0.00..12.05 rows=6 width=0) (actual time=0.002..0.002 rows=1 loops=1) |
| Index Cond: ('''alic'''::tsquery @@ to_tsvector('english'::regconfig, topic)) |
| Planning time: 0.098 ms |
| Execution time: 0.055 ms |
Indexes are really used (see Bitmap Index Scan on idx_topics_topic_full_text and Bitmap Index Scan on idx_users_username_full_text).
You can check everything at dbfiddle here
NOTE: 'english' is the text search configuration chosen to index and query. Choose the proper one for your case. You can create your own if the existing ones don't meet your needs.
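If you also need the third table from the question (messages), the same pattern extends naturally. A sketch with a hypothetical messages table:
CREATE TABLE messages
(
    message_id SERIAL PRIMARY KEY,
    message text
) ;
CREATE INDEX idx_messages_message_full_text
    ON messages
    USING GIN (to_tsvector('english', message)) ;
CREATE OR REPLACE VIEW search_items AS
SELECT
    text 'users' AS origin_table, user_id AS id, to_tsvector('english', username) AS searchable_element
FROM
    users
UNION ALL
SELECT
    text 'topics' AS origin_table, topic_id AS id, to_tsvector('english', topic) AS searchable_element
FROM
    topics
UNION ALL
SELECT
    text 'messages' AS origin_table, message_id AS id, to_tsvector('english', message) AS searchable_element
FROM
    messages ;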

Postgres Query Tuning

I have a table that holds historical records. Whenever a count gets updated, a record is added specifying that a new value was fetched at that time. The table schema looks like this:
Column | Type | Modifiers
---------------+--------------------------+--------------------------------------------------------------------
id | integer | not null default nextval('project_accountrecord_id_seq'::regclass)
user_id | integer | not null
created | timestamp with time zone | not null
service | character varying(200) | not null
metric | character varying(200) | not null
value | integer | not null
Now I'd like to get the total number of records updated each day, for the last seven days. Here's what I came up with:
SELECT
created::timestamp::date as created_date,
count(created)
FROM
project_accountrecord
GROUP BY
created::timestamp::date
ORDER BY
created_date DESC
LIMIT 7;
This runs slowly (11406.347ms). EXPLAIN ANALYZE gives:
Limit (cost=440939.66..440939.70 rows=7 width=8) (actual time=24184.547..24370.715 rows=7 loops=1)
-> GroupAggregate (cost=440939.66..477990.56 rows=6711746 width=8) (actual time=24184.544..24370.699 rows=7 loops=1)
-> Sort (cost=440939.66..444340.97 rows=6802607 width=8) (actual time=24161.120..24276.205 rows=92413 loops=1)
Sort Key: (((created)::timestamp without time zone)::date)
Sort Method: external merge Disk: 146328kB
-> Seq Scan on project_accountrecord (cost=0.00..153671.43 rows=6802607 width=8) (actual time=0.017..10132.970 rows=6802607 loops=1)
Total runtime: 24420.988 ms
There are a little over 6.8 million rows in this table. What can I do to increase performance of this query? Ideally I'd like it to run in under a second so I can cache it and update it in the background a couple of times a day.
Right now, your query must scan the whole table, calculate the count for every day, and only then limit the result to the 7 most recent days.
You can speed up the query by scanning only the last 7 days (or more, if you don't update records every day):
WHERE created > now()::date - '7 days'::interval
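Putting that together (the index is my addition, assuming none exists on created yet; the name is illustrative):
CREATE INDEX idx_project_accountrecord_created ON project_accountrecord (created);
SELECT
    created::timestamp::date as created_date,
    count(created)
FROM
    project_accountrecord
WHERE
    created > now()::date - '7 days'::interval
GROUP BY
    created::timestamp::date
ORDER BY
    created_date DESC
LIMIT 7;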
Another approach is to cache historical results in an extra table and count only the current day.

Getting syntax error at or near "FROM" while updating a table

I have two tables
junk=# select * from t;
name | intval
----------+--------
bar2 | 2
bar3 | 3
bar4 | 4
(3 rows)
and
junk=# select * from temp;
id | name | intval
----+------------+--------
1 | foo | 0
2 | foo2 | 2
3 | foo3 | 3
4 | foo4 | 4
5 | foo5 | 5
(5 rows)
Now, I want to use the values from table t to update the values in table temp. Basically, I want to replace the name column in the second, third and fourth rows of temp with bar2, bar3 and bar4.
I created the table t using the COPY statement. I am doing batch updates and I am trying to optimize that.
So, I get this error. I think it is a pretty basic one:
junk=# UPDATE temp FROM t SET name=t.name FROM t WHERE intval=t.intval;
ERROR: syntax error at or near "FROM"
LINE 1: UPDATE temp FROM t SET name=t.name FROM t WHERE intval=t.int...
^
junk=#
For now, this works:
UPDATE temp SET name=t.name FROM t WHERE temp.intval=t.intval
Get rid of your first FROM t clause.
FROM must come after SET, not before, and it can only affect the WHERE clause. Here, the SET is done with a subquery.
Your completed code is:
UPDATE temp SET name=(SELECT t.name FROM t WHERE temp.intval = t.intval);
PostgreSQL has some ways to optimize this so it's not like you are just doing a huge nested loop join (and looking up one row over and over from the heap based on the join criteria).
Edit: adding a plan to show that we are not, in fact, running a sequential scan of the second table for each row of the first one.
Here is an example that updates 172 rows in one table using a group-by from another:
mtech_test=# explain analyze
update ap
set amount = (select sum(amount) from acc_trans ac where ac.trans_id = ap.id) + 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Update on ap (cost=0.00..3857.06 rows=229 width=231) (actual time=39.074..39.074 rows=0 loops=1)
-> Seq Scan on ap (cost=0.00..3857.06 rows=229 width=231) (actual time=0.050..28.444 rows=172 loops=1)
SubPlan 1
-> Aggregate (cost=16.80..16.81 rows=1 width=5) (actual time=0.109..0.110 rows=1 loops=172)
-> Bitmap Heap Scan on acc_trans ac (cost=4.28..16.79 rows=4 width=5) (actual time=0.075..0.102 rows=4 loops=172)
Recheck Cond: (trans_id = ap.id)
-> Bitmap Index Scan on acc_trans_trans_id_key (cost=0.00..4.28 rows=4 width=0) (actual time=0.006..0.006 rows=4 loops=172)
Index Cond: (trans_id = ap.id)
Trigger for constraint ap_entity_id_fkey: time=69.532 calls=172
Trigger ap_audit_trail: time=391.722 calls=172
Trigger ap_track_global_sequence: time=1.954 calls=172
Trigger check_department: time=111.301 calls=172
Total runtime: 612.001 ms
(13 rows)