DISTINCT with ORDER BY very slow - sql

So I am using postgres for the first time and finding it rather slow to run distinct and group by queries, currently i am trying to find the latest record and whether or not it is working or not.
This is the first query I came up with:
SELECT DISTINCT ON (device_id) c.device_id, c.timestamp, c.working
FROM call_logs c
ORDER BY c.device_id, c.timestamp desc
And it works but it is taking along time to run.
Unique (cost=94840.24..97370.54 rows=11 width=17) (actual time=424.424..556.253 rows=13 loops=1)
-> Sort (cost=94840.24..96105.39 rows=506061 width=17) (actual time=424.423..531.905 rows=506061 loops=1)
Sort Key: device_id, "timestamp" DESC
Sort Method: external merge Disk: 13272kB
-> Seq Scan on call_logs c (cost=0.00..36512.61 rows=506061 width=17) (actual time=0.059..162.932 rows=506061 loops=1)
Planning time: 0.152 ms
Execution time: 557.957 ms
(7 rows)
I have updated the query to use the following which is faster but very ugly:
SELECT c.device_id, c.timestamp, c.working FROM call_logs c
INNER JOIN (SELECT c.device_id, MAX(c.timestamp) AS timestamp
FROM call_logs c
GROUP BY c.device_id)
newest on newest.timestamp = c.timestamp
and the analysis:
Nested Loop (cost=39043.34..39136.08 rows=12 width=17) (actual time=216.406..216.580 rows=15 loops=1)
-> HashAggregate (cost=39042.91..39043.02 rows=11 width=16) (actual time=216.347..216.351 rows=13 loops=1)
Group Key: c_1.device_id
-> Seq Scan on call_logs c_1 (cost=0.00..36512.61 rows=506061 width=16) (actual time=0.026..125.482 rows=506061 loops=1)
-> Index Scan using call_logs_timestamp on call_logs c (cost=0.42..8.44 rows=1 width=17) (actual time=0.016..0.016 rows=1 loops=13)
Index Cond: ("timestamp" = (max(c_1."timestamp")))
Planning time: 0.318 ms
Execution time: 216.631 ms
(8 rows)
Even 200ms does seem a little slow to me as all I want is the top record per device (which is in an indexed table)
AND this is my index it is using:
CREATE INDEX call_logs_timestamp
ON public.call_logs USING btree
(timestamp)
TABLESPACE pg_default;
I have tried the below index but does not help at all:
CREATE INDEX dev_ts_1
ON public.call_logs USING btree
(device_id, timestamp DESC, working)
TABLESPACE pg_default;
Any ideas am I missing something obvious?

200 ms really isn't that bad for going through 500k rows. But for this query:
SELECT DISTINCT ON (device_id) c.device_id, c.timestamp, c.working
FROM call_logs c
ORDER BY c.device_id, c.timestamp desc
Then your index on call_logs(device_id, timestamp desc, working) should be an optimal index.
Two other ways to write the query for the same index are:
select c.*
from (select c.device_id, c.timestamp, c.working, c.*,
row_number() over (partition by device_id order by timestamp desc) as seqnum
from call_logs c
) c
where seqnum = 1;
and:
select c.device_id, c.timestamp, c.working
from call_logs c
where not exists (select 1
from call_logs c2
where c2.device_id = c.device_id and
c2.timestamp > c.timestamp
);

Related

Why does this GROUP BY cause a full table scan?

I have a table of projects and a table of tasks, with each task referencing a single project. I want to get a list of projects sorted by their due dates along with the number of tasks in each project. I can pretty easily write this query two ways. First, using JOIN and GROUP BY:
SELECT p.name, p.due_date, COUNT(t.id) as num_tasks
FROM projects p
LEFT OUTER JOIN tasks t ON t.project_id = p.id
GROUP BY p.id
ORDER BY p.due_date ASC LIMIT 20;
Second, using a subquery:
SELECT p.name, p.due_date, (SELECT
COUNT(*) FROM tasks t WHERE t.project_id = p.id) as num_tasks
FROM projects p
ORDER BY p.due_date ASC LIMIT 20;
I'm using PostgreSQL 10, and I've got indices on projects.id, projects.due_date and tasks.project_id. Why does the first query using the GROUP BY clause do a full table scan while the second query makes proper use of the indices? It seems like these should compile down to the same thing.
Note that if I remove the GROUP BY and the COUNT(t.id) from the first query it will run quickly, just with lots of duplicate rows. So the problem is with the GROUP BY clause, not the JOIN. This seems like it's about the simplest GROUP BY one could do, so I'd like to understand if/how to make it more efficient before moving on to more complicated queries.
Edit — here's the result of EXPLAIN ANALYZE. First query:
Limit (cost=41919.58..41919.63 rows=20 width=53) (actual time=1046.762..1046.771 rows=20 loops=1)
-> Sort (cost=41919.58..42169.58 rows=100000 width=53) (actual time=1046.760..1046.765 rows=20 loops=1)
Sort Key: p.due_date
Sort Method: top-N heapsort Memory: 29kB
-> GroupAggregate (cost=0.71..39258.62 rows=100000 width=53) (actual time=0.109..1002.890 rows=100000 loops=1)
Group Key: p.id
-> Merge Left Join (cost=0.71..35758.62 rows=500000 width=49) (actual time=0.072..807.603 rows=500702 loops=1)
Merge Cond: (p.id = t.project_id)
-> Index Scan using projects_pkey on projects p (cost=0.29..3542.29 rows=100000 width=45) (actual time=0.025..38.363 rows=100000 loops=1)
-> Index Scan using project_id_idx on tasks t (cost=0.42..25716.33 rows=500000 width=8) (actual time=0.038..531.097 rows=500000 loops=1)
Planning Time: 0.573 ms
Execution Time: 1046.934 ms
Second query:
Limit (cost=0.29..92.61 rows=20 width=49) (actual time=0.079..0.443 rows=20 loops=1)
-> Index Scan using project_date_idx on projects p (cost=0.29..461594.09 rows=100000 width=49) (actual time=0.076..0.432 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=4.54..4.55 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=20)
-> Index Only Scan using project_id_idx on tasks t (cost=0.42..4.53 rows=6 width=0) (actual time=0.009..0.011 rows=5 loops=20)
Index Cond: (project_id = p.id)
Heap Fetches: 0
Planning Time: 0.284 ms
Execution Time: 0.551 ms
And if anyone wants to try to exactly reproduce this, here's my setup:
CREATE TABLE projects (
id serial NOT NULL PRIMARY KEY,
name varchar(100) NOT NULL,
due_date timestamp NOT NULL
);
CREATE TABLE tasks (
id serial NOT NULL PRIMARY KEY,
project_id integer NOT NULL,
data real NOT NULL
);
INSERT INTO projects (name, due_date) SELECT
md5(random()::text),
timestamp '2020-01-01 00:00:00' +
random() * (timestamp '2030-01-01 20:00:00' - timestamp '2020-01-01 10:00:00')
FROM generate_series(1, 100000);
INSERT INTO tasks (project_id, data)
SELECT CAST(1 + random()*99999 AS integer), random()
FROM generate_series(1, 500000);
CREATE INDEX project_date_idx ON projects ("due_date");
CREATE INDEX project_id_idx ON tasks ("project_id");
ALTER TABLE tasks ADD CONSTRAINT task_foreignkey FOREIGN KEY ("project_id") REFERENCES "projects" ("id") DEFERRABLE INITIALLY DEFERRED;

What is the fastest way in a growing table to know if at least one line is existing with two search (WHERE) criteria?

In pgsql, what's the fastest request to see if a table BLOG_POST is having column company_id=5 and status_id=3 at least once considering the table can grow?
I have many company using that table and they can have many entry, my end goal is too create a method named hasCompanyAlreadyPublishedABlogPost(companyId).
An EXISTS condition would do:
select exists (select *
from blog_post
where company_id = 5
and status_id = 3);
Obviously you want an index on blog_post(company_id, status_id)
To further improve the performance we can do this:
select exists (select id
from blog_post
where company_id = 5
and status_id = 3 limit 1);
Add limit in nested query
Instead of select * we can do select any_column
I have tried this with a sample database:
with id and limit:
explain analyze (select exists (select user_id from sample_table where sample_ids=4 and user_id=5 limit 1));
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Result (cost=8.17..8.18 rows=1 width=1) (actual time=0.014..0.014 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using sample_table_pkey on sample_table (cost=0.15..8.17 rows=1 width=0) (actual time=0.012..0.012 rows=1 loops=1)
Index Cond: (user_id = 5)
Filter: (sample_ids = 4)
Planning Time: 0.091 ms
Execution Time: 0.032 ms
without id and *
explain analyze (select exists (select * from sample_table where sample_ids=4 and user_id=5));
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Result (cost=8.17..8.18 rows=1 width=1) (actual time=0.014..0.014 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using sample_table_pkey on sample_table (cost=0.15..8.17 rows=1 width=0) (actual time=0.012..0.012 rows=1 loops=1)
Index Cond: (user_id = 5)
Filter: (sample_ids = 4)
Planning Time: 0.084 ms
Execution Time: 0.034 ms

possible to speed up this query?

I have the following query which takes a little too long to execute. I have posted the EXPLAIN ANALYZE for the query. Anything I can do to improve its speed?
EXPLAIN analyze SELECT c.*, match.user_json FROM match INNER JOIN conversation c
ON match.match_id = c.match_id WHERE c.from_id <> 142822281 AND c.to_id =
142822281 AND c.unix_timestamp = (SELECT max( unix_timestamp ) FROM conversation
WHERE match_id = c.match_id GROUP BY match_id)
EXPLAIN ANALYZE results
Nested Loop (cost=0.00..16183710.79 rows=2 width=805) (actual time=2455.133..2502.781 rows=34 loops=1)
Join Filter: (match.match_id = c.match_id)
Rows Removed by Join Filter: 71502
-> Seq Scan on match (cost=0.00..268.51 rows=2151 width=723) (actual time=0.006..4.973 rows=2104 loops=1)
-> Materialize (cost=0.00..16183377.75 rows=2 width=90) (actual time=0.034..1.168 rows=34 loops=2104)
-> Seq Scan on conversation c (cost=0.00..16183377.74 rows=2 width=90) (actual time=70.972..2421.949 rows=34 loops=1)
Filter: ((from_id <> 142822281) AND (to_id = 142822281) AND (unix_timestamp = (SubPlan 1)))
Rows Removed by Filter: 22010
SubPlan 1
-> GroupAggregate (cost=0.00..739.64 rows=10 width=16) (actual time=5.358..5.358 rows=1 loops=450)
Group Key: conversation.match_id
-> Seq Scan on conversation (cost=0.00..739.49 rows=10 width=16) (actual time=3.355..5.320 rows=17 loops=450)
Filter: (match_id = c.match_id)
Rows Removed by Filter: 22027
Planning Time: 1.132 ms
Execution Time: 2502.926 ms
This is your query:
SELECT c.*, m.user_json
FROM match m INNER JOIN
conversation c
ON m.match_id = c.match_id
WHERE c.from_id <> 142822281 AND
c.to_id = 142822281 AND
c.unix_timestamp = (SELECT max( c2.unix_timestamp )
FROM conversation c2
WHERE c2.match_id = c.match_id
GROUP BY c2.match_id
);
I would suggest writing it as:
SELECT DISTINCT ON (c.match_id) c.*, m.user_json
FROM match m INNER JOIN
conversation c
ON m.match_id = c.match_id
WHERE c.from_id <> 142822281 AND
c.to_id = 142822281 AND
ORDER BY c.match_id, c.unix_timestamp DESC;
Then try an index on: conversation(to_id, from_id, match_id). I assume you have an index on match(match_id).

Unpredictable query performance in Postgresql

I have tables like that in a Postgres 9.3 database:
A <1---n B n---1> C
Table A contains ~10^7 rows, table B is rather big with ~10^9 rows and C contains ~100 rows.
I use the following query to find all As (distinct) that match some criteria in B and C (the real query is more complex, joins more tables and checks more attributes within the subquery):
Query 1:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
C.Name = '00000015')
limit 200;
That query takes about 500ms (Note that C.Name = '00000015' exists in the table):
Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=427.799..465.485 rows=200 loops=1)
-> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=427.797..465.460 rows=200 loops=1)
Hash Cond: (a.id = b.aid)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..15.058 rows=133470 loops=1)
-> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=427.233..427.233 rows=190920 loops=1)
Buckets: 4096 Batches: 8 Memory Usage: 838kB
-> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.176..400.326 rows=190920 loops=1)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.015..0.030 rows=1 loops=1)
Filter: (name = '00000015'::text)
Rows Removed by Filter: 149
-> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (actual time=0.157..382.896 rows=190920 loops=1)
Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 476.173 ms
Query 2: Changing C.Name to something that doesn't exist (C.Name = 'foo') takes 0.1ms:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
C.Name = 'foo')
limit 200;
Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=0.063..0.063 rows=0 loops=1)
-> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=0.062..0.062 rows=0 loops=1)
Hash Cond: (a.id = b.aid)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..0.010 rows=1 loops=1)
-> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
Buckets: 4096 Batches: 8 Memory Usage: 0kB
-> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.037..0.037 rows=0 loops=1)
Filter: (name = 'foo'::text)
Rows Removed by Filter: 150
-> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (never executed)
Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 0.120 ms
Query 3: Resetting the C.Name to something that exists (like in the first query) and increasing the timestamp by 3 days uses another query plan than before, but is still fast (200ms):
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-15' and
C.Name = '00000015')
limit 200;
Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=4.404..227.569 rows=200 loops=1)
-> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=4.403..227.544 rows=200 loops=1)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..1.046 rows=12250 loops=1)
-> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.017..0.017 rows=0 loops=12250)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.005..0.015 rows=1 loops=12250)
Filter: (name = '00000015'::text)
Rows Removed by Filter: 147
-> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12250)
Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 227.632 ms
Query 4: But that new query plan utterly fails when searching for a C.Name that doesn't exist::
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-15' and
C.Name = 'foo')
limit 200;
Now it takes 170 seconds (vs. 0.1ms before!) to return the same 0 rows:
Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=170184.979..170184.979 rows=0 loops=1)
-> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=170184.977..170184.977 rows=0 loops=1)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..794.626 rows=12020034 loops=1)
-> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
Filter: (name = 'foo'::text)
Rows Removed by Filter: 150
-> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (never executed)
Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 170185.033 ms
All queries were run after "alter table set statistics" with a value of 10000 on all columns and after running analyze on the whole db.
Right now it looks like the slightest change of a parameter (not even of the SQL) can make Postgres choose a bad plan (0.1ms vs. 170s in this case!). I always try to check query plans when changing things, but it's hard to ever be sure that something will work when such small changes on parameters can make such huge differences. I have similar problems with other queries too.
What can I do to get more predictable results?
(I have tried modifying certain query planning parameters (set enable_... = on/off) and some different SQL statements - joining+distinct/group by instead of "exists" - but nothing seems to make postgres choose "stable" query plans while still providing acceptable performance).
Edit #1: Table + index definitions
test=# \d a
Tabelle äpublic.aô
Spalte | Typ | Attribute
--------+---------+----------------------------------------------------
id | integer | not null Vorgabewert nextval('a_id_seq'::regclass)
anr | integer |
snr | text |
Indexe:
"a_pkey" PRIMARY KEY, btree (id)
"anr_snr_index" UNIQUE, btree (anr, snr)
"anr_index" btree (anr)
Fremdschlnssel-Constraints:
"anr_fkey" FOREIGN KEY (anr) REFERENCES pt(id)
Fremdschlnsselverweise von:
TABLE "b" CONSTRAINT "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)
test=# \d b
Tabelle äpublic.bô
Spalte | Typ | Attribute
-----------+-----------------------------+-----------
id | uuid | not null
timestamp | timestamp without time zone |
cid | integer |
aid | integer |
prop1 | text |
propn | integer |
Indexe:
"b_pkey" PRIMARY KEY, btree (id)
"aid_cid" btree (aid, cid)
"cid_aid" btree (cid, aid, "timestamp")
"timestamp_index" btree ("timestamp")
Fremdschlnssel-Constraints:
"aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)
"cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)
test=# \d c
Tabelle äpublic.cô
Spalte | Typ | Attribute
--------+---------+----------------------------------------------------
id | integer | not null Vorgabewert nextval('c_id_seq'::regclass)
name | text |
Indexe:
"c_pkey" PRIMARY KEY, btree (id)
"c_name_index" UNIQUE, btree (name)
Fremdschlnsselverweise von:
TABLE "b" CONSTRAINT "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)
Your problem is that the query needs to evaluate the correlated sub query for the entire table a. When Postgres quickly finds 200 random rows that fit (which seems to occasionally be the case when c.name exists), it yields them accordingly, and reasonably fast if there are plenty to choose from. But when no such rows exists, it evaluates the entire hogwash in the exists() statement as many times as table a has rows, hence the performance issue you're seeing.
Adding an uncorrelated where clause will most certainly fix a number of edge cases:
and exists(select 1 from c where name = ?)
It might also work when you join the latter with b and write it as a cte:
with bc as (
select aid
from b join c on b.cid = c.bid
and b.timestamp between ? and ?
and c.name = ?
)
select a.id
from a
where exists (select 1 from bc)
and exists (select 1 from bc where a.id = bc.aid)
limit 200
If not, just toss in the bc query verbatim instead of using the cte. The point here is to force Postgres to consider the bc lookup as independent, and bail early if the resulting set yields no rows at all.
I assume your query is more complex in the end, but note that the above could be rewritten as:
with bc as (...)
select aid
from bc
limit 200
Or:
with bc as (...)
select a.id
from a
where a.id in (select aid from bc)
limit 200
Both should yield better plans in edge cases.
(Side note: it's usually unadvisable to limit without ordering.)
Maybe try to rewrite query with CTE?
with BC as (
select distinct B.AId from B where
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
B.CId in (select C.Id from C where C.Name = '00000015')
limit 200
)
select A.SNr from A where A.Id in (select AId from BC)
If I understand correctly - limit could be easily put inside BC query to avoid scan on table A.

Optimizing postgres query

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=32164.87..32164.89 rows=1 width=44) (actual time=221552.831..221552.831 rows=0 loops=1)
-> Sort (cost=32164.87..32164.87 rows=1 width=44) (actual time=221552.827..221552.827 rows=0 loops=1)
Sort Key: t.date_effective, t.acct_account_transaction_id, p.method, t.amount, c.business_name, t.amount
-> Nested Loop (cost=22871.67..32164.86 rows=1 width=44) (actual time=221552.808..221552.808 rows=0 loops=1)
-> Nested Loop (cost=22871.67..32160.37 rows=1 width=52) (actual time=221431.071..221546.619 rows=670 loops=1)
-> Nested Loop (cost=22871.67..32157.33 rows=1 width=43) (actual time=221421.218..221525.056 rows=2571 loops=1)
-> Hash Join (cost=22871.67..32152.80 rows=1 width=16) (actual time=221307.382..221491.019 rows=2593 loops=1)
Hash Cond: ("outer".acct_account_id = "inner".acct_account_fk)
-> Seq Scan on acct_account a (cost=0.00..7456.08 rows=365008 width=8) (actual time=0.032..118.369 rows=61295 loops=1)
-> Hash (cost=22871.67..22871.67 rows=1 width=16) (actual time=221286.733..221286.733 rows=2593 loops=1)
-> Nested Loop Left Join (cost=0.00..22871.67 rows=1 width=16) (actual time=1025.396..221266.357 rows=2593 loops=1)
Join Filter: ("inner".orig_acct_payment_fk = "outer".acct_account_transaction_id)
Filter: ("inner".link_type IS NULL)
-> Seq Scan on acct_account_transaction t (cost=0.00..18222.98 rows=1 width=16) (actual time=949.081..976.432 rows=2596 loops=1)
Filter: ((("type")::text = 'debit'::text) AND ((transaction_status)::text = 'active'::text) AND (date_effective >= '2012-03-01'::date) AND (date_effective < '2012-04-01 00:00:00'::timestamp without time zone))
-> Seq Scan on acct_payment_link l (cost=0.00..4648.68 rows=1 width=15) (actual time=1.073..84.610 rows=169 loops=2596)
Filter: ((link_type)::text ~~ 'return_%'::text)
-> Index Scan using contact_pk on contact c (cost=0.00..4.52 rows=1 width=27) (actual time=0.007..0.008 rows=1 loops=2593)
Index Cond: (c.contact_id = "outer".contact_fk)
-> Index Scan using acct_payment_transaction_fk on acct_payment p (cost=0.00..3.02 rows=1 width=13) (actual time=0.005..0.005 rows=0 loops=2571)
Index Cond: (p.acct_account_transaction_fk = "outer".acct_account_transaction_id)
Filter: ((method)::text <> 'trade'::text)
-> Index Scan using contact_role_pk on contact_role (cost=0.00..4.48 rows=1 width=4) (actual time=0.007..0.007 rows=0 loops=670)
Index Cond: ("outer".contact_id = contact_role.contact_fk)
Filter: (exchange_fk = 74)
Total runtime: 221553.019 ms
Your problem is here:
-> Nested Loop Left Join (cost=0.00..22871.67 rows=1 width=16) (actual time=1025.396..221266.357 rows=2593 loops=1)
Join Filter: ("inner".orig_acct_payment_fk = "outer".acct_account_transaction_id)
Filter: ("inner".link_type IS NULL)
-> Seq Scan on acct_account_transaction t (cost=0.00..18222.98 rows=1 width=16) (actual time=949.081..976.432 rows=2596 loops=1)
Filter: ((("type")::text = 'debit'::text) AND ((transaction_status)::text = 'active'::text) AND (date_effective >= '2012-03-01'::date) AND (date_effective
Seq Scan on acct_payment_link l (cost=0.00..4648.68 rows=1 width=15) (actual time=1.073..84.610 rows=169 loops=2596)
Filter: ((link_type)::text ~~ 'return_%'::text)
It expects to find 1 row in acct_account_transaction, while it finds 2596, and similarly for the other table.
You did not mention Your postgres version (could You?), but this should do the trick:
SELECT DISTINCT
t.date_effective,
t.acct_account_transaction_id,
p.method,
t.amount,
c.business_name,
t.amount
FROM
contact c inner join contact_role on (c.contact_id=contact_role.contact_fk and contact_role.exchange_fk=74),
acct_account a, acct_payment p,
acct_account_transaction t
WHERE
p.acct_account_transaction_fk=t.acct_account_transaction_id
and t.type = 'debit'
and transaction_status = 'active'
and p.method != 'trade'
and t.date_effective >= '2012-03-01'
and t.date_effective < (date '2012-03-01' + interval '1 month')
and c.contact_id=a.contact_fk and a.acct_account_id = t.acct_account_fk
and not exists(
select * from acct_payment_link l
where orig_acct_payment_fk == acct_account_transaction_id
and link_type like 'return_%'
)
ORDER BY
t.date_effective DESC
Also, try setting appropriate statistics target for relevant columns. Link to the friendly manual: http://www.postgresql.org/docs/current/static/sql-altertable.html
What are your indexes, and have you analysed lately? It's doing a table scan on acct_account_transaction even though there are several criteria on that table:
type
date_effective
If there are no indexes on those columns, then a compound one one (type, date_effective) could help (assuming there are lots of rows that don't meet the criteria on those columns).
I remove my first suggestion, as it changes the nature of the query.
I see that there's too much time spent in the LEFT JOIN.
First thing is to try to make only a single scan of the acct_payment_link table. Could you try rewriting your query to:
... LEFT JOIN (SELECT * FROM acct_payment_link
WHERE link_type LIKE 'return_%') AS l ...
You should check your statistics, as there's difference between planned and returned numbers of rows.
You haven't included tables' and indexes' definitions, it'd be good to take a look on those.
You might also want to use contrib/pg_tgrm extension to build index on the acct_payment_link.link_type, but I would make this a last option to try out.
BTW, what is the PostgreSQL version you're using?
Your statement rewritten and formatted:
SELECT DISTINCT
t.date_effective,
t.acct_account_transaction_id,
p.method,
t.amount,
c.business_name,
t.amount
FROM contact c
JOIN contact_role cr ON cr.contact_fk = c.contact_id
JOIN acct_account a ON a.contact_fk = c.contact_id
JOIN acct_account_transaction t ON t.acct_account_fk = a.acct_account_id
JOIN acct_payment p ON p.acct_account_transaction_fk
= t.acct_account_transaction_id
LEFT JOIN acct_payment_link l ON orig_acct_payment_fk
= acct_account_transaction_id
-- missing table-qualification!
AND link_type like 'return_%'
-- missing table-qualification!
WHERE transaction_status = 'active' -- missing table-qualification!
AND cr.exchange_fk = 74
AND t.type = 'debit'
AND t.date_effective >= '2012-03-01'
AND t.date_effective < (date '2012-03-01' + interval '1 month')
AND p.method != 'trade'
AND l.link_type IS NULL
ORDER BY t.date_effective DESC;
Explicit JOIN statements are preferable. I reordered your tables according to your JOIN logic.
Why (date '2012-03-01' + interval '1 month') instead of date '2012-04-01'?
Some table qualifications are missing. In a complex statement like this that's bad style. May be hiding a mistake.
The key to performance are indexes where appropriate, proper configuration of PostgreSQL and accurate statistics.
General advice on performance tuning in the PostgreSQL wiki.