Query plan & performance variations on another machine

I'm not a PostgreSQL expert, and I've been struggling with this.
I have a rather simple query:
SELECT exists(SELECT 1 FROM res_users_log WHERE create_uid=u.id), count(1)
FROM res_users u
WHERE active=true
GROUP BY 1
Basically it counts the number of active users that have a log entry. Both tables are relatively big (~600k records each) and have indexes on their ids.
This query executes in ~500ms on our server, but completely hangs on my machine (same PostgreSQL version, 9.3). My database is a restore of the server's dump, so the indexes were rebuilt on import.
When I do an EXPLAIN ANALYZE of the query, I get different results on the server and on my machine.
Locally I get
HashAggregate (cost=78496.43..88302.28 rows=2 width=4) (actual time=518.003..518.003 rows=1 loops=1)
-> Index Scan using res_users_pkey on res_users u (cost=0.42..78496.35 rows=16 width=4) (actual time=51.393..517.969 rows=11 loops=1)
Index Cond: (id < 20)
Filter: active
Rows Removed by Filter: 7
SubPlan 1
-> Seq Scan on res_users_log (cost=0.00..9805.83 rows=2 width=0) (actual time=47.078..47.078 rows=1 loops=11)
Filter: (create_uid = u.id)
Rows Removed by Filter: 516910
Total runtime: 518.034 ms
(10 rows)
(I had to add id < 20 to get the query to finish at all)
and on the server I get
HashAggregate (cost=5389666981.78..5389687409.80 rows=2 width=4) (actual time=532.664..532.665 rows=2 loops=1)
-> Seq Scan on res_users u (cost=0.00..5389664343.42 rows=527672 width=4) (actual time=256.169..467.829 rows=527661 loops=1)
Filter: active
Rows Removed by Filter: 381
SubPlan 1
-> Seq Scan on res_users_log (cost=0.00..10214.00 rows=1 width=0) (never executed)
Filter: (create_uid = u.id)
SubPlan 2
-> Seq Scan on res_users_log res_users_log_1 (cost=0.00..8800.60 rows=565360 width=4) (actual time=0.006..45.697 rows=547108 loops=1)
Total runtime: 532.757 ms
(10 rows)
I've been trying to determine why the query plans are different (and I don't understand the SubPlan 2 entry), and what could possibly make this query take more than 2 hours (I killed it after that) on my laptop...
I've vacuumed both tables without any noticeable difference.
Any ideas of what could make this hang like that?

If you want to count the active users that have a log entry, I would expect:
SELECT count(*)
FROM res_users u
WHERE u.active = true and
exists (SELECT 1 FROM res_users_log l WHERE l.create_uid = u.id);
Then indexes on res_users(active, id) and res_users_log(create_uid) would be optimal.
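For illustration, those two indexes could be created like this (the index names are mine, not from the post):
CREATE INDEX res_users_active_id_idx ON res_users (active, id);
CREATE INDEX res_users_log_create_uid_idx ON res_users_log (create_uid);
The second one is the important piece for the plans shown above: with an index on res_users_log(create_uid), the EXISTS subquery becomes an index probe per user instead of a sequential scan that removes ~517k rows on every loop.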

Related

Can I optimize this query with an index?

given this query:
SELECT count(u.*)
FROM res_users u
WHERE active=true AND
share=false AND
NOT exists(SELECT 1 FROM res_users_log WHERE create_uid=u.id);
It currently takes 10 seconds.
I tried to make it faster with these two index commands, but they didn't help.
CREATE INDEX CONCURRENTLY id_active_share_index ON res_users (id,active,share);
CREATE INDEX CONCURRENTLY create_uid_index ON res_users_log (create_uid);
I guess it's because of the "NOT exists" line, but I have no idea how to include it into an index.
EXPLAIN (ANALYZE, BUFFERS) gives me this output:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2815437.14..2815437.15 rows=1 width=8) (actual time=39174.365..39174.367 rows=1 loops=1)
Buffers: shared hit=124 read=112875 dirtied=70, temp read=98788 written=99211
-> Merge Anti Join (cost=2678572.70..2815437.09 rows=20 width=1064) (actual time=39174.360..39174.361 rows=0 loops=1)
Merge Cond: (u.id = res_users_log.create_uid)
Buffers: shared hit=124 read=112875 dirtied=70, temp read=98788 written=99211
-> Sort (cost=11.92..11.97 rows=20 width=1068) (actual time=5.577..5.602 rows=16 loops=1)
Sort Key: u.id
Sort Method: quicksort Memory: 79kB
Buffers: shared hit=53 read=5
-> Seq Scan on res_users u (cost=0.00..11.49 rows=20 width=1068) (actual time=0.050..5.519 rows=16 loops=1)
Filter: (active AND (NOT share))
Rows Removed by Filter: 33
Buffers: shared hit=49 read=5
-> Sort (cost=2678560.78..2716236.90 rows=15070449 width=4) (actual time=36258.796..38013.471 rows=15069209 loops=1)
Sort Key: res_users_log.create_uid
Sort Method: external merge Disk: 206464kB
Buffers: shared hit=71 read=112870 dirtied=70, temp read=98788 written=99211
-> Seq Scan on res_users_log (cost=0.00..263645.49 rows=15070449 width=4) (actual time=1.755..29961.086 rows=15069319 loops=1)
Buffers: shared hit=71 read=112870 dirtied=70
Planning Time: 0.889 ms
Execution Time: 39202.694 ms
(21 rows)
For this query:
SELECT count(*)
FROM res_users u
WHERE active = true AND
share = false AND
NOT exists (SELECT 1 FROM res_users_log rul WHERE rul.create_uid = u.id);
You want indexes on:
res_users(active, share, id)
res_users_log(create_uid)
Note that the ordering of the columns matters.
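For illustration, the composite index on res_users could be created like this (the name is mine; it deliberately differs from the existing id_active_share_index, whose leading column is id):
CREATE INDEX CONCURRENTLY active_share_id_index ON res_users (active, share, id);
With active and share as the leading columns, the filter active AND (NOT share) can be satisfied from the index; the existing index starting with id is of little use for that filter.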
This index will make the query fast as lightning:
CREATE INDEX ON res_users_log (create_uid);

Why is Postgres query planner affected by LIMIT?

EXPLAIN ANALYZE SELECT "alerts"."id",
"alerts"."created_at",
't1'::text AS src_table
FROM "alerts"
INNER JOIN "devices"
ON "devices"."id" = "alerts"."device_id"
INNER JOIN "sites"
ON "sites"."id" = "devices"."site_id"
WHERE "sites"."cloud_id" = 111
AND "alerts"."created_at" >= '2019-08-30'
ORDER BY "created_at" DESC limit 9;
Limit (cost=1.15..36021.60 rows=9 width=16) (actual time=30.505..29495.765 rows=9 loops=1)
-> Nested Loop (cost=1.15..232132.92 rows=58 width=16) (actual time=30.504..29495.755 rows=9 loops=1)
-> Nested Loop (cost=0.86..213766.42 rows=57231 width=24) (actual time=0.029..29086.323 rows=88858 loops=1)
-> Index Scan Backward using alerts_created_at_index on alerts (cost=0.43..85542.16 rows=57231 width=24) (actual time=0.014..88.137 rows=88858 loops=1)
Index Cond: (created_at >= '2019-08-30 00:00:00'::timestamp without time zone)
-> Index Scan using devices_pkey on devices (cost=0.43..2.23 rows=1 width=16) (actual time=0.016..0.325 rows=1 loops=88858)
Index Cond: (id = alerts.device_id)
-> Index Scan using sites_pkey on sites (cost=0.29..0.31 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=88858)
Index Cond: (id = devices.site_id)
Filter: (cloud_id = 7231)
Rows Removed by Filter: 1
Total runtime: 29495.816 ms
Now we change to LIMIT 10:
EXPLAIN ANALYZE SELECT "alerts"."id",
"alerts"."created_at",
't1'::text AS src_table
FROM "alerts"
INNER JOIN "devices"
ON "devices"."id" = "alerts"."device_id"
INNER JOIN "sites"
ON "sites"."id" = "devices"."site_id"
WHERE "sites"."cloud_id" = 111
AND "alerts"."created_at" >= '2019-08-30'
ORDER BY "created_at" DESC limit 10;
Limit (cost=39521.79..39521.81 rows=10 width=16) (actual time=1.557..1.559 rows=10 loops=1)
-> Sort (cost=39521.79..39521.93 rows=58 width=16) (actual time=1.555..1.555 rows=10 loops=1)
Sort Key: alerts.created_at
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=5.24..39520.53 rows=58 width=16) (actual time=0.150..1.543 rows=11 loops=1)
-> Nested Loop (cost=4.81..16030.12 rows=2212 width=8) (actual time=0.137..0.643 rows=31 loops=1)
-> Index Scan using sites_cloud_id_index on sites (cost=0.29..64.53 rows=31 width=8) (actual time=0.014..0.057 rows=23 loops=1)
Index Cond: (cloud_id = 7231)
-> Bitmap Heap Scan on devices (cost=4.52..512.32 rows=270 width=16) (actual time=0.020..0.025 rows=1 loops=23)
Recheck Cond: (site_id = sites.id)
-> Bitmap Index Scan on devices_site_id_index (cost=0.00..4.46 rows=270 width=0) (actual time=0.006..0.006 rows=9 loops=23)
Index Cond: (site_id = sites.id)
-> Index Scan using alerts_device_id_index on alerts (cost=0.43..10.59 rows=3 width=24) (actual time=0.024..0.028 rows=0 loops=31)
Index Cond: (device_id = devices.id)
Filter: (created_at >= '2019-08-30 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 12
Total runtime: 1.603 ms
alerts table has millions of records, other tables are counted in thousands.
I can already optimize the query by simply not using LIMIT values below 10. What I don't understand is why the LIMIT affects the performance at all. Perhaps there's a better way than hardcoding this magic number "10".
The number of result rows affects the PostgreSQL optimizer, because plans that return the first few rows quickly are not necessarily plans that return the whole result as fast as possible.
In your case, PostgreSQL thinks that for small values of LIMIT, it will be faster by scanning the alerts table in the order of the ORDER BY clause using an index and just join the other tables using a nested loop until it has found 9 rows.
The benefit of such a strategy is that it doesn't have to calculate the complete result of the join, then sort it and throw away all but the first few result rows.
The danger is that it takes longer than expected to find the 9 matching rows, and this is what hits you:
Index Scan Backward using alerts_created_at_index on alerts (cost=0.43..85542.16 rows=57231 width=24) (actual time=0.014..88.137 rows=88858 loops=1)
So PostgreSQL has to process 88858 rows and use a nested loop join (which is inefficient if it has to loop often) until it finds 9 result rows. This may be because it underestimates the selectivity of the conditions, or because the many matching rows all happen to have low created_at.
The number 10 just happens to be the cut-off point where PostgreSQL estimates that this strategy will no longer be more efficient; it is a value that will change as the data in the database change.
You can avoid using that plan altogether by using an ORDER BY clause that does not match the index:
ORDER BY (created_at + INTERVAL '0 days') DESC
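Applied to the query above, that workaround might look like this (a sketch; the added interval does not change the ordering, it only stops the planner from matching the ORDER BY to alerts_created_at_index):
SELECT "alerts"."id",
       "alerts"."created_at",
       't1'::text AS src_table
FROM "alerts"
INNER JOIN "devices"
        ON "devices"."id" = "alerts"."device_id"
INNER JOIN "sites"
        ON "sites"."id" = "devices"."site_id"
WHERE "sites"."cloud_id" = 111
  AND "alerts"."created_at" >= '2019-08-30'
ORDER BY ("alerts"."created_at" + INTERVAL '0 days') DESC
LIMIT 9;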

Postgres similar query taking different time, not sure what's wrong

We have a query that runs on several child tables created hourly and inherited from base tables, say tab_name1 and tab_name2. The query was working fine, but it suddenly started to perform badly on all child tables created after a particular date.
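For context, the hourly child-table layout described above looks roughly like this (a sketch; the base-table names are inferred from the child names in the plans below):
-- one child table per hour, inheriting all columns from its base table
CREATE TABLE tab_name_201806220300  () INHERITS (tab_name);
CREATE TABLE tab_name2_201806220300 () INHERITS (tab_name2);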
This is the query, which works fine up to the tab_name_20180621* tables; I'm not sure what happened after that.
SELECT
*
FROM
tab_name_201806220300
WHERE
id = 201806220300
AND col1 IN (
SELECT
col1
FROM
tab_name2_201806220300
WHERE
uid = 5452
AND id = 201806220300
);
The EXPLAIN ANALYZE output looks like this; there's a huge difference in execution time.
#1
Nested Loop Semi Join (cost=0.00..84762.11 rows=1 width=937) (actual time=117.599..117.599 rows=0 loops=1)
Join Filter: (tab_name_201806210100.col1 = tab_name2_201806210100.col1)
-> Seq Scan on tab_name_201806210100 (cost=0.00..31603.56 rows=1 width=937) (actual time=117.596..117.596 rows=0 loops=1)
Filter: (log_id = '201806220100'::bigint)
Rows Removed by Filter: 434045
-> Materialize (cost=0.00..53136.74 rows=1454 width=41) (never executed)
-> Seq Scan on tab_name2_201806210100 (cost=0.00..53129.47 rows=1454 width=41) (never executed)
Filter: ((uid = 5452) AND (log_id = '201806210100'::bigint))
Planning time: 1.490 ms
Execution time: 117.723 ms
#2
Nested Loop Semi Join (cost=0.00..10299.31 rows=48 width=1476) (actual time=1082.255..47041.945 rows=671 loops=1)
Join Filter: (tab_name_201806220100.col1 = tab_name2_201806220100.col1)
Rows Removed by Join Filter: 252444174
-> Seq Scan on tab_name_201806220100 (cost=0.00..4023.69 rows=95 width=1476) (actual time=0.008..36.292 rows=64153 loops=1)
Filter: (log_id = '201806220100'::bigint)
-> Materialize (cost=0.00..6274.19 rows=1 width=32) (actual time=0.000..0.264 rows=3935 loops=64153)
-> Seq Scan on tab_name2_201806220100 (cost=0.00..6274.19 rows=1 width=32) (actual time=0.464..55.475 rows=3960 loops=1)
Filter: ((log_id = '201806220100'::bigint) AND (uid = 5452))
Rows Removed by Filter: 140592
Planning time: 1.024 ms
Execution time: 47042.234 ms
I don't know what to infer from this and how to proceed from here. Could you help me?

PL/pgSQL Query Plan Worse Inside Function Than Outside

I have a function that is running too slowly. I've isolated the piece of the function that is slow: a small SELECT statement:
SELECT image_group_id
FROM programs.image_family fam
JOIN programs.provider_file pf
ON (fam.provider_data_id = pf.provider_data_id
AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL)
LIMIT 1
When I run the function this piece of SQL generates the following query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=0.56..6.75 rows=1 width=6) (actual time=3471.004..3471.004 rows=0 loops=1)
-> Nested Loop (cost=0.56..594054.42 rows=96017 width=6) (actual time=3471.002..3471.002 rows=0 loops=1)
-> Seq Scan on image_family fam (cost=0.00..391880.08 rows=96023 width=6) (actual time=3471.001..3471.001 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
When I run the selected query in a query tool (outside of the function) the query plan looks like this:
Limit (cost=1.12..3.81 rows=1 width=6) (actual time=0.043..0.043 rows=1 loops=1)
Output: pf.image_group_id
Buffers: shared hit=11
-> Nested Loop (cost=1.12..14.55 rows=5 width=6) (actual time=0.041..0.041 rows=1 loops=1)
Output: pf.image_group_id
Inner Unique: true
Buffers: shared hit=11
-> Index Only Scan using image_family_family_id_provider_data_id_idx on programs.image_family fam (cost=0.56..1.65 rows=5 width=6) (actual time=0.024..0.024 rows=1 loops=1)
Output: fam.family_id, fam.provider_data_id
Index Cond: (fam.family_id = 8419853)
Heap Fetches: 2
Buffers: shared hit=6
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on programs.provider_file pf (cost=0.56..2.58 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
Output: pf.provider_data_id, pf.provider_file_path, pf.posted_dt, pf.file_repository_id, pf.restricted_size, pf.image_group_id, pf.is_master, pf.is_biggest
Index Cond: (pf.provider_data_id = fam.provider_data_id)
Filter: (pf.image_group_id IS NOT NULL)
Buffers: shared hit=5
Planning time: 0.809 ms
Execution time: 0.100 ms
If I disable sequence scans in the function I can get a similar query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=1.12..8.00 rows=1 width=6) (actual time=3855.722..3855.722 rows=0 loops=1)
-> Nested Loop (cost=1.12..660217.34 rows=96017 width=6) (actual time=3855.721..3855.721 rows=0 loops=1)
-> Index Only Scan using image_family_family_id_provider_data_id_idx on image_family fam (cost=0.56..458043.00 rows=96023 width=6) (actual time=3855.720..3855.720 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
Heap Fetches: 368
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
The query plans differ in the filter applied to the Index Only Scan: the plan from inside the function has more heap fetches and seems to treat the argument as a string cast to numeric.
Things I've tried:
Increasing statistics (and running VACUUM/ANALYZE)
Calling the problematic piece of SQL from another function with LANGUAGE sql
Adding another index (the one it is now using to perform the Index Only Scan)
Creating a CTE for the image_family table (this did help performance, but it still did a sequential scan on image_family instead of using the index, so it was still too slow)
Changing from executing raw SQL to using an EXECUTE ... INTO ... USING in the function (see the sketch below)
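For reference, the dynamic-SQL variant from the last bullet looks roughly like this inside the function (a sketch; the variable names v_image_group_id and p_family_id are assumed, not from the original code):
EXECUTE 'SELECT pf.image_group_id
           FROM programs.image_family fam
           JOIN programs.provider_file pf
             ON fam.provider_data_id = pf.provider_data_id
          WHERE fam.family_id = $1
            AND pf.image_group_id IS NOT NULL
          LIMIT 1'
  INTO v_image_group_id   -- local variable receiving the result
  USING p_family_id;      -- function parameter bound to $1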
Makeup of the two tables:
image_family:
provider_data_id: numeric(16)
family_id: int4
(rest omitted for brevity)
unique index on provider_data_id
index on family_id
I recently added a unique index on (family_id, provider_data_id) as well
Approximately 20 million rows here. Families have many provider_data_ids but not all provider_data_ids are part of families and thus aren't all in this table.
provider_file:
provider_data_id numeric(16)
image_group_id numeric(16)
(rest omitted for brevity)
unique index on provider_data_id
Approximately 32 million rows in this table. Most rows (> 95%) have a Non-Null image_group_id.
Postgres Version 10
How can I get the query performance to match whether I call it from a function or as raw SQL in a query tool?
The problem is exhibited in this line:
Filter: ((family_id)::numeric = '8419853'::numeric)
The index on family_id cannot be used because family_id is compared to a numeric value. This requires a cast to numeric, and there is no index on family_id::numeric.
Even though integer and numeric both are types representing numbers, their internal representation is quite different, and so the indexes are incompatible. In other words, the cast to numeric is like a function for PostgreSQL, and since it has no index on that functional expression, it has to resort to a scan of the whole table (or index).
The solution is simple, however: use an integer instead of a numeric parameter for the query. If in doubt, use a cast like
fam.family_id = $1::integer
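Applied to the statement inside the function, the fix is just the cast on the parameter (a sketch of the original query with only that change):
SELECT image_group_id
FROM programs.image_family fam
JOIN programs.provider_file pf
  ON (fam.provider_data_id = pf.provider_data_id
      AND fam.family_id = $1::integer   -- integer comparison, so the index on family_id can be used
      AND pf.image_group_id IS NOT NULL)
LIMIT 1
With the cast, the filter becomes family_id = <integer value> instead of (family_id)::numeric = <numeric value>, so the index on family_id (or the new (family_id, provider_data_id) index) is usable again.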

How can I optimize this really slow query generated by Django?

Here's my Django ORM query:
Group.objects.filter(public = True)\
.annotate(num_members = Count('members', distinct = True))\
.annotate(num_images = Count('images', distinct = True))\
.order_by(sort)
Unfortunately this is taking over 30 seconds even with just a few dozen Groups. Removing the annotate statements makes the query significantly faster at only 3 ms...
My database backend is Postgres and here's the SQL and explain:
Executed SQL
SELECT ••• FROM "astrobin_apps_groups_group"
LEFT OUTER JOIN "astrobin_apps_groups_group_members" ON (
"astrobin_apps_groups_group"."id" = "astrobin_apps_groups_group_members"."group_id"
)
LEFT OUTER JOIN "astrobin_apps_groups_group_images" ON (
"astrobin_apps_groups_group"."id" = "astrobin_apps_groups_group_images"."group_id")
WHERE "astrobin_apps_groups_group"."public" = true
GROUP BY
"astrobin_apps_groups_group"."id",
"astrobin_apps_groups_group"."date_created",
"astrobin_apps_groups_group"."date_updated",
"astrobin_apps_groups_group"."creator_id",
"astrobin_apps_groups_group"."owner_id",
"astrobin_apps_groups_group"."name",
"astrobin_apps_groups_group"."description",
"astrobin_apps_groups_group"."category",
"astrobin_apps_groups_group"."public",
"astrobin_apps_groups_group"."moderated",
"astrobin_apps_groups_group"."autosubmission",
"astrobin_apps_groups_group"."forum_id"
ORDER BY "astrobin_apps_groups_group"."date_updated" ASC
Time
30455.9268951 ms
QUERY PLAN
GroupAggregate (cost=5910.49..8288.54 rows=216 width=242) (actual time=29255.329..30269.284 rows=27 loops=1)
-> Sort (cost=5910.49..6068.88 rows=63357 width=242) (actual time=29253.278..29788.601 rows=201888 loops=1)
Sort Key: astrobin_apps_groups_group.date_updated, astrobin_apps_groups_group.id, astrobin_apps_groups_group.date_created, astrobin_apps_groups_group.creator_id, astrobin_apps_groups_group.owner_id, astrobin_apps_groups_group.name, astrobin_apps_groups_group.description, astrobin_apps_groups_group.category, astrobin_apps_groups_group.public, astrobin_apps_groups_group.moderated, astrobin_apps_groups_group.autosubmission, astrobin_apps_groups_group.forum_id
Sort Method: external merge Disk: 70176kB
-> Hash Right Join (cost=15.69..857.39 rows=63357 width=242) (actual time=1.903..397.613 rows=201888 loops=1)
Hash Cond: (astrobin_apps_groups_group_images.group_id = astrobin_apps_groups_group.id)
-> Seq Scan on astrobin_apps_groups_group_images (cost=0.00..106.05 rows=6805 width=8) (actual time=0.024..12.510 rows=6837 loops=1)
-> Hash (cost=12.31..12.31 rows=270 width=238) (actual time=1.853..1.853 rows=323 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 85kB
-> Hash Right Join (cost=3.63..12.31 rows=270 width=238) (actual time=0.133..1.252 rows=323 loops=1)
Hash Cond: (astrobin_apps_groups_group_members.group_id = astrobin_apps_groups_group.id)
-> Seq Scan on astrobin_apps_groups_group_members (cost=0.00..4.90 rows=290 width=8) (actual time=0.004..0.348 rows=333 loops=1)
-> Hash (cost=3.29..3.29 rows=27 width=234) (actual time=0.103..0.103 rows=27 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 7kB
-> Seq Scan on astrobin_apps_groups_group (cost=0.00..3.29 rows=27 width=234) (actual time=0.004..0.049 rows=27 loops=1)
Filter: public
Total runtime: 30300.606 ms
It would be great if somebody could suggest a way to optimize this. I feel like I'm missing some really low-hanging fruit.
Thanks!
What indexes are present on the astrobin_apps_groups_group, astrobin_apps_groups_group_members, and astrobin_apps_groups_group_images tables?
Are any aggregate functions like SUM or COUNT used in your SELECT? If not, you can remove all the columns from the GROUP BY.
The plan shows that most of the time is spent sorting. If you create an index on the date_updated field with NULLS LAST, with the latest values first in the index, the planner may be able to use it for the sort.
The sort is also spilling to disk, which is the most costly part; the data collected for sorting does not fit in memory. Try increasing work_mem, e.g. set work_mem='10MB'; before running the SELECT.
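For illustration, the two suggestions could look like this (the index name is mine; 10MB is only a starting value to experiment with, and the default btree ordering of ASC NULLS LAST already matches the ORDER BY date_updated ASC in this query):
-- an index on date_updated that the planner may be able to use for the ordering
CREATE INDEX astrobin_groups_date_updated_idx
    ON astrobin_apps_groups_group (date_updated);
-- give sorts more memory for the current session
SET work_mem = '10MB';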