Given this query:
SELECT count(u.*)
FROM res_users u
WHERE active=true AND
share=false AND
NOT exists(SELECT 1 FROM res_users_log WHERE create_uid=u.id);
It currently takes 10 seconds.
I tried to make it faster with these 2 index commands, but it didn't help.
CREATE INDEX CONCURRENTLY id_active_share_index ON res_users (id,active,share);
CREATE INDEX CONCURRENTLY create_uid_index ON res_users_log (create_uid);
I suspect the NOT EXISTS condition is the cause, but I have no idea how to cover it with an index.
EXPLAIN (ANALYZE, BUFFERS) gives me this output:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2815437.14..2815437.15 rows=1 width=8) (actual time=39174.365..39174.367 rows=1 loops=1)
Buffers: shared hit=124 read=112875 dirtied=70, temp read=98788 written=99211
-> Merge Anti Join (cost=2678572.70..2815437.09 rows=20 width=1064) (actual time=39174.360..39174.361 rows=0 loops=1)
Merge Cond: (u.id = res_users_log.create_uid)
Buffers: shared hit=124 read=112875 dirtied=70, temp read=98788 written=99211
-> Sort (cost=11.92..11.97 rows=20 width=1068) (actual time=5.577..5.602 rows=16 loops=1)
Sort Key: u.id
Sort Method: quicksort Memory: 79kB
Buffers: shared hit=53 read=5
-> Seq Scan on res_users u (cost=0.00..11.49 rows=20 width=1068) (actual time=0.050..5.519 rows=16 loops=1)
Filter: (active AND (NOT share))
Rows Removed by Filter: 33
Buffers: shared hit=49 read=5
-> Sort (cost=2678560.78..2716236.90 rows=15070449 width=4) (actual time=36258.796..38013.471 rows=15069209 loops=1)
Sort Key: res_users_log.create_uid
Sort Method: external merge Disk: 206464kB
Buffers: shared hit=71 read=112870 dirtied=70, temp read=98788 written=99211
-> Seq Scan on res_users_log (cost=0.00..263645.49 rows=15070449 width=4) (actual time=1.755..29961.086 rows=15069319 loops=1)
Buffers: shared hit=71 read=112870 dirtied=70
Planning Time: 0.889 ms
Execution Time: 39202.694 ms
(21 rows)
For this query:
SELECT count(*)
FROM res_users u
WHERE active = true AND
share = false AND
NOT exists (SELECT 1 FROM res_users_log rul WHERE rul.create_uid = u.id);
You want indexes on:
res_users(active, share, id)
res_users_log(create_uid)
Note that the ordering of the columns matters.
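As a sketch, the two indexes could be created as follows. The index names are made up for illustration, and CONCURRENTLY is only carried over from the commands in the question. If create_uid_index from the question already exists and is valid, the second statement is redundant.

```sql
-- Equality-filtered columns first, then the join key.
-- Index names are hypothetical.
CREATE INDEX CONCURRENTLY res_users_active_share_id_idx
    ON res_users (active, share, id);

-- Lets the NOT EXISTS subquery probe res_users_log by create_uid
-- instead of scanning and sorting the whole table.
CREATE INDEX CONCURRENTLY res_users_log_create_uid_idx
    ON res_users_log (create_uid);
```

A CREATE INDEX CONCURRENTLY that fails partway leaves an invalid index behind, so it is worth checking that the index on res_users_log is actually valid before re-running EXPLAIN.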
This index will make the query fast as lightning:
CREATE INDEX ON res_users_log (create_uid);
I have the following SQL
WITH filtered_users_pre as (
SELECT value as username,row_number() OVER (partition by value) AS rk
FROM "user-stats".tag_table
WHERE _at_timestamp = 1626955200
AND tag in ('commercial','marketing')
),
filtered_users as (
SELECT username
FROM filtered_users_pre
WHERE rk = 2
),
valid_users as (
SELECT aa.username, aa.rank, aa.points, aa.version
FROM "users-results".ai_algo aa
WHERE aa._at_timestamp = 1626955200
AND aa.rank_timeframe = '7d'
AND aa.username IN (SELECT * FROM filtered_users)
ORDER BY aa.rank ASC
LIMIT 15
OFFSET 0
)
select * from valid_users;
"user-stats".tag_table is a table with around 60 million rows, with proper indexes.
"users-results".ai_algo is a table with around 10 million rows, with proper indexes.
With proper indexes I mean all the fields that appear in a WHERE clause above.
If filtered_users is empty, the query takes 4 seconds to run. If filtered_users has at least one row, it takes 400ms.
Can anyone explain why? Is there a way to get the same performance (400ms) when filtered_users is empty? I expected performance to improve as the number of rows in filtered_users shrinks. That is what happens down to 1 row, but with 0 rows it takes 10 times longer.
Of course, the same happens if I replace the IN clause in the WHERE with an INNER JOIN between ai_algo and filtered_users.
Update
This is the EXPLAIN (ANALYZE, BUFFERS) output when filtered_users has 0 rows (4 seconds of execution):
Limit (cost=14592.13..15870.39 rows=15 width=35) (actual time=3953.945..3953.949 rows=0 loops=1)
Buffers: shared hit=7456641
-> Nested Loop Semi Join (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=3953.944..3953.947 rows=0 loops=1)
Join Filter: (aa.username = filtered_users_pre.username)
Buffers: shared hit=7456641
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.085..3885.547 rows=313611 loops=1)
Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 7793096
Buffers: shared hit=7456533
-> Materialize (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.000..0.000 rows=0 loops=313611)
Buffers: shared hit=108
-> Subquery Scan on filtered_users_pre (cost=14591.56..14672.44 rows=13 width=21) (actual time=3.543..3.545 rows=0 loops=1)
Filter: (filtered_users_pre.rk = 2)
Rows Removed by Filter: 2415
Buffers: shared hit=108
-> WindowAgg (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.996..3.356 rows=2415 loops=1)
Buffers: shared hit=108
-> Sort (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.990..2.189 rows=2415 loops=1)
Sort Key: tag_table_20210722.value
Sort Method: quicksort Memory: 285kB
Buffers: shared hit=108
-> Bitmap Heap Scan on tag_table_20210722 (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.612..1.080 rows=2415 loops=1)
Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 2415
Heap Blocks: exact=72
Buffers: shared hit=105
-> Bitmap Index Scan on tag_table_20210722_tag_idx (cost=0.00..145.57 rows=5428 width=0) (actual time=0.292..0.292 rows=4830 loops=1)
Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
Buffers: shared hit=33
Planning Time: 0.914 ms
Execution Time: 3954.035 ms
This is the output when filtered_users has at least 1 row (300ms):
Limit (cost=14592.13..15870.39 rows=15 width=35) (actual time=15.958..300.759 rows=15 loops=1)
Buffers: shared hit=11042
-> Nested Loop Semi Join (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=15.957..300.752 rows=15 loops=1)
Join Filter: (aa.username = filtered_users_pre.username)
Rows Removed by Join Filter: 1544611
Buffers: shared hit=11042
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.075..10.455 rows=645 loops=1)
Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 16124
Buffers: shared hit=10937
-> Materialize (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.003..0.174 rows=2395 loops=645)
Buffers: shared hit=105
-> Subquery Scan on filtered_users_pre (cost=14591.56..14672.44 rows=13 width=21) (actual time=1.895..3.680 rows=2415 loops=1)
Filter: (filtered_users_pre.rk = 1)
Buffers: shared hit=105
-> WindowAgg (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.894..3.334 rows=2415 loops=1)
Buffers: shared hit=105
-> Sort (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.889..2.102 rows=2415 loops=1)
Sort Key: tag_table_20210722.value
Sort Method: quicksort Memory: 285kB
Buffers: shared hit=105
-> Bitmap Heap Scan on tag_table_20210722 (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.604..1.046 rows=2415 loops=1)
Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 2415
Heap Blocks: exact=72
Buffers: shared hit=105
-> Bitmap Index Scan on tag_table_20210722_tag_idx (cost=0.00..145.57 rows=5428 width=0) (actual time=0.287..0.287 rows=4830 loops=1)
Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
Buffers: shared hit=33
Planning Time: 0.310 ms
Execution Time: 300.954 ms
The problem is that if there are no matching filtered_users, PostgreSQL has to go through all "users-results".ai_algo without finding 15 result rows. If the subquery contains elements, it quickly finds 15 matching "users-results".ai_algo rows and can terminate processing.
There is nothing you can do about that, but you can speed up the scan of "users-results".ai_algo. Currently, you have
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa
... (actual time=0.085..3885.547 rows=313611 loops=1)
Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 7793096
Buffers: shared hit=7456533
You see that the index scan is not as effective as it could be: it reads 313611 + 7793096 = 8106707 rows from the table and discards all but the 313611 that match the filter condition.
You can do better by creating an index that can find only the result rows directly:
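CREATE INDEX ON "users-results".ai_algo (rank_timeframe, _at_timestamp, rank);
Then you can drop the index ai_algo_rank_timeframe_rank_idx, because the new index covers what this query needs from the old one: for a given rank_timeframe and _at_timestamp it still returns rows in rank order, and it no longer has to read and discard rows with other timestamps.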
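One way to check whether the new index is doing its job is to look at the scan in isolation. The statement below is only a sketch and is not part of the original query: it counts the rows matching the same two conditions, so the plan shows how ai_algo is accessed without the join and the LIMIT getting in the way.

```sql
-- Diagnostic only: isolates the scan of ai_algo for the two filter conditions.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM "users-results".ai_algo aa
WHERE aa.rank_timeframe = '7d'
  AND aa._at_timestamp = 1626955200;
```

If the planner picks the new index, both conditions should show up as index conditions and the large "Rows Removed by Filter" figure should be gone. It is probably safest to confirm this before dropping the old index.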
I have this query:
select agg.app_id, sum(downloads) as downloads, sum(revenue) as revenue from (
SELECT distinct app_id FROM sdk_modules_apps
WHERE sdk_module_id = 27
) agg inner join revenue_hist_sdk ON revenue_hist_sdk.app_id = agg.app_id
group by agg.app_id
order by app_id asc, revenue desc
limit 30
This query takes about 1.5 minutes to execute. The EXPLAIN (analyze, buffers) is this:
Limit (cost=8498544.92..8498544.99 rows=30 width=68) (actual time=32005.449..32136.719 rows=30 loops=1)
Buffers: shared hit=6491 read=801721 written=312, temp read=1468232 written=1470609
-> Sort (cost=8498544.92..8499032.40 rows=194994 width=68) (actual time=32005.448..32005.449 rows=30 loops=1)
Sort Key: sdk_modules_apps.app_id, (sum(revenue_hist_sdk.revenue)) DESC
Sort Method: top-N heapsort Memory: 27kB
Buffers: shared hit=2409 read=266228, temp read=494935 written=495735
-> Finalize GroupAggregate (cost=7574427.43..8492785.88 rows=194994 width=68) (actual time=20995.772..31969.342 rows=160329 loops=1)
Group Key: sdk_modules_apps.app_id
Buffers: shared hit=2409 read=266228, temp read=494935 written=495735
-> Gather Merge (cost=7574427.43..8484986.12 rows=389988 width=68) (actual time=20995.663..31752.532 rows=371741 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=6491 read=801721 written=312, temp read=1468232 written=1470609
-> Partial GroupAggregate (cost=7573427.40..8438971.80 rows=194994 width=68) (actual time=20532.363..30555.301 rows=123914 loops=3)
Group Key: sdk_modules_apps.app_id
Buffers: shared hit=6491 read=801721 written=312, temp read=1468232 written=1470609
-> Merge Join (cost=7573427.40..8149761.54 rows=38171380 width=20) (actual time=20532.301..28720.598 rows=16933962 loops=3)
Merge Cond: (revenue_hist_sdk.app_id = sdk_modules_apps.app_id)
Buffers: shared hit=6491 read=801721 written=312, temp read=1468232 written=1470609
-> Sort (cost=7499645.14..7595073.59 rows=38171380 width=24) (actual time=20428.849..24388.498 rows=30537105 loops=3)
Sort Key: revenue_hist_sdk.app_id
Sort Method: external merge Disk: 1025856kB
Worker 0: Sort Method: external merge Disk: 1073648kB
Worker 1: Sort Method: external merge Disk: 948424kB
Buffers: shared hit=328 read=800986, temp read=1466846 written=1469211
-> Parallel Seq Scan on revenue_hist_sdk (cost=0.00..1183013.80 rows=38171380 width=24) (actual time=0.030..4294.558 rows=30537105 loops=3)
Buffers: shared hit=314 read=800986
-> Unique (cost=73782.26..75108.27 rows=194994 width=4) (actual time=103.447..181.850 rows=267078 loops=3)
Buffers: shared hit=6163 read=735 written=312, temp read=1386 written=1398
-> Sort (cost=73782.26..74445.27 rows=265203 width=4) (actual time=103.446..137.035 rows=267078 loops=3)
Sort Key: sdk_modules_apps.app_id
Sort Method: external merge Disk: 3696kB
Worker 0: Sort Method: external merge Disk: 3696kB
Worker 1: Sort Method: external merge Disk: 3696kB
Buffers: shared hit=6163 read=735 written=312, temp read=1386 written=1398
-> Bitmap Heap Scan on sdk_modules_apps (cost=3147.76..47560.79 rows=265203 width=4) (actual time=16.269..53.601 rows=267525 loops=3)
Recheck Cond: (sdk_module_id = 27)
Heap Blocks: exact=1560
Buffers: shared hit=6149 read=735 written=312
-> Bitmap Index Scan on sdk_modules_apps_sdk_module_id_index (cost=0.00..3081.45 rows=265203 width=0) (actual time=16.084..16.084 rows=267525 loops=3)
Index Cond: (sdk_module_id = 27)
Buffers: shared hit=1469 read=735 written=312
Planning Time: 0.131 ms
Execution Time: 32388.741 ms
The table revenue_hist_sdk contains 90,000,000 records, and I need to sort by downloads or revenue from this table after joining it to sdk_modules_apps. The number of records from sdk_modules_apps is about 250,000. I don't see how to make it faster. I tried creating different indexes with different sort orders, but sometimes it got worse.
The tables schemas:
sdk_modules_apps: sdk_module_id, app_id, installed, uninstalled
revenue_hist_sdk: app_id, utc_date, downloads, revenue
How can I deal with long execution times when sorting large tables?
Write the query using exists:
SELECT app_id, sum(downloads) as downloads, sum(revenue) as revenue
FROM revenue_hist_sdk rhs
WHERE EXISTS (SELECT 1
FROM sdk_modules_apps a
WHERE a.app_id = rhs.app_id AND a.sdk_module_id = 27
)
GROUP BY app_id
ORDER BY downloads desc
LIMIT 30;
Then I would suggest indexes on: sdk_modules_apps(app_id, sdk_module_id) and revenue_hist_sdk(app_id). The second index could also have downloads and revenue as additional keys.
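The suggested indexes might be created like this. The index names are invented for illustration, and the second statement shows the variant that carries downloads and revenue as additional keys.

```sql
-- Supports the EXISTS probe: look up by app_id, then check sdk_module_id.
-- Index names are hypothetical.
CREATE INDEX sdk_modules_apps_app_id_module_idx
    ON sdk_modules_apps (app_id, sdk_module_id);

-- Grouping key first; downloads and revenue are appended so the sums
-- can potentially be computed from the index alone.
CREATE INDEX revenue_hist_sdk_app_id_dl_rev_idx
    ON revenue_hist_sdk (app_id, downloads, revenue);
```

Whether PostgreSQL actually answers the aggregate with an index-only scan depends on the table being vacuumed recently enough, so the benefit of the extra keys is worth confirming with EXPLAIN (ANALYZE, BUFFERS).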
I run the query below on both a PostgreSQL and a SQL Server database (using TOP on SQL Server). Sorting on the "_change_sequence" value accounts for a high cost in my query. Is there any way to reduce the cost while keeping the same results?
Query:
SELECT tablename,
CAST(primary_key_values AS VARCHAR),
primary_key_fields,
CAST(min_sequence AS NUMERIC),
_changed_fieldlist,
_operation,
min_sequence
FROM (
SELECT 'memdep' AS tablename,
CONCAT_WS(',',dependant,mem_num) AS primary_key_values,
'dependant,mem_num,' AS primary_key_fields,
_change_sequence AS min_sequence,
ROW_NUMBER() OVER(partition by dependant,mem_num order by _change_sequence) AS rn,
_changed_fieldlist,
_operation
FROM mipbi_ods.memdep
WHERE mipbi_status = 'NEW'
) main
WHERE rn = 1
LIMIT 100
In essence, what I'm looking for is the records from "memdep" that have a "mipbi_status" of 'NEW', with the lowest "_change_sequence". I've tried using a MIN() function instead of ROW_NUMBER(); the speed is about the same and the cost is about 5 higher.
Is there a way to reduce the cost and run time of the query? I have around 400 million records in this table, if that helps.
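The MIN() attempt itself is not shown in the question. One plausible shape, assuming the same table and key columns as the ROW_NUMBER() query, would be:

```sql
-- Lowest _change_sequence per (dependant, mem_num) among the 'NEW' rows.
-- This is a guess at the MIN() variant, not the statement actually run.
SELECT dependant,
       mem_num,
       MIN(_change_sequence) AS min_sequence
FROM mipbi_ods.memdep
WHERE mipbi_status = 'NEW'
GROUP BY dependant, mem_num;
```

This form returns only the key columns and the minimum sequence. Fetching _changed_fieldlist and _operation for the matching row would need a join back to memdep, which may be part of why it was not faster in practice.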
Here is the query explained:
Limit (cost=3080.03..3080.53 rows=100 width=109) (actual time=17.633..17.648 rows=35 loops=1)
-> Unique (cost=3080.03..3089.04 rows=1793 width=109) (actual time=17.632..17.644 rows=35 loops=1)
-> Sort (cost=3080.03..3084.53 rows=1803 width=109) (actual time=17.631..17.634 rows=36 loops=1)
Sort Key: (concat_ws(','::text, memdet.mem_num))
Sort Method: quicksort Memory: 29kB
-> Bitmap Heap Scan on memdet (cost=54.39..2982.52 rows=1803 width=109) (actual time=16.853..17.542 rows=36 loops=1)
Recheck Cond: ((mipbi_status)::text = 'NEW'::text)
Heap Blocks: exact=8
-> Bitmap Index Scan on idx_mipbi_status_memdet (cost=0.00..53.94 rows=1803 width=0) (actual time=10.396..10.396 rows=38 loops=1)
Index Cond: ((mipbi_status)::text = 'NEW'::text)
Planning time: 0.201 ms
Execution time: 17.700 ms
I'm using a smaller table here; this isn't the 400-million-record table, but the indexes and everything else are the same.
Here is the query plan for the large table:
Limit (cost=47148422.27..47149122.27 rows=100 width=113) (actual time=2407976.293..2407977.112 rows=100 loops=1)
Output: main.tablename, ((main.primary_key_values)::character varying), main.primary_key_fields, main.min_sequence, main._changed_fieldlist, main._operation, main.min_sequence
Buffers: shared hit=6269554 read=12205028 dirtied=1893 written=4566983, temp read=443831 written=1016025
-> Subquery Scan on main (cost=47148422.27..52102269.25 rows=707692 width=113) (actual time=2407976.292..2407977.100 rows=100 loops=1)
Output: main.tablename, (main.primary_key_values)::character varying, main.primary_key_fields, main.min_sequence, main._changed_fieldlist, main._operation, main.min_sequence
Filter: (main.rn = 1)
Buffers: shared hit=6269554 read=12205028 dirtied=1893 written=4566983, temp read=443831 written=1016025
-> WindowAgg (cost=47148422.27..50333038.19 rows=141538485 width=143) (actual time=2407976.288..2407977.080 rows=100 loops=1)
Output: 'claim', concat_ws(','::text, claim.gen_claimnum), 'gen_claimnum,', claim._change_sequence, row_number() OVER (?), claim._changed_fieldlist, claim._operation, claim.gen_claimnum
Buffers: shared hit=6269554 read=12205028 dirtied=1893 written=4566983, temp read=443831 written=1016025
-> Sort (cost=47148422.27..47502268.49 rows=141538485 width=39) (actual time=2407976.236..2407976.905 rows=100 loops=1)
Output: claim._change_sequence, claim.gen_claimnum, claim._changed_fieldlist, claim._operation
Sort Key: claim.gen_claimnum, claim._change_sequence
Sort Method: external merge Disk: 4588144kB
Buffers: shared hit=6269554 read=12205028 dirtied=1893 written=4566983, temp read=443831 written=1016025
-> Seq Scan on mipbi_ods.claim (cost=0.00..20246114.01 rows=141538485 width=39) (actual time=0.028..843181.418 rows=88042077 loops=1)
Output: claim._change_sequence, claim.gen_claimnum, claim._changed_fieldlist, claim._operation
Filter: ((claim.mipbi_status)::text = 'NEW'::text)
Rows Removed by Filter: 356194
Buffers: shared hit=6269554 read=12205028 dirtied=1893 written=4566983
Planning time: 8.796 ms
Execution time: 2408702.464 ms
I have the following table structure:
AdPerformance: id, ad_id, impressions
Targeting: value
AdActions: app_starts
Ad: id, name, parent_id
AdTargeting: id, targeting_, ad_id
Targeting: id, name, value
AdProduct: id, ad_id, name
I need to aggregate the data by targeting, restricted to a product name, so I wrote the following query:
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN (
SELECT ad_id, value from targeting, ad_targeting
WHERE targeting.id = ad_targeting.id and targeting.name = 'gender'
) targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
(SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value
However, under EXPLAIN ANALYZE the above query takes about 5 s for ~1M records.
Is there a way to improve it?
I do have indexes on the foreign keys.
UPDATED
See the output of EXPLAIN ANALYZE:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=5787.28..5789.87 rows=259 width=254) (actual time=3283.763..3286.015 rows=5998 loops=1)
Group Key: adobject_performance.ad_id, targeting.value
Buffers: shared hit=3400223
-> Nested Loop Left Join (cost=241.63..5603.63 rows=8162 width=254) (actual time=10.438..2774.664 rows=839720 loops=1)
Buffers: shared hit=3400223
-> Nested Loop (cost=241.21..1590.52 rows=8162 width=250) (actual time=10.412..703.818 rows=839720 loops=1)
Join Filter: (adobject.id = adobject_performance.ad_id)
Buffers: shared hit=36755
-> Hash Join (cost=240.78..323.35 rows=9 width=226) (actual time=10.380..20.332 rows=5998 loops=1)
Hash Cond: (ad_product.ad_id = ad.id)
Buffers: shared hit=190
-> HashAggregate (cost=128.98..188.96 rows=5998 width=4) (actual time=3.788..6.821 rows=5998 loops=1)
Group Key: ad_product.ad_id
Buffers: shared hit=39
-> Seq Scan on ad_product (cost=0.00..113.99 rows=5998 width=4) (actual time=0.011..1.726 rows=5998 loops=1)
Filter: ((product)::text = 'ft2_iPhone'::text)
Rows Removed by Filter: 1
Buffers: shared hit=39
-> Hash (cost=111.69..111.69 rows=9 width=222) (actual time=6.578..6.578 rows=5998 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 241kB
Buffers: shared hit=151
-> Hash Join (cost=30.26..111.69 rows=9 width=222) (actual time=0.154..4.660 rows=5998 loops=1)
Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
Buffers: shared hit=151
-> Seq Scan on adobject (cost=0.00..77.97 rows=897 width=8) (actual time=0.009..1.449 rows=6001 loops=1)
Buffers: shared hit=69
-> Hash (cost=30.24..30.24 rows=2 width=222) (actual time=0.132..0.132 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=82
-> Nested Loop (cost=0.15..30.24 rows=2 width=222) (actual time=0.101..0.129 rows=2 loops=1)
Buffers: shared hit=82
-> Seq Scan on targeting (cost=0.00..13.88 rows=2 width=222) (actual time=0.015..0.042 rows=79 loops=1)
Filter: (name = 'age group'::targeting_name)
Rows Removed by Filter: 82
Buffers: shared hit=1
-> Index Scan using advertising_targeting_pkey on adobject_targeting (cost=0.15..8.17 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=79)
Index Cond: (id = targeting.id)
Buffers: shared hit=81
-> Index Scan using "fki_advertising_peformance_advertising_entity_id -> advertising" on adobject_performance (cost=0.42..89.78 rows=4081 width=32) (actual time=0.007..0.046 rows=140 loops=5998)
Index Cond: (ad_id = ad_product.ad_id)
Buffers: shared hit=36565
-> Index Scan using facebook_advertising_actions_pkey on facebook_adobject_actions (cost=0.42..0.48 rows=1 width=12) (actual time=0.001..0.002 rows=1 loops=839720)
Index Cond: (ad_performance.id = ad_performance_id)
Buffers: shared hit=3363468
Planning time: 1.525 ms
Execution time: 3287.324 ms
(46 rows)
Shooting blindly here, as we have not been given the EXPLAIN output, but Postgres should still handle this query better if you move the targeting subquery into a CTE:
WITH targeting AS
(
SELECT ad_id, value from targeting, ad_targeting
WHERE targeting.id = ad_targeting.id and targeting.name = 'gender'
)
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
(SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value
Taken from the Documentation:
A useful property of WITH queries is that they are evaluated only once
per execution of the parent query, even if they are referred to more
than once by the parent query or sibling WITH queries. Thus, expensive
calculations that are needed in multiple places can be placed within a
WITH query to avoid redundant work. Another possible application is to
prevent unwanted multiple evaluations of functions with side-effects.
The execution plan does not seem to match the query any more (maybe you can update the query).
However, the problem now is here:
-> Hash Join (cost=30.26..111.69 rows=9 width=222)
(actual time=0.154..4.660 rows=5998 loops=1)
Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
Buffers: shared hit=151
-> Seq Scan on adobject (cost=0.00..77.97 rows=897 width=8)
(actual time=0.009..1.449 rows=6001 loops=1)
Buffers: shared hit=69
-> Hash (cost=30.24..30.24 rows=2 width=222)
(actual time=0.132..0.132 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=82
-> Nested Loop (cost=0.15..30.24 rows=2 width=222)
(actual time=0.101..0.129 rows=2 loops=1)
Buffers: shared hit=82
-> Seq Scan on targeting (cost=0.00..13.88 rows=2 width=222)
(actual time=0.015..0.042 rows=79 loops=1)
Filter: (name = 'age group'::targeting_name)
Rows Removed by Filter: 82
Buffers: shared hit=1
-> Index Scan using advertising_targeting_pkey on adobject_targeting
(cost=0.15..8.17 rows=1 width=8)
(actual time=0.001..0.001 rows=0 loops=79)
Index Cond: (id = targeting.id)
Buffers: shared hit=81
This is a join between adobject and the result of
targeting JOIN adobject_targeting
USING (id)
WHERE targeting.name = 'age group'
The latter subquery is correctly estimated to 2 rows, but PostgreSQL fails to notice that almost all rows found in adobject will match one of those two rows, so that the result of the join will be 6000 rather than the 9 it estimates.
This causes the optimizer to wrongly choose a nested loop join later on, where more than half of the query time is spent.
Unfortunately, since PostgreSQL doesn't have cross-table statistics, there is no way for PostgreSQL to know better.
One coarse measure is to SET enable_nestloop=off, but that will deteriorate the performance of the other (correctly chosen) nested loop join, so I don't know if it will be a net win.
If that helps, you could consider changing the parameter only for the duration of the query (with a transaction and SET LOCAL).
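A minimal sketch of that approach, using the query as posted in the question (which, as noted, may not be exactly the one behind the plan):

```sql
BEGIN;

-- SET LOCAL reverts automatically at COMMIT or ROLLBACK,
-- so other queries in the session keep the default planner settings.
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
       sum(impressions) AS impressions,
       sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad ON ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN (
    SELECT ad_id, value FROM targeting, ad_targeting
    WHERE targeting.id = ad_targeting.id AND targeting.name = 'gender'
) targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
      (SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value;

COMMIT;
```

Note that enable_nestloop = off does not forbid nested loops outright; it only makes the planner treat them as very expensive, so one may still appear where no other join method is possible. Comparing this plan's execution time with the original shows whether the setting is a net win.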
Maybe there is a way to rewrite the query so that a better plan can be found, but that is hard to say without knowing the exact query.
I don't know if this query will solve your problem, but try it:
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN ad_targeting on ad_targeting.ad_id = ad.parent_id
INNER JOIN targeting on targeting.id = ad_targeting.id and targeting.name = 'gender'
INNER JOIN ad_product on ad_product.ad_id = ad_performance.ad_id
WHERE ad_product.product = 'iphone'
GROUP BY ad_performance.ad_id, targeting_value
You could also create indexes on all the columns that you use in ON or WHERE conditions.
I execute select * from a view on 3 tables (gis_wcdma_cells, wcdma_cells, wcdma_cell_statistic):
CREATE OR REPLACE VIEW gisview_wcdma_cell_statistic AS
SELECT gis_wcdma_cells.wcdma_cell_id,
gis_wcdma_cells.geometry,
wcdma_cells."Cellname",
wcdma_cells."Longitude",
wcdma_cells."Latitude",
wcdma_cells."Orientation",
wcdma_cell_statistic.avg_ecio,
wcdma_cell_statistic.avg_rssi,
wcdma_cell_statistic.message_count,
wcdma_cell_statistic.measurement_count
FROM gis_wcdma_cells,
wcdma_cells
LEFT JOIN wcdma_cell_statistic ON wcdma_cells.id = wcdma_cell_statistic.cell_id
WHERE gis_wcdma_cells.wcdma_cell_id = wcdma_cells.id;
However, I find it quite slow, although I have created indexes on the 3 tables in my view:
hash index created on wcdma_cells.id
wcdma_cell_statistic.cell_id
gis_wcdma_cells.wcdma_cell_id
I find that select * from the view is much slower than selecting specific columns from it, even though the EXPLAIN output for select * looks faster (I mean the cost is much lower in the select * case). Could you help explain why? And are there any suggestions for making the query faster?
EXPLAIN output for "select *" from the view:
Nested Loop Left Join (cost=0.00..11501.93 rows=41036 width=479) (actual time=0.034..326.680 rows=41036 loops=1)
Buffers: shared hit=238761
-> Nested Loop (cost=0.00..9017.44 rows=41036 width=467) (actual time=0.026..183.752 rows=41036 loops=1)
Buffers: shared hit=125390
-> Seq Scan on gis_wcdma_cells (cost=0.00..2690.36 rows=41036 width=396) (actual time=0.007..13.359 rows=41036 loops=1)
Buffers: shared hit=2280
-> Index Scan using wcdma_cells_id_idx on wcdma_cells (cost=0.00..0.14 rows=1 width=71) (actual time=0.002..0.003 rows=1 loops=41036)
Index Cond: (id = gis_wcdma_cells.wcdma_cell_id)
Rows Removed by Index Recheck: 0
Buffers: shared hit=123110
-> Index Scan using wcdma_cell_statistic_cell_id_idx on wcdma_cell_statistic (cost=0.00..0.05 rows=1 width=20) (actual time=0.002..0.002 rows=1 loops=41036)
Index Cond: (wcdma_cells.id = cell_id)
Rows Removed by Index Recheck: 0
Buffers: shared hit=113371
Total runtime: 334.551 ms
EXPLAIN output for selecting specific columns from the view:
Nested Loop Left Join (cost=2340.31..9007.59 rows=41036 width=58) (actual time=46.498..263.589 rows=41036 loops=1)
Buffers: shared hit=116667, temp read=400 written=394
-> Hash Join (cost=2340.31..6523.10 rows=41036 width=58) (actual time=46.471..119.983 rows=41036 loops=1)
Hash Cond: (gis_wcdma_cells.wcdma_cell_id = wcdma_cells.id)
Buffers: shared hit=3296, temp read=400 written=394
-> Seq Scan on gis_wcdma_cells (cost=0.00..2690.36 rows=41036 width=4) (actual time=0.007..14.688 rows=41036 loops=1)
Buffers: shared hit=2280
-> Hash (cost=1426.36..1426.36 rows=41036 width=54) (actual time=46.358..46.358 rows=41036 loops=1)
Buckets: 2048 Batches: 4 Memory Usage: 927kB
Buffers: shared hit=1016, temp written=299
-> Seq Scan on wcdma_cells (cost=0.00..1426.36 rows=41036 width=54) (actual time=0.004..24.612 rows=41036 loops=1)
Buffers: shared hit=1016
-> Index Scan using wcdma_cell_statistic_cell_id_idx on wcdma_cell_statistic (cost=0.00..0.05 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=41036)
Index Cond: (wcdma_cells.id = cell_id)
Rows Removed by Index Recheck: 0
Buffers: shared hit=113371
Total runtime: 271.146 ms
I am running the SQL query in Navicat or pgAdmin. The query execution time is 15 seconds when selecting specific columns and a few minutes for select *.