Optimize PostgreSQL query - sql

I have two tables in PostgreSQL 9.1: flight_2012_09_12, containing approximately 500,000 rows, and position_2012_09_12, containing about 5.5 million rows. I'm running a simple join query that is taking a long time to complete, and although the tables aren't small, I'm convinced there are some major gains to be made in how it executes.
The query is:
SELECT f.departure, f.arrival,
p.callsign, p.flightkey, p.time, p.lat, p.lon, p.altitude_ft, p.speed
FROM position_2012_09_12 AS p
JOIN flight_2012_09_12 AS f
ON p.flightkey = f.flightkey
WHERE p.lon < 0
AND p.time BETWEEN '2012-9-12 0:0:0' AND '2012-9-12 23:0:0'
The output of explain analyze is:
Hash Join (cost=239891.03..470396.82 rows=4790498 width=51) (actual time=29203.830..45777.193 rows=4403717 loops=1)
Hash Cond: (f.flightkey = p.flightkey)
-> Seq Scan on flight_2012_09_12 f (cost=0.00..1934.31 rows=70631 width=12) (actual time=0.014..220.494 rows=70631 loops=1)
-> Hash (cost=158415.97..158415.97 rows=3916885 width=43) (actual time=29201.012..29201.012 rows=3950815 loops=1)
Buckets: 2048 Batches: 512 (originally 256) Memory Usage: 1025kB
-> Seq Scan on position_2012_09_12 p (cost=0.00..158415.97 rows=3916885 width=43) (actual time=0.006..14630.058 rows=3950815 loops=1)
Filter: ((lon < 0::double precision) AND ("time" >= '2012-09-12 00:00:00'::timestamp without time zone) AND ("time" <= '2012-09-12 23:00:00'::timestamp without time zone))
Total runtime: 58522.767 ms
I think the problem lies with the sequential scan on the position table but I can't figure out why it's there. The table structures with indexes are below:
Table "public.flight_2012_09_12"
Column | Type | Modifiers
--------------------+-----------------------------+-----------
callsign | character varying(8) |
flightkey | integer |
source | character varying(16) |
departure | character varying(4) |
arrival | character varying(4) |
original_etd | timestamp without time zone |
original_eta | timestamp without time zone |
enroute | boolean |
etd | timestamp without time zone |
eta | timestamp without time zone |
equipment | character varying(6) |
diverted | timestamp without time zone |
time | timestamp without time zone |
lat | double precision |
lon | double precision |
altitude | character varying(7) |
altitude_ft | integer |
speed | character varying(4) |
asdi_acid | character varying(4) |
enroute_eta | timestamp without time zone |
enroute_eta_source | character varying(1) |
Indexes:
"flight_2012_09_12_flightkey_idx" btree (flightkey)
"idx_2012_09_12_altitude_ft" btree (altitude_ft)
"idx_2012_09_12_arrival" btree (arrival)
"idx_2012_09_12_callsign" btree (callsign)
"idx_2012_09_12_departure" btree (departure)
"idx_2012_09_12_diverted" btree (diverted)
"idx_2012_09_12_enroute_eta" btree (enroute_eta)
"idx_2012_09_12_equipment" btree (equipment)
"idx_2012_09_12_etd" btree (etd)
"idx_2012_09_12_lat" btree (lat)
"idx_2012_09_12_lon" btree (lon)
"idx_2012_09_12_original_eta" btree (original_eta)
"idx_2012_09_12_original_etd" btree (original_etd)
"idx_2012_09_12_speed" btree (speed)
"idx_2012_09_12_time" btree ("time")
Table "public.position_2012_09_12"
Column | Type | Modifiers
-------------+-----------------------------+-----------
callsign | character varying(8) |
flightkey | integer |
time | timestamp without time zone |
lat | double precision |
lon | double precision |
altitude | character varying(7) |
altitude_ft | integer |
course | integer |
speed | character varying(4) |
trackerkey | integer |
the_geom | geometry |
Indexes:
"index_2012_09_12_altitude_ft" btree (altitude_ft)
"index_2012_09_12_callsign" btree (callsign)
"index_2012_09_12_course" btree (course)
"index_2012_09_12_flightkey" btree (flightkey)
"index_2012_09_12_speed" btree (speed)
"index_2012_09_12_time" btree ("time")
"position_2012_09_12_flightkey_idx" btree (flightkey)
"test_index" btree (lon)
"test_index_lat" btree (lat)
I can't think of any other way to rewrite the query, so I'm stumped at this point. If the current setup is as good as it gets, so be it, but it seems to me that it should be much faster than it currently is. Any help would be much appreciated.

The row count estimates are pretty reasonable, so I doubt this is a stats issue.
I'd try:
Creating an index on position_2012_09_12(lon, "time"), or possibly a partial index on position_2012_09_12("time") WHERE (lon < 0) if you routinely search for lon < 0.
Setting random_page_cost lower, maybe 1.1. See whether (a) this changes the plan and (b) the new plan is actually faster. For testing purposes, to see if avoiding a seqscan would be faster, you can SET enable_seqscan = off; if it is, change the cost parameters.
Increasing work_mem for this query. SET work_mem = '10MB' or something before running it.
Running the latest PostgreSQL if you aren't already. Always specify your PostgreSQL version in questions. (Update after edit): You're on 9.1; that's fine. The biggest performance improvement in 9.2 was index-only scans, and it doesn't seem likely that you'd benefit massively from index-only scans for this query.
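A combined sketch of those experiments (the partial-index name is illustrative, and the SET commands only affect the current session):
CREATE INDEX position_2012_09_12_time_neg_lon_idx
    ON position_2012_09_12 ("time") WHERE lon < 0;
SET random_page_cost = 1.1;   -- cheaper random I/O, favours index plans
SET work_mem = '10MB';        -- more memory for the hash join
-- SET enable_seqscan = off;  -- testing only, to see what the non-seqscan plan costs
EXPLAIN ANALYZE
SELECT f.departure, f.arrival,
       p.callsign, p.flightkey, p.time, p.lat, p.lon, p.altitude_ft, p.speed
FROM position_2012_09_12 AS p
JOIN flight_2012_09_12 AS f ON p.flightkey = f.flightkey
WHERE p.lon < 0
  AND p.time BETWEEN '2012-09-12 00:00:00' AND '2012-09-12 23:00:00';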
You'll also somewhat improve performance if you can get rid of columns to narrow the rows. It won't make tons of difference, but it'll make some.

The reason you are getting a sequential scan is that Postgres believes it will read fewer disk pages that way than by using indexes. It is probably right. Consider: if you use a non-covering index, you need to read all the matching index pages, which essentially gives you a list of row identifiers. The DB engine then needs to read each of the matching data pages.
Your position table uses 71 bytes per row, plus whatever a geom type takes (I'll assume 16 bytes for illustration), making 87 bytes. A Postgres page is 8192 bytes, so you have approximately 90 rows per page.
Your query matches 3,950,815 out of 5,563,070 rows, or about 70% of the total. Assuming the data is randomly distributed with regard to your WHERE filters, the chance of a data page containing no matching row is roughly 0.3^90, which is essentially zero. So regardless of how good your indexes are, you're still going to have to read all the data pages, and if you're going to read all the pages anyway, a table scan is usually a good approach.
The one get-out here is that I said non-covering index. If you are prepared to create indexes that can answer queries in and of themselves, you can avoid looking up the data pages at all, so you are back in the game. I'd suggest the following are worth looking at:
flight_2012_09_12 (flightkey, departure, arrival)
position_2012_09_12 (flightkey, time, lon, ...)
position_2012_09_12 (lon, time, flightkey, ...)
position_2012_09_12 (time, lon, flightkey, ...)
The dots here represent the rest of the columns you are selecting. You'll only need one of the indexes on position, but it's hard to tell which will prove the best. The first approach may permit a merge join on presorted data, at the cost of reading the whole second index to do the filtering. The second and third will allow data to be prefiltered, but require a hash join. Given how much of the cost appears to be in the hash join, the merge join might well be a good option.
As your query requires 52 of the 87 bytes per row, and indexes have overheads, you may not end up with the index taking much, if any, less space than the table itself.
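As a rough sketch, the first and third of those might be created like this (index names are illustrative; note that Postgres only gains index-only scans in 9.2, so the benefit on 9.1 is limited):
CREATE INDEX flight_2012_09_12_cover_idx
    ON flight_2012_09_12 (flightkey, departure, arrival);
CREATE INDEX position_2012_09_12_cover_idx
    ON position_2012_09_12 (lon, "time", flightkey, callsign, lat, altitude_ft, speed);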
Another approach is to attack the "randomly distributed" side of it, by looking at clustering.
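A sketch of that, assuming the existing time index is a reasonable clustering key (CLUSTER rewrites the table and holds an exclusive lock while it runs):
CLUSTER position_2012_09_12 USING index_2012_09_12_time;
ANALYZE position_2012_09_12;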

Related

Efficiency problem querying postgresql table

I have the following PostgreSQL table with about 67 million rows, which stores the EOD prices for all the US stocks starting in 1985:
Table "public.eods"
Column | Type | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
stk | character varying(16) | | not null |
dt | date | | not null |
o | integer | | not null |
hi | integer | | not null |
lo | integer | | not null |
c | integer | | not null |
v | integer | | |
Indexes:
"eods_pkey" PRIMARY KEY, btree (stk, dt)
"eods_dt_idx" btree (dt)
I would like to query efficiently the table above based on either the stock name or the date. The primary key of the table is stock name and date. I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.
Unfortunately, I see a big difference in performance for the queries below. While getting all the records for a specific stock takes a decent amount of time to complete (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have tried to analyze these queries using explain analyze, and I have got the results below:
explain analyze select * from eods where stk='MSFT';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on eods (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
Recheck Cond: ((stk)::text = 'MSFT'::text)
Heap Blocks: exact=367
-> Bitmap Index Scan on eods_pkey (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
Index Cond: ((stk)::text = 'MSFT'::text)
Planning Time: 577.906 ms
Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using eods_dt_idx on eods (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
Index Cond: (dt = '2010-02-22'::date)
Planning Time: 67.876 ms
Execution Time: 56970.499 ms
(4 rows)
I really cannot understand why the second query runs 28 times slower than the first. They retrieve a similar number of records and they both seem to be using an index. So could somebody please explain why there is such a difference in performance, and what I can do to improve the performance of the queries that retrieve all the records for a specific date?
I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk, so the rows for a given stk are on a handful of pages that pretty much only contain that stk.
So, the execution engine is only reading about 25 pages.
On the other hand, no single page contains two records for the same date. When you read by date, you have to read about 7,556 pages. That is, about 300 times the number of pages.
The scaling must also take into account the work for loading and reading the index. This should be about the same for the two queries, so the ratio is less than a factor of 300.
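A quick way to check that guess about the physical layout is the planner's correlation statistic (a sketch; it relies on statistics gathered by ANALYZE):
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'eods' AND attname IN ('stk', 'dt');
A correlation near 1 (or -1) for stk and near 0 for dt would confirm that rows for one stock sit together while rows for one date are scattered across the table.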
There can be several issues, so it is hard to say where the problem is. An index scan should usually be faster than a bitmap heap scan; if it is not, there can be the following problems:
unhealthy index - try to run REINDEX INDEX indexname
bad statistics - try to run ANALYZE tablename
suboptimal state of table - try to run VACUUM tablename
too low or too high a setting of effective_cache_size
issues with IO - some systems have a problem with high random IO; try to increase random_page_cost
Investigating what the issue is can be a little bit of alchemy, but it is possible - there is only a closed set of very probable issues. A good start is:
VACUUM ANALYZE tablename
benchmark your IO if possible (with something like bonnie++)
To find the difference, you'll probably have to run EXPLAIN (ANALYZE, BUFFERS) on the query so that you see how many blocks are touched and where they come from.
I can think of two reasons:
Bad statistics that make PostgreSQL believe that dt has a high correlation while it does not. If the correlation is low, a bitmap index scan is often more efficient.
To see if that is the problem, run
ANALYZE eods;
and see if that changes the execution plans chosen.
Caching effects: perhaps the first query finds all required blocks already cached, while the second doesn't.
At any rate, it might be worth experimenting to see if a bitmap index scan would be cheaper for the second query:
SET enable_indexscan = off;
Then repeat the query.
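A sketch of that experiment:
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM eods WHERE dt = '2010-02-22';
RESET enable_indexscan;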

Optimize Postgres deletion of orphaned records

Take the following two tables:
Table "public.contacts"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('contacts_id_seq'::regclass) | plain | |
created_at | timestamp without time zone | not null | plain | |
updated_at | timestamp without time zone | not null | plain | |
external_id | integer | | plain | |
email_address | character varying | | extended | |
first_name | character varying | | extended | |
last_name | character varying | | extended | |
company | character varying | | extended | |
industry | character varying | | extended | |
country | character varying | | extended | |
region | character varying | | extended | |
ext_instance_id | integer | | plain | |
title | character varying | | extended | |
Indexes:
"contacts_pkey" PRIMARY KEY, btree (id)
"index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)
and
Table "public.members"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('members_id_seq'::regclass) | plain | |
step_id | integer | | plain | |
contact_id | integer | | plain | |
rule_id | integer | | plain | |
request_id | integer | | plain | |
sync_id | integer | | plain | |
status | integer | not null default 0 | plain | |
matched_targeted_rule | boolean | default false | plain | |
external_fields | jsonb | | extended | |
imported_at | timestamp without time zone | | plain | |
campaign_id | integer | | plain | |
ext_instance_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
Indexes:
"members_pkey" PRIMARY KEY, btree (id)
"index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
"index_members_on_campaign_id" btree (campaign_id)
"index_members_on_step_id" btree (step_id)
"index_members_on_sync_id" btree (sync_id)
"index_members_on_request_id" btree (request_id)
"index_members_on_status" btree (status)
Indices exist for both primary keys and members.contact_id.
I need to delete any contact which has no related members. There are roughly 3MM contact and 25MM member records.
I'm attempting the following two queries:
Query 1:
DELETE FROM "contacts"
WHERE "contacts"."id" IN (SELECT "contacts"."id"
FROM "contacts"
LEFT OUTER JOIN members
ON
members.contact_id = contacts.id
WHERE members.id IS NULL);
DELETE 0
Time: 173033.801 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
-> Nested Loop (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
-> HashAggregate (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
Group Key: contacts_1.id
-> Hash Right Join (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
Hash Cond: (members.contact_id = contacts_1.id)
Filter: (members.id IS NULL)
Rows Removed by Filter: 26725870
-> Seq Scan on members (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
-> Hash (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 34361kB
-> Seq Scan on contacts contacts_1 (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
-> Index Scan using contacts_pkey on contacts (cost=0.43..1.48 rows=1 width=10) (never executed)
Index Cond: (id = contacts_1.id)
Planning time: 0.488 ms
Execution time: 188718.862 ms
Query 2:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
DELETE 0
Time: 170871.219 ms
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
-> Hash Anti Join (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
-> Hash (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 36228kB
-> Seq Scan on members c (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
Planning time: 0.901 ms
Execution time: 177524.526 ms
As you can see, even without deleting any records, both queries show similar performance, taking ~3 minutes.
The server disk I/O spikes to 100%, so I'm assuming that data is being spilled to disk because a sequential scan is done on both contacts and members.
The server is an EC2 r3.large (15GB RAM).
Any ideas on what I can do to optimize this query?
Update #1:
After running VACUUM ANALYZE on both tables and ensuring enable_mergejoin is set to on, there is no difference in the query time:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
-> Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
-> Hash (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 37006kB
-> Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Planning time: 0.328 ms
Execution time: 209408.040 ms
Update 2:
PG Version:
PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit
Relation size:
Table | Size | External Size
-----------------------+---------+---------------
members | 23 GB | 11 GB
contacts | 944 MB | 371 MB
Settings:
work_mem
----------
64MB
random_page_cost
------------------
4
Update 3:
Experimenting with doing this in batches doesn't seem to help the I/O usage (it still spikes to 100%) and doesn't seem to improve the time, despite using index-based plans.
DO $do$
BEGIN
FOR i IN 57..668
LOOP
DELETE
FROM contacts
WHERE contacts.id IN
(
SELECT contacts.id
FROM contacts
left outer join members
ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND contacts.id >= (i * 10000)
AND contacts.id < ((i+1) * 10000));
END LOOP;END $do$;
I had to kill the query after Time: 1203492.326 ms, and disk I/O stayed at 100% the entire time the query ran. I also experimented with chunks of 1,000 and 5,000 but did not see any increase in performance.
Note: The 57..668 range was used because I know it covers the existing contact IDs (i.e. min(id) and max(id)).
One approach to problems like this can be to do it in smaller chunks.
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND id >= 1 AND id < 1000
);
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND id >= 1001 AND id < 2000
);
Rinse, repeat. Experiment with different chunk sizes to find an optimal one for your data set - one that uses the fewest queries while keeping them all in memory.
Naturally, you would want to script this, possibly in plpgsql, or in whatever scripting language you prefer.
Any ideas on what I can do to optimize this query?
Your queries are perfect. I would use the NOT EXISTS variant.
Your index index_members_on_contact_id_and_step_id is also good for it:
Is a composite index also good for queries on the first field?
But see below about BRIN indexes.
You can tune your server, table and index configuration.
Since you do not actually update or delete many rows (hardly any at all, according to your comment?), you need to optimize read performance.
1. Upgrade your Postgres version
You provided:
The server is an EC2 r3.large (15GB RAM).
And:
PostgreSQL 9.4.4
Your version is seriously outdated. At least upgrade to the latest minor version. Better yet, upgrade to the current major version. Postgres 9.5 and 9.6 brought major improvements for big data - which is what you need exactly.
Consider the versioning policy of the project.
Amazon allows you to upgrade!
2. Improve table statistics
There is an unexpected 10% mismatch between expected and actual row count in the basic sequential scan:
Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Not dramatic at all, but it still should not occur in this query. It indicates that you might have to tune your autovacuum settings - possibly per table for the very big ones.
More problematic:
Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Postgres expects to find 1875003 rows to delete, while actually 0 rows are found. That's unexpected. Maybe substantially increasing the statistics target on members.contact_id and contacts.id can help to decrease the gap, which might allow better query plans. See:
Keep PostgreSQL from sometimes choosing a bad query plan
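A sketch of that, assuming a statistics target of 1000 (the default is 100) is acceptable:
ALTER TABLE members  ALTER COLUMN contact_id SET STATISTICS 1000;
ALTER TABLE contacts ALTER COLUMN id SET STATISTICS 1000;
ANALYZE members;
ANALYZE contacts;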
3. Avoid table and index bloat
Your ~ 25MM rows in members occupy 23 GB - that's almost 1kb per row, which seems excessive for the table definition you presented (even if the total size you provided should include indexes):
4 bytes item identifier
24 tuple header
8 null bitmap
36 9x integer
16 2x ts
1 1x bool
?? 1x jsonb
See:
Making sense of Postgres row sizes
That's 89 bytes per row - or less with some NULL values - and hardly any alignment padding, so 96 bytes max, plus your jsonb column.
Either that jsonb column is very big, which would make me suggest normalizing the data into separate columns or a separate table. Consider:
How to perform update operations on columns of type JSONB in Postgres 9.4
Or your table is bloated, which can be solved with VACUUM FULL ANALYZE or, while you're at it:
CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;
But either takes an exclusive lock on the table, which you say you cannot afford. pg_repack can do it without exclusive lock. See:
VACUUM returning disk space to operating system
Even if we factor in index sizes, your table seems too big: you have 7 small indexes, each 36 - 44 bytes per row without bloat, less with NULL values, so < 300 bytes altogether.
Either way, consider more aggressive autovacuum settings for your table members. Related:
Aggressive Autovacuum on PostgreSQL
What fillfactor for caching table?
And / or stop bloating the table to begin with. Are you updating rows a lot? Any particular column you update a lot? That jsonb column, maybe? You might move that to a separate (1:1) table, just to stop bloating the main table with dead tuples and keeping autovacuum from doing its job.
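A minimal sketch of such a 1:1 side table (table and column names are hypothetical):
CREATE TABLE member_external_fields (
    member_id        integer PRIMARY KEY REFERENCES members(id),
    external_fields  jsonb
);
-- migrate once, then drop the wide column from members:
-- INSERT INTO member_external_fields
--     SELECT id, external_fields FROM members WHERE external_fields IS NOT NULL;
-- ALTER TABLE members DROP COLUMN external_fields;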
4. Try a BRIN index
Block range indexes require Postgres 9.5 or later and dramatically reduce index size. I was too optimistic in my first draft. A BRIN index is perfect for your use case if you have many rows in members for each contact_id - after physically clustering your table at least once (see ③ for the fitting CLUSTER command). In that case Postgres can rule out whole data pages quickly. But your numbers indicate only around 8 rows per contact_id, so data pages would often contain multiple values, which voids much of the effect. It depends on the actual details of your data distribution ...
On the other hand, as it stands, your tuple size is around 1 kb, so only ~ 8 rows per data page (typically 8kb). If that isn't mostly bloat, a BRIN index might help after all.
But you need to upgrade your server version first. See ①.
CREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);
Update statistics used by the planner and set enable_mergejoin to on:
vacuum analyse members;
vacuum analyse contacts;
set enable_mergejoin to on;
You should get a query plan similar to this one:
explain analyse
delete from contacts
where not exists (
select 1
from members c
where c.contact_id = contacts.id);
QUERY PLAN
----------------------------------------------------------------------
Delete on contacts
-> Merge Anti Join
Merge Cond: (contacts.id = c.contact_id)
-> Index Scan using contacts_pkey on contacts
-> Index Scan using members_contact_id_idx on members c
Here is another variant to try:
DELETE FROM contacts
USING contacts c
LEFT JOIN members m
ON c.id = m.contact_id
WHERE m.contact_id IS NULL;
It uses a technique for deleting from a joined query described here.
I can't vouch for whether this would definitely be faster, but it might be, because it avoids a subquery. I'd be interested in the results...
Using a subquery in the WHERE clause takes a lot of time.
You should use WITH and USING; this will be a lot faster:
with
c_not_member as (
-- extract the ids of contacts that are not in members
SELECT
c.id
FROM contacts c LEFT JOIN members m on c.id = m.contact_id
WHERE
-- to get the contacts that do not exist in members, just
-- use a condition on a members field that cannot be null
-- in this case you have id
m.id is null
-- the only case when m.id is null is when no m.contact_id matches c.id
-- in other words, c.id does not exist in m.contact_id
)
DELETE FROM contacts all_c USING c_not_member WHERE all_c.id = c_not_member.id;

Tuning query using index. Best approach?

I have run into a problem here. I'm using Oracle 11g and I have this query:
SELECT /*+ PARALLEL(16) */
prdecdde,
prdenusi,
prdenpol,
prdeano,
prdedtpr
FROM stat_pro_det
WHERE prdeisin IS NULL AND PRDENUSI IS NOT NULL AND prdedprv = '20160114'
GROUP BY prdecdde,
prdenusi,
prdenpol,
prdeano,
prdedtpr;
I get the next execution plan:
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 53229 | 2287K| | 3652 (4)| 00:00:01 |
| 1 | HASH GROUP BY | | 53229 | 2287K| 3368K| 3652 (4)| 00:00:01 |
|* 2 | TABLE ACCESS BY INDEX ROWID| STAT_PRO_DET | 53229 | 2287K| | 3012 (3)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | STAT_PRO_DET_08 | 214K| | | 626 (4)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("PRDENUSI" IS NOT NULL AND "PRDEISIN" IS NULL)
3 - access("PRDEDPRV"='20160114')
Note
-----
- Degree of Parallelism is 1 because of hint
I still have a lot of CPU cost. The STAT_PRO_DET_08 index is:
CREATE INDEX STAT_PRO_DET_08 ON STAT_PRO_DET(PRDEDPRV)
I've tried adding PRDEISIN and PRDENUSI to the index, putting the most selective first, but with worse results.
This table has 128 million records (yes... maybe we need a partitioned table), but I cannot partition the table for now.
Are there any other suggestions? Could a different index get better results, or can I not do better than this?
Thanks in advance!!!!
EDIT1:
Guys, thanks a lot for all your help, especially #Marmite.
I have a follow-up question, adding these two queries to the subject: should I create one index for each one, or can I have a single index that resolves my performance problem in all three queries?
SELECT /*+ PARALLEL(16) */
prdecdde,
prdenuau,
prdenpol,
prdeano,
prdedtpr
FROM stat_pro_det
WHERE prdeisin IS NULL AND PRDENUSI IS NULL AND prdedprv = '20160114'
GROUP BY prdecdde,
prdenuau,
prdenpol,
prdeano,
prdedtpr;
and
SELECT /*+ PARALLEL(16) */
prdeisin, prdenuau
FROM stat_pro_det, mtauto
WHERE prdedprv = '20160114' AND prdenuau = autonuau AND autoisin IS NULL
GROUP BY prdenuau, prdeisin
First, you might as well rewrite the query as:
SELECT /*+ PARALLEL(16) */ DISTINCT
prdecdde, prdenusi, prdenpol, prdeano, prdedtpr
FROM stat_pro_det
WHERE prdeisin IS NULL AND PRDENUSI IS NOT NULL AND prdedprv = '20160114';
(This is shorter and makes it easier to change the list of columns you are interested in.)
The best index for this query is: stat_pro_det(prdedprv, prdeisin, prdenusi, prdecdde, prdenpol, prdeano, prdedtpr).
The first three columns are important for the WHERE clause and filtering the data. The remaining columns "cover" the query, meaning that the index itself can resolve the query without having to access data pages.
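A sketch of that index (the name is illustrative):
CREATE INDEX stat_pro_det_cov_idx
    ON stat_pro_det (prdedprv, prdeisin, prdenusi, prdecdde, prdenpol, prdeano, prdedtpr);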
First make the following decisions:
you access using an index or using a full table scan
you use parallel query or no_parallel
The general rule is that index access works fine for a small number of accessed records, but does not scale well to a high number.
So the best way is to test all options and see the results.
For parallel FULL TABLE SCAN
use a hint as follows (replace tab with your table name or alias)
SELECT /*+ FULL(tab) PARALLEL(16) */
This scales better, but is not instant for a small number of records.
For index access
Note that this will not be done in parallel. Check the note in your explain plan in the question.
Defining an index containing all columns (as proposed by Gordon), you will perform a (sequential) index range scan without accessing the table.
As noted above, depending on the number of accessed keys this will be quick or slow.
For parallel index access
You need to define a GLOBAL partitioned index
create index tab_idx on tab (col3,col2,col1,col4,col5)
global partition by hash (col3,col2,col1,col4,col5) PARTITIONS 16;
Then hint:
SELECT /*+ INDEX(tab tab_idx) PARALLEL_INDEX(tab,16) */
You will perform the same index range scan, but this time in parallel, so there is a chance that it will respond better than serial execution. Whether you can really open DOP 16 depends, of course, on your database HW settings and configuration...

Index is not being used by optimizer

I have a query which is performing very badly due to a full scan of a table. I have checked the statistics and rebuilt the indexes, but it's not working.
SQL Statement:
select distinct NA_DIR_EMAIL d, NA_DIR_EMAIL r
from gcr_items , gcr_deals
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
order by 1
Execution Plan :
Plan hash value: 3180018891
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 | 00:11:42 |
| 1 | SORT ORDER BY | | 8 | 00:11:42 |
| 2 | HASH UNIQUE | | 8 | 00:11:42 |
|* 3 | HASH JOIN | | 7385 | 00:11:42 |
|* 4 | VIEW | index$_join$_002 | 10462 | 00:00:05 |
|* 5 | HASH JOIN | | | |
|* 6 | INDEX RANGE SCAN | GCR_DEALS_IDX12 | 10462 | 00:00:01 |
| 7 | INDEX FAST FULL SCAN| GCR_DEALS_IDX1 | 10462 | 00:00:06 |
|* 8 | TABLE ACCESS FULL | GCR_ITEMS | 7386 | 00:11:37 |
-------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("GCR_DEALS"."GCR_DEALS_ID"="GCR_ITEMS"."GCR_DEALS_ID")
4 - filter("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
5 - access(ROWID=ROWID)
6 - access("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
8 - filter(DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER("NA_ORG_OWNER_EMAI
L")))=DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER(:P55_DIRECT))))
To begin with, part of the condition in the WHERE clause must be decomposed (or "decompiled" - or "reengineered") into a simpler form without the decode function, a form the query optimizer can understand:
AND
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
into:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
)
To find rows in the table based on values stored in the index, Oracle uses an access method named Index scan, see this link for details:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i52300
One of the most common access method is Index Range Scan see here:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i45075
The documentation says (in the latter link) that:
The optimizer uses a range scan when it finds one or more leading
columns of an index specified in conditions, such as the following:
col1 = :b1
col1 < :b1
col1 > :b1
AND combination of the preceding conditions for leading columns in the
index
col1 like 'ASD%' wild-card searches should not be in a leading
position otherwise the condition col1 like '%ASD' does not result in a
range scan.
The above means that the optimizer is able to use the index to find rows only for query conditions that contain basic comparison operators: = < > <= >= LIKE, used to compare simple values with plain column names. What the documentation doesn't clearly say - you need to deduce it by reading between the lines - is that when some function is used in the condition, in the form function( column_name ) or function( expression_involving_column_names ), then the index range scan cannot be used. In that case the query optimizer must evaluate this expression individually for each row in the table, and thus must read all rows (perform a full table scan).
A short conclusion and a rule of thumb:
Functions in the WHERE clause can prevent the optimizer from using indexes.
If you see some function somewhere in the WHERE clause, it is a sign that you are running a red light: STOP immediately and think three times about how this function impacts the query optimizer and the performance of your query, and try to rewrite the condition into a form that the optimizer is able to understand.
Now take a look at our rewritten condition:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
)
and STOP - there are still two functions, trim and upper, applied to a column named NA_ORG_OWNER_EMAIL. We need to think about how they impact the query optimizer.
I assume that you have created a plain index on a single column: CREATE INDEX somename ON GCR_ITEMS( NA_ORG_OWNER_EMAIL ). If so, then the index contains only plain values of NA_ORG_OWNER_EMAIL.
But the query is trying to find trim(upper(NA_ORG_OWNER_EMAIL)) values, which are not stored in the index, so this index cannot be used in this case.
This condition requires a function based index:
https://docs.oracle.com/cd/E11882_01/appdev.112/e41502/adfns_indexes.htm#ADFNS00505
CREATE INDEX somename ON GCR_ITEMS( trim( upper( NA_ORG_OWNER_EMAIL )))
Unfortunately, even the function-based index will still not help, because the condition in the query is too general: if the value of :P55_DIRECT is 'ALL', the query must retrieve all rows from the table (perform a full table scan); otherwise it must use the index to search for the value.
This is because the query is planned (think of it as "compiled") by the query optimizer only once, during its first execution. The plan is then stored in the cache and used for all further executions. The value of the parameter is not known in advance, so the plan must cover every possible case, and thus will always perform a full table scan.
In 12c there is a new feature, "Adaptive query optimization":
https://docs.oracle.com/database/121/TGSQL/tgsql_optcncpt.htm#TGSQL94982
where the query optimizer analyses the query's parameters on each run, is able to detect that the plan is not optimal for some runtime parameters, and can choose better "subplans" depending on the actual parameter values ... but you must use 12c and additionally pay for Enterprise Edition, because only this edition includes that feature. And it's still not certain whether the adaptive plan will work in this case or not.
What you can do without paying for 12c EE is to DIVIDE this general query into two separate variants, one for the case where :P55_DIRECT = 'ALL' and another for the remaining cases, and run the appropriate variant from the client (your application) depending on the value of this parameter.
A version for :P55_DIRECT = 'ALL', which will perform a full table scan:
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
order by 1
and a version for the other cases, which will use the function-based index:
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
trim(upper(:P55_DIRECT)) = trimm(upper(NA_ORG_OWNER_EMAIL))
order by 1

PostgreSQL Index not used when data rows are large

Hi, I'm curious why the index isn't used once the table has more data rows - even just 100.
Here's the select with 10 rows of data:
mydb> explain select * from data where user_id=1;
+-----------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------|
| Index Scan using ix_data_user_id on data (cost=0.14..8.15 rows=1 width=2043) |
| Index Cond: (user_id = 1) |
+-----------------------------------------------------------------------------------+
EXPLAIN
Here's the select with 100 rows of data:
mydb> explain select * from data where user_id=1;
+------------------------------------------------------------+
| QUERY PLAN |
|------------------------------------------------------------|
| Seq Scan on data (cost=0.00..44.67 rows=1414 width=945) |
| Filter: (user_id = 1) |
+------------------------------------------------------------+
EXPLAIN
How can I get the index to be used when there are 100 data rows?
100 is not a large amount of data. Think 10,000 or 100,000 rows for a respectable amount.
To put it simply, records in a table are stored on data pages. A data page typically holds about 8k bytes (it depends on the database and on settings). A major purpose of indexes is to reduce the number of data pages that need to be read.
If all the records in a table fit on one page, there is no need to reduce the number of pages being read: the one page will be read regardless. Hence, the index may not be particularly useful.
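A quick way to see how many pages the table actually occupies (a sketch; relpages and reltuples are estimates maintained by VACUUM and ANALYZE):
SELECT relpages, reltuples
FROM pg_class
WHERE relname = 'data';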