Why does this SUM() function take so long in PostgreSQL? - sql

This is my query:
SELECT SUM(amount) FROM bill WHERE name = 'peter'
There are 800K+ rows in the table. EXPLAIN ANALYZE says:
Aggregate (cost=288570.06..288570.07 rows=1 width=4) (actual time=537213.327..537213.328 rows=1 loops=1)
-> Seq Scan on bill (cost=0.00..288320.94 rows=498251 width=4) (actual time=48385.201..535941.041 rows=800947 loops=1)
Filter: ((name)::text = 'peter'::text)
Rows Removed by Filter: 8
Total runtime: 537213.381 ms
Almost all rows match the filter, and that is correct. But why does it take so long? A similar query without the WHERE clause runs much faster:
EXPLAIN ANALYZE SELECT SUM(amount) FROM bill
Aggregate (cost=137523.31..137523.31 rows=1 width=4) (actual time=2198.663..2198.664 rows=1 loops=1)
-> Index Only Scan using idx_amount on bill (cost=0.00..137274.17 rows=498268 width=4) (actual time=0.032..1223.512 rows=800955 loops=1)
Heap Fetches: 533399
Total runtime: 2198.717 ms
I have an index on amount and an index on name. Have I missed any indexes?
PS: I managed to solve the problem just by adding a new index ON bill(name, amount). I don't understand why it helped, so let's leave the question open for a while...

Since you are searching for a specific name, you should have an index that has name as the first column, e.g. CREATE INDEX IX_bill_name ON bill( name ).
But Postgres can still opt for a full table scan if it estimates that the index is not selective enough, i.e. if it thinks it is faster to scan all rows and pick out the matching ones rather than consult the index and jump around in the table to gather them. Postgres uses cost-based estimation that weights random disk reads as more expensive than sequential reads.
For the index to actually be used in your situation, roughly no more than 10% of the rows should match what you are searching for. Since almost all of your rows have name = 'peter', it really is faster to do a full table scan.
As to why the SUM without filtering runs faster: that has to do with the overall width of the table. With a WHERE clause, Postgres has to sequentially read every row in the table so it can discard those that do not match the filter. Without a WHERE clause, Postgres can instead read all the amounts from the index. Because the index on amount contains the amounts and pointers to the corresponding rows, but no other data from the table, there is simply less data to wade through. Based on the big difference in performance, I would guess you have quite a lot of other columns in your table.
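That also explains why the composite index from your PS helps: with an index on (name, amount), both the filter column and the summed column are available in the index, so Postgres can answer the query with an index-only scan instead of reading the wide heap rows. A minimal sketch (the index name is just illustrative):
CREATE INDEX bill_name_amount_idx ON bill (name, amount);
VACUUM ANALYZE bill;  -- keeps the visibility map fresh so the index-only scan avoids heap fetches
EXPLAIN ANALYZE SELECT SUM(amount) FROM bill WHERE name = 'peter';
-- expected: Index Only Scan using bill_name_amount_idx on bill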

Related

Postgres select query making sequential scan instead of index scan on table with 18 Million rows

I have a Postgres table with almost 18 million rows and I am trying to run this query:
select * from answer where submission_id = 5 and deleted_at is NULL;
There is a partial index on the table on column submission_id. This is the command used to create the index:
CREATE INDEX answer_submission_id ON answer USING btree (submission_id) WHERE (deleted_at IS NULL)
This is the EXPLAIN ANALYZE output of the above select query:
Gather (cost=1000.00..3130124.70 rows=834 width=377) (actual time=7607.568..7610.130 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2144966 read=3
I/O Timings: read=6.169
-> Parallel Seq Scan on answer (cost=0.00..3129041.30 rows=348 width=377) (actual time=6501.951..7604.623 rows=1 loops=3)
Filter: ((deleted_at IS NULL) AND (submission_id = 5))
Rows Removed by Filter: 62213625
Buffers: shared hit=2144966 read=3
I/O Timings: read=6.169
Planning Time: 0.117 ms
Execution Time: 7610.154 ms
Ideally it should pick the answer_submission_id index, but Postgres is going for a sequential scan.
Any help would be really appreciated.
The execution plan shows a deviation between the estimated row count and the actual row count.
The PostgreSQL optimizer is a cost-based optimizer (CBO): it runs each query with the cheapest execution plan it can find, so wrong statistics can make it choose a bad execution plan.
Here is a link illustrating how wrong statistics can cause a slow query:
Why are bad row estimates slow in Postgres?
First, I would use this query to check when the table was last analyzed and vacuumed:
SELECT
    schemaname, relname,
    last_vacuum, last_autovacuum,
    vacuum_count, autovacuum_count,
    last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'tablename';
If your statistics are wrong, you can run ANALYZE "tablename" to collect fresh statistics for the table. How long the ANALYZE scan takes depends on the table size.
For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run. To avoid this, raise the amount of statistics collected by ANALYZE, as described below.
When we UPDATE or DELETE data, that creates dead tuples which may still exist in the heap or indexes but can no longer be queried; VACUUM reclaims the storage occupied by those dead tuples.
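For the table in the question, a minimal sketch (ANALYZE alone is usually enough to fix the row estimates):
ANALYZE answer;            -- recollect planner statistics for the table
VACUUM (ANALYZE) answer;   -- additionally reclaim dead tuples, then analyze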

Prevent usage of index for a particular query in Postgres

I have a slow query in a Postgres DB. Using EXPLAIN ANALYZE, I can see that Postgres does a bitmap index scan on two different indexes, followed by a BitmapAnd of the two resulting sets.
Deleting one of the indexes makes the evaluation ten times faster (bitmap index scan is still used on the first index). However, that deleted index is useful in other queries.
Query:
select
booking_id
from
booking
where
substitute_confirmation_token is null
and date_trunc('day', from_time) >= cast('01/25/2016 14:23:00.004' as date)
and from_time >= '01/25/2016 14:23:00.004'
and type = 'LESSON_SUBSTITUTE'
and valid
order by
booking_id;
Indexes:
"idx_booking_lesson_substitute_day" btree (date_trunc('day'::text, from_time)) WHERE valid AND type::text = 'LESSON_SUBSTITUTE'::text
"booking_substitute_confirmation_token_key" UNIQUE CONSTRAINT, btree (substitute_confirmation_token)
Query plan:
Sort (cost=287.26..287.26 rows=1 width=8) (actual time=711.371..711.377 rows=44 loops=1)
Sort Key: booking_id
Sort Method: quicksort Memory: 27kB
Buffers: shared hit=8 read=7437 written=1
-> Bitmap Heap Scan on booking (cost=275.25..287.25 rows=1 width=8) (actual time=711.255..711.294 rows=44 loops=1)
Recheck Cond: ((date_trunc('day'::text, from_time) >= '2016-01-25'::date) AND valid AND ((type)::text = 'LESSON_SUBSTITUTE'::text) AND (substitute_confirmation_token IS NULL))
Filter: (from_time >= '2016-01-25 14:23:00.004'::timestamp without time zone)
Buffers: shared hit=5 read=7437 written=1
-> BitmapAnd (cost=275.25..275.25 rows=3 width=0) (actual time=711.224..711.224 rows=0 loops=1)
Buffers: shared hit=5 read=7433 written=1
-> Bitmap Index Scan on idx_booking_lesson_substitute_day (cost=0.00..20.50 rows=594 width=0) (actual time=0.080..0.080 rows=72 loops=1)
Index Cond: (date_trunc('day'::text, from_time) >= '2016-01-25'::date)
Buffers: shared hit=5 read=1
-> Bitmap Index Scan on booking_substitute_confirmation_token_key (cost=0.00..254.50 rows=13594 width=0) (actual time=711.102..711.102 rows=2718734 loops=1)
Index Cond: (substitute_confirmation_token IS NULL)
Buffers: shared read=7432 written=1
Total runtime: 711.436 ms
Can I prevent using a particular index for a particular query in Postgres?
Your clever solution
You already found a clever solution for your particular case: A partial unique index that only covers rare values, so Postgres won't (can't) use the index for the common NULL value.
CREATE UNIQUE INDEX booking_substitute_confirmation_uni
ON booking (substitute_confirmation_token)
WHERE substitute_confirmation_token IS NOT NULL;
It's a textbook use-case for a partial index. Literally! The manual has a similar example and this perfectly matching advice to go with it:
Finally, a partial index can also be used to override the system's
query plan choices. Also, data sets with peculiar distributions might
cause the system to use an index when it really should not. In that
case the index can be set up so that it is not available for the
offending query. Normally, PostgreSQL makes reasonable choices about
index usage (e.g., it avoids them when retrieving common values, so
the earlier example really only saves index size, it is not required
to avoid index usage), and grossly incorrect plan choices are cause
for a bug report.
Keep in mind that setting up a partial index indicates that you know
at least as much as the query planner knows, in particular you know
when an index might be profitable. Forming this knowledge requires
experience and understanding of how indexes in PostgreSQL work. In
most cases, the advantage of a partial index over a regular index will
be minimal.
You commented: The table has a few million rows and just a few thousand rows with non-null values, so this is a perfect use-case. It will even speed up queries on non-null values of substitute_confirmation_token because the index is much smaller now.
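For example, a lookup by token (the value here is hypothetical) can now be served from the much smaller partial index:
SELECT booking_id
FROM   booking
WHERE  substitute_confirmation_token = 'some-token';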
Answer to question
To answer your original question: it's not possible to "disable" an existing index for a particular query. You would have to drop it, but that's way too expensive.
Fake drop index
You could drop an index inside a transaction, run your SELECT and then, instead of committing, use ROLLBACK. That's fast, but be aware that (per documentation):
A normal DROP INDEX acquires exclusive lock on the table, blocking
other accesses until the index drop can be completed.
So this is no good for multi-user environments.
BEGIN;
DROP INDEX big_user_id_created_at_idx;
SELECT ...;
ROLLBACK; -- so the index is preserved after all
More detailed statistics
Normally, though, it should be enough to raise the STATISTICS target for the column, so Postgres can more reliably identify common values and avoid the index for those. Try:
ALTER TABLE booking ALTER COLUMN substitute_confirmation_token SET STATISTICS 2000;
Then: ANALYZE booking; before you try your query again. 2000 is an example value. Related:
Keep PostgreSQL from sometimes choosing a bad query plan

Postgres not using index when index scan is much better option

I have a simple query joining two tables that is really slow. I found out that the query plan does a seq scan on the large table email_activities (~10m rows), while I think using indexes with nested loops would actually be faster.
I rewrote the query using a subquery in an attempt to force the use of the index, and noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of the subquery to 43k rows, the plan uses the index on email_activities, while setting the limit in the subquery to even 44k causes the plan to use a seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.
What could cause this? Is there a config somewhere that forces the use of a hash join if one of the sets is larger than a certain size?
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
-> Nested Loop (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
-> HashAggregate (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
-> Limit (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
-> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
Index Cond: (email_campaign_id = 1607)
-> Index Only Scan using index_email_activities_on_email_recipient_id on email_activities (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
Index Cond: (email_recipient_id = email_recipients.id)
Heap Fetches: 40789
Total runtime: 224.675 ms
And:
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
-> Hash Semi Join (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
-> Seq Scan on email_activities (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
-> Hash (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 1758kB
-> Limit (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
-> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
Index Cond: (email_campaign_id = 1607)
Total runtime: 3050.660 ms
Version: PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
email_activities: ~10m rows
email_recipients: ~11m rows
Index (Only) Scan --> Bitmap Index Scan --> Sequential Scan
For a few rows it pays to run an index scan. If enough data pages are visible to all transactions (= vacuumed enough, and not too much concurrent write load) and the index can provide all the column values needed, then a faster index-only scan is used. With more rows expected to be returned (a higher percentage of the table, depending on data distribution, value frequencies and row width) it becomes more likely to find several of them on one data page. Then it pays to switch to a bitmap index scan. (Or to combine multiple distinct indexes.) Once a large percentage of data pages has to be visited anyway, it's cheaper to run a sequential scan, filter the surplus rows and skip the overhead of indexes altogether.
Index usage becomes (much) cheaper and more likely when accessing data pages in random order is not (much) more expensive than accessing them in sequential order. That's the case when using SSDs instead of spinning disks, and even more so the more data is cached in RAM - provided the respective configuration parameters random_page_cost and effective_cache_size are set accordingly.
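A quick way to inspect and experiment with those two parameters in the current session (the values below are illustrative for SSD storage with plenty of RAM, not recommendations for your hardware):
SHOW random_page_cost;             -- default 4.0, tuned for spinning disks
SHOW effective_cache_size;
SET random_page_cost = 1.1;        -- session only; make permanent in postgresql.conf
SET effective_cache_size = '8GB';  -- should roughly reflect the RAM available for caching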
In your case, Postgres switches to a sequential scan, expecting to find rows=263962, that's already 3 % of the whole table. (While only rows=47935 are actually found, see below.)
More in this related answer:
Efficient PostgreSQL query on timestamp using index or bitmap index scan?
Beware of forcing query plans
You cannot force a certain planner method directly in Postgres, but you can make other methods seem extremely expensive for debugging purposes. See Planner Method Configuration in the manual.
SET enable_seqscan = off (as suggested in another answer) does that to sequential scans. But that's intended for debugging purposes in your session only. Do not use this as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. The manual:
These configuration parameters provide a crude method of influencing
the query plans chosen by the query optimizer. If the default plan
chosen by the optimizer for a particular query is not optimal, a
temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to
improve the quality of the plans chosen by the optimizer include
adjusting the planner cost constants (see Section 19.7.2),
running ANALYZE manually, increasing the value of the
default_statistics_target configuration parameter, and
increasing the amount of statistics collected for specific columns
using ALTER TABLE SET STATISTICS.
That's already most of the advice you need.
Keep PostgreSQL from sometimes choosing a bad query plan
In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:
estimated rows=227007 vs. actual ... rows=40789
estimated rows=263962 vs. actual ... rows=47935
If you run this query often it will pay to have ANALYZE look at a bigger sample for more accurate statistics on the particular column. Your table is big (~ 10M rows), so make that:
ALTER TABLE email_activities ALTER COLUMN email_recipient_id
SET STATISTICS 3000; -- max 10000, default 100
Then ANALYZE email_activities;
Measure of last resort
In very rare cases you might resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction, or in a function with its own environment. Like:
CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
RETURNS bigint AS
$func$
SELECT COUNT(DISTINCT a.email_recipient_id)
FROM email_activities a
WHERE a.email_recipient_id IN (
SELECT id
FROM email_recipients
WHERE email_campaign_id = $1
LIMIT $2) -- or consider query below
$func$ LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;
The setting only applies to the local scope of the function.
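Calling it with the values from the question, as a usage sketch:
SELECT f_count_dist_recipients(1607, 43000);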
Warning: This is just a proof of concept. Even this much less radical manual intervention might bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings, everything changes over time. You are going to upgrade to a new Postgres version. The query plan you force now, may become a very bad idea later.
And typically this is just a workaround for a problem with your setup. Better find and fix it.
Alternative query
Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id) - increasingly so for a bigger LIMIT.
SELECT COUNT(*) AS ct
FROM (
SELECT id
FROM email_recipients
WHERE email_campaign_id = 1607
LIMIT 43000
) r
WHERE EXISTS (
SELECT 1 FROM email_activities
WHERE email_recipient_id = r.id);
A sequential scan can be more efficient, even when an index exists. In this case, Postgres seems to estimate things rather badly.
An ANALYZE <table> on all related tables can help in such cases. If it doesn't, you can set the variable enable_seqscan to off to force Postgres to use an index whenever technically possible, at the expense that sometimes an index scan will be used when a sequential scan would perform better.
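A sketch of that advice applied to the tables from the question (the enable_seqscan part is a session-level debugging aid only; see the warnings in the other answer):
ANALYZE email_activities;
ANALYZE email_recipients;
SET enable_seqscan = off;   -- current session only
-- re-run the EXPLAIN ANALYZE from the question here
RESET enable_seqscan;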

How to manually update the statistics data of tables in PostgreSQL

The ANALYZE statement can be used in PostgreSQL to collect statistics about tables. However, I do not want to actually insert the data into the tables; I just need to evaluate the cost of some queries. Is there any way to manually specify the statistics of a table in PostgreSQL without actually putting data into it?
I think you are muddling ANALYZE with EXPLAIN ANALYZE. They are different things.
If you want query costs and timing without applying the changes, the only real option you have is to begin a transaction, execute the query under EXPLAIN ANALYZE, and then ROLLBACK.
This still executes the query, meaning that:
CPU time and I/O are consumed
Locks are still taken and held for the duration
New rows are actually written to the tables and indexes, but are never marked visible. They are cleaned up in the next VACUUM.
You can already EXPLAIN ANALYSE a query even with no inserted data; it will help you get a feel for the execution plan.
But there's nothing like real data :)
What you can do, as a workaround, is BEGIN a transaction, INSERT some data, EXPLAIN ANALYSE your query, then ROLLBACK the transaction.
Example:
mydatabase=# BEGIN;
BEGIN
mydatabase=# INSERT INTO auth_message (user_id, message) VALUES (1, 'foobar');
INSERT 0 1
mydatabase=# EXPLAIN ANALYSE SELECT count(*) FROM auth_message;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Aggregate (cost=24.50..24.51 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
-> Seq Scan on auth_message (cost=0.00..21.60 rows=1160 width=0) (actual time=0.007..0.008 rows=1 loops=1)
Total runtime: 0.042 ms
(3 rows)
mydatabase=# ROLLBACK;
ROLLBACK
mydatabase=# EXPLAIN ANALYSE SELECT count(*) FROM auth_message;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Aggregate (cost=24.50..24.51 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
-> Seq Scan on auth_message (cost=0.00..21.60 rows=1160 width=0) (actual time=0.009..0.009 rows=0 loops=1)
Total runtime: 0.043 ms
(3 rows)
The 1st EXPLAIN ANALYSE shows that there was some "temporary" data (rows=1).
This is not strictly a "mock", but at least PostgreSQL's plan execution (and the various optimizations it could apply) should, IMHO, be better than with no data at all (disclaimer: purely intuitive).

Indexing a column used to ORDER BY with a constraint in PostgreSQL

I've got a modest table of about 10k rows that is often sorted by a column called 'name'. So, I added an index on this column. Now selects on it are fast:
EXPLAIN ANALYZE SELECT * FROM crm_venue ORDER BY name ASC LIMIT 10;
...query plan...
Limit (cost=0.00..1.22 rows=10 width=154) (actual time=0.029..0.065 rows=10 loops=1)
-> Index Scan using crm_venue_name on crm_venue (cost=0.00..1317.73 rows=10768 width=154) (actual time=0.026..0.050 rows=10 loops=1)
Total runtime: 0.130 ms
If I increase the LIMIT to 60 (which is roughly what I use in the application) the total runtime doesn't increase much further.
Since I'm using a "logical delete pattern" on this table, I only consider entries where delete_date IS NULL. So this is a common select I make:
SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 10;
To make that query snappy as well I put an index on the name column with a constraint like this:
CREATE INDEX name_delete_date_null ON crm_venue (name) WHERE delete_date IS NULL;
Now it's fast to do the ordering with the logical delete constraint:
EXPLAIN ANALYZE SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 10;
Limit (cost=0.00..84.93 rows=10 width=154) (actual time=0.020..0.039 rows=10 loops=1)
-> Index Scan using name_delete_date_null on crm_venue (cost=0.00..458.62 rows=54 width=154) (actual time=0.018..0.033 rows=10 loops=1)
Total runtime: 0.076 ms
Awesome! But this is where I get myself into trouble. The application rarely calls for the first 10 rows. So, let's select some more rows:
EXPLAIN ANALYZE SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 20;
Limit (cost=135.81..135.86 rows=20 width=154) (actual time=18.171..18.189 rows=20 loops=1)
-> Sort (cost=135.81..135.94 rows=54 width=154) (actual time=18.168..18.173 rows=20 loops=1)
Sort Key: name
Sort Method: top-N heapsort Memory: 21kB
-> Bitmap Heap Scan on crm_venue (cost=4.67..134.37 rows=54 width=154) (actual time=2.355..8.126 rows=10768 loops=1)
Recheck Cond: (delete_date IS NULL)
-> Bitmap Index Scan on crm_venue_delete_date_null_idx (cost=0.00..4.66 rows=54 width=0) (actual time=2.270..2.270 rows=10768 loops=1)
Index Cond: (delete_date IS NULL)
Total runtime: 18.278 ms
As you can see, it goes from 0.1 ms to 18 ms!
Clearly what happens is that there's a point where the ordering can no longer use the index to run the sort. I noticed that as I increase the LIMIT number from 20 to higher numbers it always takes around 20-25 ms.
Am I doing it wrong, or is this a limitation of PostgreSQL? What is the best way to set up indexes for this type of query?
My guess would be that, since the index is logically a set of pointers to rows on a set of data pages, if a fetched page is known to contain ONLY rows covered by the partial index (delete_date IS NULL), the executor doesn't have to recheck that page to filter out the rows it doesn't want.
Therefore, it may be that when you do LIMIT 10 ordered by name, the first 10 rows returned by the index all sit on data pages made up entirely of such rows. Since it knows those pages are homogeneous, it doesn't have to recheck them once they're fetched from disk. Once you increase to LIMIT 20, at least one of the first 20 rows is on a mixed page that also holds rows not covered by the index. That forces the executor to recheck each record, since it can't fetch data pages in less than one-page increments from either the disk or the cache.
As an experiment, try creating an index on (delete_date, name) and issuing CLUSTER crm_venue USING that new index. This should rebuild the table in the sort order of delete_date, then name. Just to be super-sure, you should then issue a REINDEX TABLE crm_venue. Now try the query again. Since all the rows with a NULL delete_date will be clustered together on disk, it may work faster with the larger LIMIT values.
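A sketch of that experiment (the index name is made up for illustration):
CREATE INDEX crm_venue_delete_date_name_idx ON crm_venue (delete_date, name);
CLUSTER crm_venue USING crm_venue_delete_date_name_idx;  -- rewrites the table in index order; takes an exclusive lock
REINDEX TABLE crm_venue;
ANALYZE crm_venue;  -- refresh statistics after the rewrite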
Of course, this is all just off-the-cuff theory, so YMMV...
As you increase the number of rows requested, the selectivity changes. I am not sure, but it could be that, because a greater number of rows is needed from the table, enough table blocks have to be read that those plus the index blocks make the index no longer worth using. This may be a miscalculation by the planner. Also, name (the indexed field) is not the field limiting the scope of the index, which may be wreaking havoc with the planner's math.
Things to try:
Increase the amount of the table sampled when building your statistics; your data may be skewed in such a way that the statistics are not capturing a truly representative sample (see the sketch after this list).
Index all rows, not just the NULL rows, and see which is better. You could even try indexing WHERE delete_date IS NOT NULL.
Cluster based on an index on that field to reduce the data blocks required and turn it into a range scan.
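A sketch of the first two suggestions (index names and the statistics target are illustrative):
ALTER TABLE crm_venue ALTER COLUMN delete_date SET STATISTICS 1000;  -- sample more rows
ANALYZE crm_venue;
CREATE INDEX crm_venue_name_all_idx ON crm_venue (name);                                    -- all rows
CREATE INDEX crm_venue_name_deleted_idx ON crm_venue (name) WHERE delete_date IS NOT NULL;  -- only deleted rows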
Nulls and indexes do not always play nice. Try another way:
alter table crm_venue add column delete_flag char;
update crm_venue set delete_flag = 'Y' where delete_date is not null;
update crm_venue set delete_flag = 'N' where delete_date is null;
create index deleted_venue on crm_venue (delete_flag) where delete_flag = 'N';
SELECT * FROM crm_venue WHERE delete_flag = 'N' ORDER BY name ASC LIMIT 20;