Question in short
How can I delete around 30M rows from a table with 3B rows without blowing up my PostgreSQL server?
Question in detail
I am using a PostgreSQL database (AWS RDS, db.t4g.medium). I have one table that keeps track of the daily stocks of various items for my customers, with approximately the following columns:
supplier_id: foreign key to customer table
retailer_id: foreign key to customer table
ean: varchar(13), not indexed
datetime: indexed
quantity: integer, not indexed
As of today, this table has 3 billion rows. Now I want to delete all rows for a given supplier. The naive approach would be to execute:
DELETE FROM stocks
WHERE supplier_id = 200
This leaves the database unresponsive for at least an hour, at which point I killed the query (since it was making my entire webserver unresponsive).
I then split the deletion into batches, one per day. Since I am using Python+Django, it is easy to generate these SQL queries automatically.
DELETE FROM STOCKS
WHERE supplier_id=200
AND datetime >= '2022-08-19T00:00:00+00:00'::timestamptz
AND datetime < '2022-08-20T00:00:00+00:00'::timestamptz
This solved the issue for the smaller suppliers (~1M rows), but for a supplier with 30M rows it still blows up: after an hour, it had not finished deleting even a single day.
My question is: how can I make sure that this deletion occurs without consuming all resources of the database instance? Are there more clever ways to split it up? Are there other techniques available that I don't know about? I do not care too much about how long it takes (this is not a regularly recurring operation), as long as it is stable.
Additional info
This is the query plan for the deletion of all stocks for a supplier:
EXPLAIN
DELETE FROM stocks
WHERE supplier_id = 158;
Delete on stocks (cost=0.00..80052412.00 rows=0 width=0)
-> Seq Scan on sa_inventory_stocktake (cost=0.00..80052412.00 rows=36725650 width=6)
Filter: (supplier_id = 158)
And this is the query plan for deleting the stocks of a supplier for a given day:
EXPLAIN
DELETE FROM stocks
WHERE supplier_id = 158
AND datetime >= '2022-08-19T00:00:00+00:00'::timestamptz
AND datetime < '2022-08-20T00:00:00+00:00'::timestamptz;
Delete on stocks (cost=92677221.21..92898902.60 rows=0 width=0)
-> Bitmap Heap Scan on stocks (cost=92677221.21..92898902.60 rows=56759 width=6)
Recheck Cond: ((supplier_id = 158) AND (datetime >= '2022-08-19 00:00:00+00'::timestamp with time zone) AND (datetime < '2022-08-20 00:00:00+00'::timestamp with time zone))
-> BitmapAnd (cost=92677221.21..92677221.21 rows=56759 width=0)
-> Bitmap Index Scan on stocks_supplier_id_c50e0b94 (cost=0.00..892175.08 rows=36725650 width=0)
Index Cond: (supplier_id = 158)
-> Bitmap Index Scan on stocks_lookup (cost=0.00..91785017.50 rows=4614593 width=0)
Index Cond: ((datetime >= '2022-08-19 00:00:00+00'::timestamp with time zone) AND (datetime < '2022-08-20 00:00:00+00'::timestamp with time zone))
There are the following indices:
BTREE on supplier_id, is_unique=False
BTREE on retailer_id, is_unique=False
BTREE on retailer_id,datetime, is_unique=False
The code below is just an idea. To avoid locking too many rows at a time, we delete blockwise using PostgreSQL's physical row location, the CTID: we delete the first n matches, then the next n, and so on, and by remembering the largest CTID deleted so far we avoid rescanning disk blocks we have already processed.
I am not a PostgreSQL developer, so treat this as a sketch rather than production-ready code. PostgreSQL's RETURNING clause does not allow aggregation directly, which is why the DELETE is wrapped in a CTE and the aggregation is done over the returned CTIDs; ordering and max() on the tid type also require a sufficiently recent PostgreSQL version.
create or replace procedure delete_from_big_table(p_supplier_id int)
as
$$
declare
  v_max_ctid tid := '(0,0)';
  v_count    int;
begin
  loop
    -- grab the next block of matching rows, in physical (ctid) order,
    -- starting just after the last block already processed
    with next_rows as
    (
      select ctid
      from big_table
      where supplier_id = p_supplier_id
        and ctid > v_max_ctid
      order by ctid
      fetch first 50000 rows only
    ), deleted as
    (
      delete from big_table
      where big_table.ctid in (select ctid from next_rows)
      returning big_table.ctid
    )
    -- RETURNING itself cannot aggregate, so aggregate over the CTE instead
    select max(ctid), count(*)
      into v_max_ctid, v_count
      from deleted;

    if v_count > 0 then
      commit;   -- release locks and make the batch visible
    else
      exit;     -- nothing left to delete
    end if;
  end loop;
end;
$$ language plpgsql;
This will be slow, but who cares as long as all the other processes can run at about normal speed.
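Assuming the procedure above, it would be invoked roughly like this; note that a procedure that COMMITs internally has to be called outside an explicit transaction block (e.g. with autocommit on in psql):
CALL delete_from_big_table(200);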
Related
I have a PostgreSQL 13 database with a table named cache_record, hosted on Amazon RDS.
This is the table's definition:
CREATE TABLE cache_record
(
key text NOT NULL,
type text NOT NULL,
value bytea NOT NULL,
expiration timestamptz NOT NULL,
created_at timestamptz NOT NULL DEFAULT NOW(),
updated_at timestamptz NOT NULL DEFAULT NOW(),
CONSTRAINT cache_record_pkey PRIMARY KEY (key)
)
WITH (
OIDS = FALSE
);
CREATE INDEX cache_record_expiration_idx
ON cache_record USING btree
(expiration ASC NULLS LAST);
The table itself is not referenced by any foreign key (so no indexing/trigger issue) and only contains ~ 30000 rows. The value field does not exceed 1 MB in length on each row, with less than 50 bytes for 50% of the rows. Normally, DELETEs are performed as such:
DELETE FROM cache_record
WHERE expiration < NOW();
There are ~10,000 expired rows to delete in the table, but this query takes too long to execute and the batch job that runs it times out. So I decided to split it into batches and execute them manually from a shell:
DELETE FROM cache_record
WHERE key IN (SELECT key
FROM cache_record
WHERE expiration < NOW()
ORDER BY created_at
LIMIT 100)
One batch of 100 rows takes ~ 30 s to execute, which is absurd. The nested SELECT itself executes a lot faster than the nesting DELETE (with or without LIMIT).
The query never caused any issue until yesterday, when the CRON batch that is supposed to purge entries from the table started to timeout (30 s). Although, it's entirely possible that the query has always been slow but was just under the timeout threshold until yesterday.
What could be causing the slowness?
Edit 2023-01-20
I ran the query using EXPLAIN as suggested in the comments:
EXPLAIN (ANALYSE, BUFFERS) DELETE FROM cache_record WHERE expiration < NOW();
I purged the table yesterday so the query only had a few hits, but it's enough to show the speed issue (> 10 s of execution time):
Delete on cache_record (cost=14.28..501.73 rows=257 width=6) (actual time=10595.107..10595.109 rows=0 loops=1)
Buffers: shared hit=200819 read=43245 dirtied=42783 written=9470
I/O Timings: read=3037.437 write=73.217
-> Bitmap Heap Scan on cache_record (cost=14.28..501.73 rows=257 width=6) (actual time=0.528..29.769 rows=551 loops=1)
Recheck Cond: (expiration < now())
Heap Blocks: exact=88
Buffers: shared hit=10 read=85 dirtied=34 written=21
I/O Timings: read=2.006 write=0.161
-> Bitmap Index Scan on cache_record_expiration_idx (cost=0.00..14.22 rows=257 width=0) (actual time=0.030..0.031 rows=551 loops=1)
Index Cond: (expiration < now())
Buffers: shared hit=7
Planning:
Buffers: shared hit=56
Planning Time: 0.324 ms
Execution Time: 10595.676 ms
Based on the large number of buffers read and dirtied, which show up only on the DELETE node, I would say your time is being spent maintaining the TOAST table, i.e. deleting the huge "value" column. I don't know why it wasn't a problem before; maybe you were naturally deleting only a few records at a time, or maybe you were mostly deleting smaller records. You said 50% are below 50 bytes, but maybe that 50% is not evenly distributed and you just hit a big slug of large ones.
As for the speed of the SELECT: when you only select the "key" column, it doesn't need to access the TOAST records for the "value" column, so it doesn't spend any time doing so.
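If you want to check how much of the table actually lives in TOAST, a hedged way to look (pg_table_size includes the TOAST data, pg_relation_size only the main heap):
SELECT pg_size_pretty(pg_relation_size('cache_record')) AS heap_size,
       pg_size_pretty(pg_table_size('cache_record')
                      - pg_relation_size('cache_record')) AS toast_and_maps_size;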
I have the following table:
create table if not exists inventory
(
expired_at timestamp(0),
-- ...
);
create index if not exists inventory_expired_at_index
on inventory (expired_at);
However, when I run the following query:
EXPLAIN UPDATE "inventory" SET "status" = 'expired' WHERE "expired_at" < '2020-12-08 12:05:00';
I get the following execution plan:
Update on inventory (cost=0.00..4.09 rows=2 width=126)
-> Seq Scan on inventory (cost=0.00..4.09 rows=2 width=126)
Filter: (expired_at < '2020-12-08 12:05:00'::timestamp without time zone)
The same happens for a big dataset:
EXPLAIN SELECT * FROM "inventory" WHERE "expired_at" < '2020-12-08 12:05:00';
-[ RECORD 1 ]---------------------------------------------------------------------------
QUERY PLAN | Seq Scan on inventory (cost=0.00..58616.63 rows=1281058 width=71)
-[ RECORD 2 ]---------------------------------------------------------------------------
QUERY PLAN | Filter: (expired_at < '2020-12-08 12:05:00'::timestamp without time zone)
The question is: why a Seq Scan and not an Index Scan?
This is a bit long for a comment.
The short answer is that you have two rows in the table, so it doesn't make a difference.
The longer answer is that you are using an UPDATE, so the data rows have to be retrieved anyway. Using an index means loading both the index and the data rows and then following the index pointers to the data rows, which is a little more work. With two rows, it is not worth the effort at all.
The power of indexes is to handle large amounts of data, not small amounts of data.
To respond to the larger question: database optimizers are not required to use an index. They use some sort of measure (usually cost-based optimization) to determine whether an index is appropriate. In your larger example, the optimizer has determined that the index is not appropriate. This can happen if the statistics are out of sync with the underlying data.
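If stale statistics are the suspect, a quick, hedged check is to refresh them and look at the plan again:
ANALYZE inventory;
EXPLAIN UPDATE "inventory" SET "status" = 'expired' WHERE "expired_at" < '2020-12-08 12:05:00';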
I have a simple count query that can use an Index Only Scan, but it still takes very long in PostgreSQL!
I have a cars table with two relevant columns, type bigint and active boolean, and a multi-column index on those columns:
CREATE TABLE cars
(
id BIGSERIAL NOT NULL
CONSTRAINT cars_pkey PRIMARY KEY ,
type BIGINT NOT NULL ,
name VARCHAR(500) NOT NULL ,
active BOOLEAN DEFAULT TRUE NOT NULL,
created_at TIMESTAMP(0) WITH TIME ZONE default NOW(),
updated_at TIMESTAMP(0) WITH TIME ZONE default NOW(),
deleted_at TIMESTAMP(0) WITH TIME ZONE
);
CREATE INDEX cars_type_active_index ON cars(type, active);
I inserted some test data with 950k records; type=1 has 600k records:
INSERT INTO cars (type, name) (SELECT 1, 'car-name' FROM generate_series(1,600000));
INSERT INTO cars (type, name) (SELECT 2, 'car-name' FROM generate_series(1,200000));
INSERT INTO cars (type, name) (SELECT 3, 'car-name' FROM generate_series(1,100000));
INSERT INTO cars (type, name) (SELECT 4, 'car-name' FROM generate_series(1,50000));
Let's run VACUUM ANALYZE and force PostgreSQL to use an Index Only Scan:
VACUUM ANALYSE;
SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;
OK, I have a simple query on type and active
EXPLAIN (VERBOSE, BUFFERS, ANALYSE)
SELECT count(*)
FROM cars
WHERE type = 1 AND active = true;
Result:
Aggregate (cost=24805.70..24805.71 rows=1 width=0) (actual time=4460.915..4460.918 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=2806
-> Index Only Scan using cars_type_active_index on public.cars (cost=0.42..23304.23 rows=600590 width=0) (actual time=0.051..2257.832 rows=600000 loops=1)
Output: type, active
Index Cond: ((cars.type = 1) AND (cars.active = true))
Filter: cars.active
Heap Fetches: 0
Buffers: shared hit=2806
Planning time: 0.213 ms
Execution time: 4461.002 ms
(11 rows)
Look at the explain result: it used an Index Only Scan. With an index-only scan, depending on the visibility map, PostgreSQL sometimes needs to fetch the table heap to check tuple visibility. But I already ran VACUUM ANALYZE, and you can see Heap Fetches: 0, so reading the index alone is enough to answer this query.
The index is quite small and fits entirely in the buffer cache (Buffers: shared hit=2806), so PostgreSQL does not need to fetch any pages from disk.
Given that, I can't understand why PostgreSQL takes so long (4.5 s) to answer the query: 1M records is not a big number, everything is already cached in memory, and the index entries are all visible, so it never needs to touch the heap.
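As a side note, the visibility-map coverage behind Heap Fetches: 0 can be inspected directly; a hedged example (not part of the original test):
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'cars';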
PostgreSQL 9.5.10 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
I tested it on Docker 17.09.1-ce on a MacBook Pro 2015.
I am still new to PostgreSQL and trying to map my knowledge to real cases.
Thanks so much,
It seems I found the reason: it is not a PostgreSQL problem, it is because of running in Docker. When I run directly on my Mac, the time is around 100 ms, which is fast enough.
Another thing I figured out is why PostgreSQL still uses a seq scan instead of an index-only scan (which is why I had to disable seq_scan and bitmapscan in my test):
The table is not that big compared to the index. The more columns I add to the table, or the longer the columns get, the bigger the table becomes relative to the index and the more likely the index is to be used.
random_page_cost defaults to 4; my disk is quite fast, so I can set it to somewhere between 1 and 2, which helps the planner estimate costs more accurately.
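A hedged illustration of that last point, set for the current session only (the value 1.5 is just an example):
SET random_page_cost = 1.5;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM cars
WHERE type = 1 AND active = true;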
I have a big report table, and the Bitmap Heap Scan step takes more than 5 seconds.
Is there something I can do? If I add columns to the table, will reindexing the index it uses help?
I do union and sum on the data, so I don't return 500K records to the client.
I use postgres 9.1.
Here the explain:
Bitmap Heap Scan on foo_table (cost=24747.45..1339408.81 rows=473986 width=116) (actual time=422.210..5918.037 rows=495747 loops=1)
Recheck Cond: ((foo_id = 72) AND (date >= '2013-04-04 00:00:00'::timestamp without time zone) AND (date <= '2013-05-05 00:00:00'::timestamp without time zone))
Filter: ((foo)::text = 'foooooo'::text)
-> Bitmap Index Scan on foo_table_idx (cost=0.00..24628.96 rows=573023 width=0) (actual time=341.269..341.269 rows=723918 loops=1)
Query:
explain analyze
SELECT CAST(date as date) AS date, foo_id, ....
from foo_table
where foo_id = 72
and date >= '2013-04-04'
and date <= '2013-05-05'
and foo = 'foooooo'
Index def:
Index "public.foo_table_idx"
Column | Type
-------------+-----------------------------
foo_id | bigint
date | timestamp without time zone
btree, for table "public.external_channel_report"
Table:
foo is a text field with 4 distinct values.
foo_id is a bigint with currently 10K distinct values.
Create a composite index on (foo_id, foo, date) (in this order).
Note that if you select 500k records (and return them all to the client), this may take long.
Are you sure you need all 500k records on the client (rather than some kind of an aggregate or a LIMIT)?
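A hedged sketch of that index (the name is illustrative):
CREATE INDEX foo_table_foo_id_foo_date_idx ON foo_table (foo_id, foo, date);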
Answer to comment
Do I need the WHERE columns in the same order as the index?
The order of expressions in the WHERE clause is completely irrelevant, SQL is not a procedural language.
Fix mistakes
The timestamp column should not be named "date", for several reasons. Obviously, it's a timestamp, not a date. More importantly, date is a reserved word in all SQL standards and a type and function name in Postgres, and shouldn't be used as an identifier.
You should provide proper information with your question, including a complete table definition and conclusive information about existing indexes. It might be a good idea to start by reading the chapter about indexes in the manual.
The WHERE conditions on the timestamp are most probably incorrect:
and date >= '2013-04-04'
and date <= '2013-05-05'
The upper border for a timestamp column should probably be excluded:
and date >= '2013-04-04'
and date < '2013-05-05'
Index
With the multicolumn index @Quassnoi provided, your query will be much faster, since all qualifying rows can be read from one continuous data block of the index. No row is read in vain (and later disqualified), like you have it now.
But 500k rows will still take some time. Normally you have to verify visibility and fetch additional columns from the table. An index-only scan might be an option in Postgres 9.2+.
The order of columns is best this way, because the rule of thumb is: columns for equality first — then for ranges. More explanation and links in this related answer on dba.SE.
CLUSTER / pg_repack
You could further speed things up by streamlining the physical order of rows in the table according to this index, so that a minimum of blocks has to be read from the table - if you don't have other requirements that stand against it!
If you can afford to lock the table exclusively for a few seconds (at off hours, for instance), rewrite it and order the rows according to the index:
CLUSTER foo_table USING idx_myindex_idx;
If concurrent use is a problem, consider pg_repack, which can do the same without exclusive lock.
The effect: fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect deteriorating over time, if you have writes on the table. So you would rerun it from time to time.
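As a hedged note on the rerun: once CLUSTER ... USING has been executed, the chosen index is remembered for the table, so a periodic re-run can be as simple as:
CLUSTER foo_table;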
I copied and adapted the last chapter from this related answer on dba.SE.
I've got a table with around 20 million rows. For argument's sake, let's say there are two columns in the table - an id and a timestamp. I'm trying to get a count of the number of items per day. Here's what I have at the moment:
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Without any indices, this takes about 30 s to run on my machine. Here's the explain analyze output:
GroupAggregate (cost=675462.78..676813.42 rows=46532 width=8) (actual time=24467.404..32417.643 rows=346 loops=1)
-> Sort (cost=675462.78..675680.34 rows=87021 width=8) (actual time=24466.730..29071.438 rows=17321121 loops=1)
Sort Key: (date("timestamp"))
Sort Method: external merge Disk: 372496kB
-> Seq Scan on actions (cost=0.00..667133.11 rows=87021 width=8) (actual time=1.981..12368.186 rows=17321121 loops=1)
Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 32447.762 ms
Since I'm seeing a sequential scan, I tried indexing the date expression:
CREATE INDEX ON actions (DATE(timestamp));
Which cuts the runtime by about 50%:
HashAggregate (cost=796710.64..796716.19 rows=370 width=8) (actual time=17038.503..17038.590 rows=346 loops=1)
-> Seq Scan on actions (cost=0.00..710202.27 rows=17301674 width=8) (actual time=1.745..12080.877 rows=17321121 loops=1)
Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 17038.663 ms
I'm new to this whole query-optimization business, and I have no idea what to do next. Any clues how I could get this query running faster?
--edit--
It looks like I'm hitting the limits of indices. This is pretty much the only query that gets run on this table (though the values of the dates change). Is there a way to partition up the table? Or create a cache table with all the count values? Or any other options?
Is there a way to partition up the table?
Yes:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
Or create a cache table with all the count values? Or any other options?
Create a "cache" table certainly is possible. But this depends on how often you need that result and how accurate it needs to be.
CREATE TABLE action_report
AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Then a SELECT * FROM action_report will give you what you want in a timely manner. You would then schedule a cron job to recreate that table on a regular basis.
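A hedged sketch of what that scheduled recreation could look like (the exact job is an assumption, not part of the original answer):
-- drop and rebuild the report table in one transaction
BEGIN;
DROP TABLE IF EXISTS action_report;
CREATE TABLE action_report
AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
  AND DATE(timestamp) < '20110101'
GROUP BY day;
COMMIT;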
This approach of course won't help if the time range changes with every query or if that query is only run once a day.
In general most databases will ignore indexes if the expected number of rows returned is going to be high. This is because for each index hit, it will need to then find the row as well, so it's faster to just do a full table scan. This number is between 10,000 and 100,000. You can experiment with this by shrinking the date range and seeing where postgres flips to using the index. In this case, postgres is planning to scan 17,301,674 rows, so your table is pretty large. If you make it really small and you still feel like postgres is making the wrong choice then try running an analyze on the table so that postgres gets its approximations right.
It looks like the range just about covers all the data available.
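A hedged version of that experiment (the narrower one-week range is arbitrary):
ANALYZE actions;
EXPLAIN
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
  AND DATE(timestamp) < '20100108'
GROUP BY day;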
This could be a design issue. If you will be running this often, you are better off creating an additional column timestamp_date that contains only the date. Then create an index on that column and change the query accordingly. The column should be maintained by insert and update triggers; a sketch follows the query below.
SELECT timestamp_date AS day, COUNT(*)
FROM actions
WHERE timestamp_date >= '20100101'
AND timestamp_date < '20110101'
GROUP BY day;
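A hedged sketch of the extra column, its index, and a maintaining trigger (all names are illustrative, not from the original answer):
ALTER TABLE actions ADD COLUMN timestamp_date date;
UPDATE actions SET timestamp_date = "timestamp"::date;
CREATE INDEX actions_timestamp_date_idx ON actions (timestamp_date);

CREATE OR REPLACE FUNCTION actions_set_timestamp_date() RETURNS trigger AS $$
BEGIN
    -- keep the date column in sync with the timestamp column
    NEW.timestamp_date := NEW."timestamp"::date;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER actions_timestamp_date_trg
    BEFORE INSERT OR UPDATE ON actions
    FOR EACH ROW EXECUTE PROCEDURE actions_set_timestamp_date();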
If I am wrong about the number of rows the date range will find (and it is only a small subset), then you can try an index on just the timestamp column itself, applying the WHERE clause to just the column (which given the range works just as well)
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE timestamp >= '20100101'
AND timestamp < '20110101'
GROUP BY day;
Try running explain analyze verbose ... to see if the aggregate is using a temp file. Perhaps increase work_mem to allow more to be done in memory?
Set work_mem to say 2GB and see if that changes the plan. If it doesn't, you might be out of options.
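A hedged way to test that suggestion for the current session only:
SET work_mem = '2GB';
EXPLAIN ANALYZE VERBOSE
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
  AND DATE(timestamp) < '20110101'
GROUP BY day;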
What you really want for such DSS-type queries is a date table that describes days. In database design lingo it's called a date dimension. To populate such a table you can use the code I posted in this article: http://www.mockbites.com/articles/tech/data_mart_temporal
Then in each row in your actions table put the appropriate date_key.
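If the linked article is unavailable, a minimal hedged sketch of such a table using generate_series (column names are assumptions that match the query below):
CREATE TABLE date_dimension AS
SELECT to_char(d, 'YYYYMMDD')::int AS date_key,
       d::date AS full_date
FROM generate_series('2010-01-01'::date, '2011-12-31'::date, interval '1 day') AS g(d);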
Your query then becomes:
SELECT
d.full_date, COUNT(*)
FROM actions a
JOIN date_dimension d
ON a.date_key = d.date_key
WHERE d.full_date = '2010/01/01'
GROUP BY d.full_date
Assuming indices on the keys and full_date, this will be super fast because it operates on INT4 keys!
Another benefit is that you can now slice and dice by any other date_dimension column(s).