How to efficiently delete rows from a PostgreSQL 8.1 table?

I'm working on a PostgreSQL 8.1 SQL script which needs to delete a large number of rows from a table.
Let's say the table I need to delete from is Employees (~260K rows).
It has primary key named id.
The rows I need to delete from this table are stored in a separate temporary table called EmployeesToDelete (~10K records) with a foreign key reference to Employees.id called employee_id.
Is there an efficient way to do this?
At first, I thought of the following:
DELETE
FROM Employees
WHERE id IN
(
SELECT employee_id
FROM EmployeesToDelete
)
But I heard that using the "IN" clause and subqueries can be inefficient, especially with larger tables.
I've looked at the PostgreSQL 8.1 documentation, and there's mention of
DELETE FROM ... USING but it doesn't have examples so I'm not sure how to use it.
I'm wondering if the following works and is more efficient?
DELETE
FROM Employees
USING Employees e
INNER JOIN
EmployeesToDelete ed
ON e.id = ed.employee_id
Your comments are greatly appreciated.
Edit:
I ran EXPLAIN ANALYZE and the weird thing is that the first DELETE ran pretty quickly (within seconds), while the second DELETE took so long (over 20 min) I eventually cancelled it.
Adding an index to the temp table helped the performance quite a bit.
Here's a query plan of the first DELETE for anyone interested:
Hash Join (cost=184.64..7854.69 rows=256482 width=6) (actual time=54.089..660.788 rows=27295 loops=1)
Hash Cond: ("outer".id = "inner".employee_id)
-> Seq Scan on Employees (cost=0.00..3822.82 rows=256482 width=10) (actual time=15.218..351.978 rows=256482 loops=1)
-> Hash (cost=184.14..184.14 rows=200 width=4) (actual time=38.807..38.807 rows=10731 loops=1)
-> HashAggregate (cost=182.14..184.14 rows=200 width=4) (actual time=19.801..28.773 rows=10731 loops=1)
-> Seq Scan on EmployeesToDelete (cost=0.00..155.31 rows=10731 width=4) (actual time=0.005..9.062 rows=10731 loops=1)
Total runtime: 935.316 ms
(7 rows)
At this point, I'll stick with the first DELETE unless I can find a better way of writing it.

Don't guess, measure. Try the various methods and see which one is the shortest to execute. Also, use EXPLAIN to know what PostgreSQL will do and see where you can optimize. Very few PostgreSQL users are able to guess correctly the fastest query...
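One way to act on this advice without losing data is to wrap each candidate in a transaction and roll it back. EXPLAIN ANALYZE really executes the statement, so the ROLLBACK is what preserves the rows. A minimal sketch using the table names from the question; timings on a second run may be flattered by caching, so it can be worth repeating each test:

```sql
BEGIN;
-- EXPLAIN ANALYZE runs the DELETE for real and reports actual timings.
EXPLAIN ANALYZE
DELETE FROM Employees
WHERE id IN (SELECT employee_id FROM EmployeesToDelete);
-- Undo the deletion so the next candidate starts from the same data.
ROLLBACK;

BEGIN;
EXPLAIN ANALYZE
DELETE FROM Employees
USING EmployeesToDelete
WHERE Employees.id = EmployeesToDelete.employee_id;
ROLLBACK;
```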

I'm wondering if the following works and is more efficient?
DELETE
FROM Employees
-- target table alias omitted: DELETE accepts one only from PostgreSQL 8.2 on
USING EmployeesToDelete ed
WHERE Employees.id = ed.employee_id;
This totally depends on your index selectivity.
PostgreSQL tends to employ MERGE IN JOIN for IN predicates, which has stable execution time.
It's not affected by how many rows satisfy this condition, provided that you already have an ordered resultset.
An ordered resultset requires either a sort operation or an index. Full index traversal is very inefficient in PostgreSQL compared to SEQ SCAN.
The JOIN predicate, on the other hand, may benefit from using NESTED LOOPS if your index is very selective, and from using HASH JOIN if it is not selective.
PostgreSQL should select the right one by estimating the row count.
Since you have 30k rows against 260K rows, I expect HASH JOIN to be more efficient, and you should try to build a plan on a DELETE ... USING query.
To make sure, please post execution plan for both queries.
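Since the planner picks between these strategies from its row estimates, it may help to give it an index and statistics on the temporary table first; autovacuum does not process temporary tables, so ANALYZE has to be run by hand. A sketch using the question's table names (the index name is made up for illustration):

```sql
-- Hypothetical index name; the column is the one the join uses.
CREATE INDEX employees_to_delete_employee_id_idx
    ON EmployeesToDelete (employee_id);

-- Collect column statistics so the planner can estimate the join size.
ANALYZE EmployeesToDelete;

DELETE FROM Employees
USING EmployeesToDelete
WHERE Employees.id = EmployeesToDelete.employee_id;
```

Employees is deliberately not repeated inside USING: naming the target table there a second time sets up a self-join rather than a link to the rows being deleted.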

I'm not sure about the DELETE FROM ... USING syntax, but generally, a subquery should logically be the same thing as an INNER JOIN anyway. The database query optimizer should be capable (and this is just a guess) of executing the same query plan for both.

Why can't you delete the rows in the first place instead of adding them to the EmployeesToDelete table?
Or if you need to undo, just add a "deleted" flag to Employees, so you can reverse the deletion, or make it permanent, all in one table?

Related

Can't improve SQL join speed with indexes

I'm totally new to SQL and I am trying to speed up join queries for very large data. I started adding indexes (but to be honest, I don't have a deep understanding of them) and not seeing much change, I decided to benchmark on a more simple, simulated example. I'm using the psql interface of PostgreSQL 11.5 on MacOS 10.14.6. The data server is hosted locally on my computer. I apologize for any lack of relevant information, first time posting about SQL.
Databases' Structures
I created two initially identical databases, db and db_idx. I never put any index or key on tables in db, while I try putting indexes and keys on tables in db_idx. I then run simple join queries within db and db_idx separately and I compare the performance. Specifically, db_idx is made of two tables:
A client table with 100,000 rows and the following structure:
Table "public.client"
Column | Type | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
client_id | integer | | not null |
client_name | text | | |
Indexes:
"pkey_c" PRIMARY KEY, btree (client_id)
A client_additional table with 70,000 rows and the following structure:
Table "public.client_additional"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
client_id | integer | | not null |
client_age | integer | | |
Indexes:
"pkey_ca" PRIMARY KEY, btree (client_id)
"cov_idx" btree (client_id, client_age)
The client_id column in the client_additional table contains a subset of client's client_id values. Note the primary keys, and the other index I created on client_additional. I thought these would increase the benchmark query speed (see below) but it did not.
Importantly the db database is exactly the same (same structure, same values) except that it has no index or key.
Side note: the client and client_additional table should perhaps be a single table, since they give information at exactly the same level (client level). However, the database I'm using in real life came structured this way: some tables are split into several tables by "topic" although they give information at the same level. I don't know if that matters for my issue.
Benchmark Query
I'm using the following query, which mimics a lot what I need to do with real data:
SELECT
client_additional.client_id,
client_additional.client_age,
client.client_name
FROM client
INNER JOIN client_additional
ON client.client_id = client_additional.client_id;
Benchmark Results
On both databases, the benchmark query takes about 630 ms. Removing the keys and/or indexes in db_idx does not change anything. These benchmark results carry over to larger data sizes: speed is identical in the indexed and non-indexed cases.
That's where I am. How do I explain these results? Can I improve the join speed and how?
Use the EXPLAIN verb to see how the SQL engine intends to resolve the query. (Different SQL engines present this in different ways.) You can conclusively see whether the index will be used.
Also, you'll first need to load the tables with a lot of test data, because EXPLAIN will tell you what the SQL engine intends to do right now, and this decision is based in part on the size of the table and various other statistics. If the table is virtually empty, the SQL engine might decide that the index wouldn't be helpful now.
SQL engines use all kinds of very clever tricks to optimize performance, so it's actually rather difficult to get a useful timing test. But, if EXPLAIN tells you that the index is being used, that's pretty much the answer that you're looking for.
Setting up a small test DB, adding some rows and running your query:
CREATE TABLE client
(
client_id integer PRIMARY KEY,
client_name text
);
CREATE TABLE client_additional
(
client_id integer PRIMARY KEY,
client_age integer
);
INSERT INTO client (client_id, client_name) VALUES (generate_series(1,100000),'Phil');
INSERT INTO client_additional (client_id, client_age) VALUES (generate_series(1,70000),21);
ANALYZE;
EXPLAIN ANALYZE SELECT
client_additional.client_id,
client_additional.client_age,
client.client_name
FROM
client
INNER JOIN
client_additional
ON
client.client_id = client_additional.client_id;
gave me this plan:
Hash Join (cost=1885.00..3590.51 rows=70000 width=11) (actual time=158.958..441.222 rows=70000 loops=1)
Hash Cond: (client.client_id = client_additional.client_id)
-> Seq Scan on client (cost=0.00..1443.00 rows=100000 width=7) (actual time=0.019..100.318 rows=100000 loops=1)
-> Hash (cost=1010.00..1010.00 rows=70000 width=8) (actual time=158.785..158.786 rows=70000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 3759kB
-> Seq Scan on client_additional (cost=0.00..1010.00 rows=70000 width=8) (actual time=0.016..76.507 rows=70000 loops=1)
Planning Time: 0.357 ms
Execution Time: 506.739 ms
What you can see from this is that both tables were sequentially scanned, a hash table was built from client_additional, and the rows of client were probed against it in a hash join. Postgres determined this was the optimal way to execute this query.
If you were to recreate the tables without the Primary Key (and therefore remove the implicit index on the PK column of each), you get exactly the same plan, as Postgres has determined that the quickest way to execute this query is to ignore the indexes, hash one table's values and do a hash join against the other.
After changing the number of rows in the client table like so:
TRUNCATE Client;
INSERT INTO client (client_id, client_name) VALUES (generate_series(1,200000),'phil');
ANALYZE;
Then I re-ran the same query and I see this plan instead:
Merge Join (cost=1.04..5388.45 rows=70000 width=13) (actual time=0.050..415.503 rows=70000 loops=1)
Merge Cond: (client.client_id = client_additional.client_id)
-> Index Scan using client_pkey on client (cost=0.42..6289.42 rows=200000 width=9) (actual time=0.022..86.897 rows=70001 loops=1)
-> Index Scan using client_additional_pkey on client_additional (cost=0.29..2139.29 rows=70000 width=8) (actual time=0.016..86.818 rows=70000 loops=1)
Planning Time: 0.517 ms
Execution Time: 484.264 ms
Here you can see that index scans were done, as Postgres has determined that this plan is a better one based on the current number of rows in the tables.
The point is that Postgres will use the indexes when it feels they will produce a faster result, but the thresholds before they are used are somewhat higher than you may have expected.
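To see what the planner is choosing between, one option is to switch off hash joins for the current session and compare the resulting plan and timing. This is a diagnostic sketch only, using the tables created above; enable_hashjoin is a standard planner setting, but turning it off is not meant as a permanent fix:

```sql
-- Session-local: discourage hash joins so another join strategy is chosen.
SET enable_hashjoin = off;

EXPLAIN ANALYZE
SELECT
    client_additional.client_id,
    client_additional.client_age,
    client.client_name
FROM client
INNER JOIN client_additional
    ON client.client_id = client_additional.client_id;

-- Restore the default planner behaviour for the rest of the session.
RESET enable_hashjoin;
```

If the alternative plan is no faster than the hash join, the planner's original choice was reasonable and the indexes simply have nothing to offer this query.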
All best,
Phil
You have a primary key on the two tables which will be used for the joins. If you want to really see the queries slow down, remove the primary keys.
What is happening? Well, my guess is that the execution plans are the same with or without the secondary indexes. You would need to look at the plans themselves.
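Postgres gets less benefit from covering indexes than many other databases, because row visibility information is stored only in the table's data pages. An index-only scan (available since 9.2) can skip the data pages only where the visibility map marks them all-visible; otherwise the data pages still need to be accessed.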

Postgresql select count query takes long time

I have a table named events in my Postgresql 9.5 database. And this table has about 6 million records.
I am running a select count(event_id) from events query, but it takes 40 seconds, which is a very long time for a database. The event_id field of my table is the primary key and is indexed. Why does this take so long? (The server is an Ubuntu VM on VMware with 4 CPUs.)
Explain:
"Aggregate (cost=826305.19..826305.20 rows=1 width=0) (actual time=24739.306..24739.306 rows=1 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
" -> Seq Scan on event_source (cost=0.00..812594.55 rows=5484255 width=0) (actual time=0.014..24087.050 rows=6320689 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
"Planning time: 0.369 ms"
"Execution time: 24739.364 ms"
I know that this is an old question and the existing answer covers the vast majority of info around this, but I just ran into a situation where a table of 1.3 million rows was taking about 35 seconds to perform a simple SELECT COUNT(*). None of the other solutions helped. The issue ended up being that the table was just bloated and hadn't been vacuumed, so Postgres couldn't figure out the most optimal way to query the data. After I ran this, the query time dropped down to about 25ms!
VACUUM (ANALYZE, VERBOSE, FULL) my_table_name;
Hope this helps someone else!
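Before reaching for VACUUM FULL, which rewrites the table under an exclusive lock, it may be worth checking whether dead rows are really the problem. A sketch against the standard statistics views, with my_table_name standing in for the real table:

```sql
-- Live vs. dead row counts and the last time the table was vacuumed/analyzed.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'my_table_name';

-- On-disk size including indexes and TOAST, for comparison before and after.
SELECT pg_size_pretty(pg_total_relation_size('my_table_name'));
```

A large n_dead_tup relative to n_live_tup, or a size far out of proportion to the row count, would be consistent with the bloat described here.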
There are multiple factors playing a big role in the decision for PostgreSQL how to execute the count(), but first of all, the column you use inside the count function does not matter. In fact, if you don't need DISTINCT count, stick with count(*).
You can try the following to force an index-only scan:
SELECT count(*) FROM (SELECT event_id FROM events) t;
...if that still results in a sequential scan, then most likely the index is not much smaller than the table itself. To still see how an index-only scan would perform, you can enforce it with:
SELECT count(*) FROM (SELECT event_id FROM events ORDER BY 1) t;
If that is not much faster, you should also consider upgrading PostgreSQL to at least version 9.6, which introduces parallel sequential scans to speed up these things.
In addition, you can achieve dramatic speedups choosing from a variety of techniques to provide counts which largely depend on your use-case and your requirements:
Faster PostgreSQL Counting
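If an approximate number is acceptable, one commonly used technique is to read the planner's stored row estimate rather than counting. A sketch, assuming the table is named events and is visible on the search path; the figure is only as fresh as the last VACUUM or ANALYZE:

```sql
-- reltuples is an estimate maintained by VACUUM / ANALYZE, not an exact count.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'events'::regclass;
```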
Last but not least, please always provide the output of an extended explain as @a_horse_with_no_name already recommended, e.g.:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(event_id) FROM events;

Why does this simple query not use the index in postgres?

In my postgreSQL database I have a table named "product". In this table I have a column named "date_touched" with type timestamp. I created a simple btree index on this column. This is the schema of my table (I omitted irrelevant column & index definitions):
Table "public.product"
Column | Type | Modifiers
---------------------------+--------------------------+-------------------
id | integer | not null default nextval('product_id_seq'::regclass)
date_touched | timestamp with time zone | not null
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_date_touched_59b16cfb121e9f06_uniq" btree (date_touched)
The table has ~300,000 rows and I want to get the n-th element from the table ordered by "date_touched". when I want to get the 1000th element, it takes 0.2s, but when I want to get the 100,000th element, it takes about 6s. My question is, why does it take too much time to retrieve the 100,000th element, although I've defined a btree index?
Here is my query with explain analyze that shows postgreSQL does not use the btree index and instead sorts all rows to find the 100,000th element:
first query (1000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Limit (cost=3035.26..3038.29 rows=1 width=12) (actual time=160.208..160.209 rows=1 loops=1)
-> Index Scan using product_date_touched_59b16cfb121e9f06_uniq on product (cost=0.42..1000880.59 rows=329797 width=12) (actual time=16.651..159.766 rows=1001 loops=1)
Total runtime: 160.395 ms
second query (100,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (cost=106392.87..106392.88 rows=1 width=12) (actual time=6621.947..6621.950 rows=1 loops=1)
-> Sort (cost=106142.87..106967.37 rows=329797 width=12) (actual time=6381.174..6568.802 rows=100001 loops=1)
Sort Key: date_touched
Sort Method: external merge Disk: 8376kB
-> Seq Scan on product (cost=0.00..64637.97 rows=329797 width=12) (actual time=1.357..4184.115 rows=329613 loops=1)
Total runtime: 6629.903 ms
It is a very good thing, that SeqScan is used here. Your OFFSET 100000 is not a good thing for the IndexScan.
A bit of theory
Btree indexes contain 2 structures inside:
balanced tree and
double-linked list of keys.
First structure allows for fast keys lookups, second is responsible for the ordering. For bigger tables, linked list cannot fit into a single page and therefore it is a list of linked pages, where each page's entries maintain ordering, specified during index creation.
It is wrong to think, though, that such pages are sitting together on the disk. In fact, it is more probable that those are spread across different locations. And in order to read pages based on the index's order, system has to perform random disk reads. Random disk IO is expensive, compared to sequential access. Therefore good optimizer will prefer a SeqScan instead.
I highly recommend “SQL Performance Explained” book to better understand indexes. It is also available on-line.
What is going on?
Your OFFSET clause would cause the database to read the index's linked list of keys (causing lots of random disk reads) and then discard all those results until it hits the wanted offset. It is good, in fact, that Postgres decided to use SeqScan + Sort here; this should be faster.
You can check this assumption by:
running EXPLAIN (analyze, buffers) of your big-OFFSET query
then run SET enable_seqscan TO 'off';
and run EXPLAIN (analyze, buffers) again, comparing the results.
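Spelled out against the query from the question, the check might look like the sketch below. The setting only affects the current session, and the RESET at the end is an addition so that later queries are planned normally again:

```sql
-- 1. Baseline: the plan the optimizer prefers (Seq Scan + Sort).
EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;

-- 2. Discourage sequential scans for this session.
SET enable_seqscan TO off;

-- 3. Same query again; compare runtime and buffer counts with step 1.
EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;

-- 4. Back to the default planner behaviour.
RESET enable_seqscan;
```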
In general, it is better to avoid OFFSET, as DBMSes do not always pick the right approach here. (BTW, which version of PostgreSQL are you using?)
Here's a comparison of how it performs for different offset values.
EDIT: In order to avoid OFFSET one would have to base pagination on the real data, that exists in the table and is a part of the index. For this particular case, the following might be possible:
show first N (say, 20) elements
include the maximal date_touched shown on the page in all the “Next” links. You can compute this value on the application side. Do the same for the “Previous” links, except include the minimal date_touched for these.
on the server side you will get the limiting value. Therefore, say for the “Next” case, you can do a query like this:
SELECT id
FROM product
WHERE date_touched > $max_date_seen_on_the_page
ORDER BY date_touched ASC
LIMIT 20;
This query makes best use of the index.
Of course, you can adjust this example to your needs. I used pagination as it is a typical case for the OFFSET.
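If date_touched values can repeat, a strict > comparison may skip rows that share the boundary timestamp. One common refinement, offered here as a sketch rather than as part of the original suggestion, is to add id as a tie-breaker and compare both columns as a row. The index name, the prepared statement name and the sample values are hypothetical:

```sql
-- Two-column index so the row comparison and the ORDER BY can both use it.
CREATE INDEX product_date_touched_id_idx ON product (date_touched, id);

-- $1 / $2 are the date_touched and id of the last row shown on the page.
PREPARE next_page(timestamptz, integer) AS
SELECT id, date_touched
FROM product
WHERE (date_touched, id) > ($1, $2)
ORDER BY date_touched ASC, id ASC
LIMIT 20;

-- Example call with made-up boundary values.
EXECUTE next_page('2015-06-01 12:00:00+00', 4711);
```

The application then has to carry both values of the last row in the "Next" link instead of the timestamp alone.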
One more note: querying 1 row many times, increasing the offset by 1 for each query, will be much more time-consuming than doing a single batch query that returns all those records, which are then iterated over on the application side.

PostgreSQL does not use a partial index

I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to filter out those values that were not null, but that did not help (with or without text_pattern_ops. I do not need text_pattern_ops considering no LIKE conditions are expressed in my queries, but they also match equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequential scans using set enable_seqscan = off; makes no difference: the planner still picks the seqscan over an index scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequence scan over the index scan?
When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked up on my local database that houses about 10% of the data that is available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilà.
Per documentation:
However, keep in mind that the predicate must match the conditions
used in the queries that are supposed to benefit from the index. To be
precise, a partial index can be used in a query only if the system can
recognize that the WHERE condition of the query mathematically implies
the predicate of the index. PostgreSQL does not have a sophisticated
theorem prover that can recognize mathematically equivalent
expressions that are written in different forms. (Not only is such a
general theorem prover extremely difficult to create, it would
probably be too slow to be of any real use.) The system can recognize
simple inequality implications, for example "x < 1" implies "x < 2";
otherwise the predicate condition must exactly match part of the
query's WHERE condition or the index will not be recognized as usable.
Matching takes place at query planning time, not at run time. As a
result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
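A small sketch of that pattern with a prepared statement. The table my_table, its columns, the index name and the statement name are all hypothetical stand-ins; whether the redundant predicate is strictly needed may vary between versions, so the final EXPLAIN is there to check rather than to assume:

```sql
-- Scratch table and partial index, for illustration only.
CREATE TABLE my_table (
    id       serial PRIMARY KEY,
    text_col text
);

CREATE INDEX my_table_text_col_partial_idx
    ON my_table (text_col)
    WHERE text_col IS NOT NULL;

-- The parameter carries the value; the index predicate is repeated as a constant.
PREPARE find_by_text(text) AS
SELECT id, text_col
FROM my_table
WHERE text_col = $1
  AND text_col IS NOT NULL;

-- Shows which plan the prepared statement actually gets.
EXPLAIN EXECUTE find_by_text('my_value');
```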
An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper and the query planner will more readily choose such query plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. Collation mismatch could also cause harm.
Try the following:
Make a simplest possible btree index CREATE INDEX foo ON table (text_col)
ANALYZE table
Query
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt on the documentation.
Correlation
Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)
On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If the update has a large value for a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering are slowly moving apart with each update. To fix that I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col is not null
AND
text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor IS NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan, @Mike has observed that a correlation value close to 0 on his database still resulted in an index scan. Changing the fill factor and running a full vacuum helped, but I'm unsure why.

How to increase query speed without using full-text search?

This is my simple query; By searching selectnothing I'm sure I'll have no hits.
SELECT nome_t FROM myTable WHERE nome_t ILIKE '%selectnothing%';
This is the EXPLAIN ANALYZE VERBOSE
Seq Scan on myTable (cost=0.00..15259.04 rows=37 width=29) (actual time=2153.061..2153.061 rows=0 loops=1)
Output: nome_t
Filter: (nome_t ~~* '%selectnothing%'::text)
Total runtime: 2153.116 ms
myTable has around 350k rows and the table definition is something like:
CREATE TABLE myTable (
nome_t text NOT NULL
-- other columns omitted
);
I have an index on nome_t as stated below:
CREATE INDEX idx_m_nome_t ON myTable
USING btree (nome_t);
Although this is clearly a good candidate for Fulltext search I would like to rule that option out for now.
This query is meant to be run from a web application and currently it's taking around 2 seconds which is obviously too much;
Is there anything I can do, like using other index methods, to improve the speed of this query?
No. A pattern with a leading wildcard such as ILIKE '%selectnothing%' cannot use a B-tree index, so it always needs a full table scan. You need full text search; it's not that hard to implement.
Edit: You could use Wildspeed, I forgot about this option. The indexes will be huge, but your performance will also be much better.
Wildspeed extension provides GIN index
support for wildcard search for LIKE
operator.
http://www.sai.msu.su/~megera/wiki/wildspeed
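On PostgreSQL 9.1 or later, the pg_trgm contrib module is another option for this kind of search: its trigram operator classes let GIN or GiST indexes serve LIKE and ILIKE patterns with a leading wildcard. This is a different technique from Wildspeed, sketched here with the question's table and a made-up index name:

```sql
-- Requires the pg_trgm contrib module to be installed on the server.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index on the searched column (hypothetical index name).
CREATE INDEX idx_m_nome_t_trgm
    ON myTable
    USING gin (nome_t gin_trgm_ops);

-- Re-check the plan; the hope is a bitmap index scan instead of a Seq Scan.
EXPLAIN ANALYZE
SELECT nome_t
FROM myTable
WHERE nome_t ILIKE '%selectnothing%';
```

Trigram indexes tend to be considerably larger than a B-tree on the same column, and very short search patterns gain little from them.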
Another thing you can do is break this nome_t column out of myTable into its own table. Searching one column of a wide table is slow (if there are fifty other wide columns) because the other data slows down the scan of that column: fewer records fit per page/extent.