Very high cost for low LIMIT / high OFFSET - sql

I have a very large table with products. I need to select several products at a very high offset (example below). The PostgreSQL manual on indexes and performance suggests creating an index on the column used by ORDER BY plus any additional conditions. Everything is peachy and no sort is used, but for high offset values the LIMIT step is very costly. Does anyone have an idea what might be causing this?
The following query can run for minutes.
Indexes:
"product_slugs_pkey" PRIMARY KEY, btree (id)
"index_for_listing_by_default_active" btree (priority DESC, name, active)
"index_for_listing_by_name_active" btree (name, active)
"index_for_listing_by_price_active" btree (master_price, active)
"product_slugs_product_id" btree (product_id)
EXPLAIN SELECT * FROM "product_slugs" WHERE ("product_slugs"."active" = 1) ORDER BY product_slugs.name ASC LIMIT 10 OFFSET 14859;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Limit (cost=26571.55..26589.43 rows=10 width=1433)
-> Index Scan using index_for_listing_by_name_active on product_slugs (cost=0.00..290770.61 rows=162601 width=1433)
Index Cond: (active = 1)
(3 rows)

The index_for_listing_by_name_active index you have here isn't going to help much, since the products in the result set aren't necessarily going to be contiguous in the index. Try creating a conditional index by name on only those products which are active:
CREATE INDEX index_for_listing_active_by_name
ON product_slugs (name)
WHERE product_slugs.active = 1;
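After creating it, you can re-run the query under EXPLAIN to check whether the planner picks the new index; an OFFSET this large still has to step over the skipped index entries, just more cheaply than before:
EXPLAIN SELECT * FROM product_slugs
WHERE active = 1
ORDER BY name ASC
LIMIT 10 OFFSET 14859;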

SQL - can search performance depend on the number of columns?

I have something like the following table
CREATE TABLE mytable
(
id serial NOT NULL,
search_col int4 NULL,
a1 varchar NULL,
a2 varchar NULL,
...
a50 varchar NULL,
CONSTRAINT mytable_pkey PRIMARY KEY (id)
);
CREATE INDEX search_col_idx ON mytable USING btree (search_col);
This table has approximately 5 million rows and it takes about 15 seconds to perform a search operation like
select *
from mytable
where search_col = 83310
It is crucial for me to increase performance, but even clustering the table on search_col did not bring a major benefit.
However, I tried the following:
create table test as (select id, search_col, a1 from mytable);
A search on this table, which has the same number of rows as the original one, takes approximately 0.2 seconds. Why is that, and how can I use this for what I need?
Index Scan using search_col_idx on mytable (cost=0.43..2713.83 rows=10994 width=32802) (actual time=0.021..13.015 rows=12018 loops=1)
Seq Scan on test (cost=0.00..95729.46 rows=12347 width=19) (actual time=0.246..519.501 rows=12018 loops=1)
The result of DBeaver's Execution Plan
|Node Type|Entity|Cost|Rows|Time|Condition|
|Index Scan|mytable|0.43 - 3712.86|12018|13.141|(search_col = 83310)|
Execution Plan from psql:
Index Scan using mytable_search_col_idx on mytable (cost=0.43..3712.86 rows=15053 width=32032) (actual time=0.015..13.889 rows=12018 loops=1)
Index Cond: (search_col = 83310)
Planning time: 0.640 ms
Execution time: 23.910 ms
(4 rows)
One way that the columns would impact the timing would be if the columns were large. Really large.
In most cases, a row resides on a single data page. The index points to the page and the size of the row has little impact on the timing, because the timing is dominated by searching the index and fetching the row.
However, if the columns are really large, then that can require reading many more bytes from disk, which takes more time.
That said, another possibility is that the statistics are out-of-date and the index isn't being used on the first query.
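If the wide columns are the culprit, selecting only the columns you actually need (rather than SELECT *) can avoid fetching the large values stored out of line, and a manual ANALYZE rules out stale statistics. A minimal sketch, reusing the columns from the question:
-- fetch only the narrow columns that are needed
SELECT id, search_col, a1
FROM mytable
WHERE search_col = 83310;
-- refresh planner statistics in case they are out of date
ANALYZE mytable;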

Optimizing a row exclusion query

I am designing a mostly read-only database containing 300,000 documents with around 50,000 distinct tags, with each document having 15 tags on average. For now, the only query I care about is selecting all documents with no tag from a given set of tags. I'm only interested in the document_id column (no other columns in the result).
My schema is essentially:
CREATE TABLE documents (
document_id SERIAL PRIMARY KEY,
title TEXT
);
CREATE TABLE tags (
tag_id SERIAL PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE documents_tags (
document_id INTEGER REFERENCES documents,
tag_id INTEGER REFERENCES tags,
PRIMARY KEY (document_id, tag_id)
);
I can write this query in Python by pre-computing the set of documents for a given tag, which reduces the problem to a few fast set operations:
In [17]: %timeit all_docs - (tags_to_docs[12345] | tags_to_docs[7654])
100 loops, best of 3: 13.7 ms per loop
Translating the set operations to Postgres doesn't work that fast, however:
stuff=# SELECT document_id AS id FROM documents WHERE document_id NOT IN (
stuff(# SELECT documents_tags.document_id AS id FROM documents_tags
stuff(# WHERE documents_tags.tag_id IN (12345, 7654)
stuff(# );
document_id
---------------
...
Time: 201.476 ms
Replacing NOT IN with EXCEPT makes it even slower.
I have btree indexes on document_id and tag_id in all three tables and another one on (document_id, tag_id).
The default memory limits on Postgres' process have been increased significantly, so I don't think Postgres is misconfigured.
How do I speed up this query? Is there any way to pre-compute the mapping between tags and documents like I did in Python, or am I thinking about this the wrong way?
Here is the result of an EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT document_id AS id FROM documents
WHERE document_id NOT IN (
SELECT documents_tags.document_id AS id FROM documents_tags
WHERE documents_tags.tag_id IN (12345, 7654)
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on documents (cost=20280.27..38267.57 rows=83212 width=4) (actual time=176.760..300.214 rows=20036 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 146388
SubPlan 1
-> Bitmap Heap Scan on documents_tags (cost=5344.61..19661.00 rows=247711 width=4) (actual time=32.964..89.514 rows=235093 loops=1)
Recheck Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Heap Blocks: exact=3300
-> Bitmap Index Scan on documents_tags__tag_id_index (cost=0.00..5282.68 rows=247711 width=0) (actual time=32.320..32.320 rows=243230 loops=1)
Index Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Planning time: 0.117 ms
Execution time: 303.289 ms
(11 rows)
Time: 303.790 ms
The only settings I changed from the default configuration were:
shared_buffers = 5GB
temp_buffers = 128MB
work_mem = 512MB
effective_cache_size = 16GB
Running Postgres 9.4.5 on a server with 64GB RAM.
Optimize setup for read performance
Your memory settings seem reasonable for a 64GB server - except maybe work_mem = 512MB. That's high. Your queries are not particularly complex and your tables are not that big.
4.5 million rows (300k x 15) in the simple junction table documents_tags should occupy ~ 156 MB and the PK another 96 MB. For your query you typically don't need to read the whole table, just small parts of the index. For "mostly read-only" like you have, you should see index-only scans on the index of the PK exclusively. You don't need nearly as much work_mem - which may not matter much - except where you have many concurrent queries. Quoting the manual:
... several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value.
Setting work_mem too high may actually impair performance:
Increasing work_mem and shared_buffers on Postgres 9.2 significantly slows down queries
I suggest reducing work_mem to 128 MB or less to avoid possible memory starvation - unless you have other common queries that require more. You can always set it higher locally for special queries.
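A minimal sketch of raising it locally for a single transaction (the value is only an example):
BEGIN;
SET LOCAL work_mem = '512MB'; -- only affects this transaction
-- run the expensive query here
COMMIT;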
There are several other angles to optimize read performance:
Configuring PostgreSQL for read performance
Key problem: leading index column
All of this may help a little. But the key problem is this:
PRIMARY KEY (document_id, tag_id)
300k documents, 2 tags to exclude. Ideally, you have an index with tag_id as leading column and document_id as 2nd. With an index on just (tag_id) you can't get index-only scans. If this query is your only use case, change your PK as demonstrated below.
Or probably even better: you can create an additional plain index on (tag_id, document_id) if you need both - and drop the two other indexes on documents_tags on just (tag_id) and (document_id). They offer nothing over the two multicolumn indexes. The remaining 2 indexes (as opposed to 3 indexes before) are smaller and superior in every way. Rationale:
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
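A sketch of that alternative. The index on (tag_id) is named documents_tags__tag_id_index in your plan; the name of the one on (document_id) is a guess, so adjust it to yours:
CREATE INDEX documents_tags_tag_id_document_id_idx
ON documents_tags (tag_id, document_id);
DROP INDEX documents_tags__tag_id_index;      -- superseded by the new index
DROP INDEX documents_tags__document_id_index; -- guessed name; covered by the PK (document_id, tag_id)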
While you are at it, I suggest also CLUSTERing the table using the new PK, all in one transaction, possibly with some extra maintenance_work_mem set locally:
BEGIN;
SET LOCAL maintenance_work_mem = '256MB';
ALTER TABLE documents_tags
DROP CONSTRAINT documents_tags_pkey
, ADD PRIMARY KEY (tag_id, document_id); -- tag_id first.
CLUSTER documents_tags USING documents_tags_pkey;
COMMIT;
Don't forget to:
ANALYZE documents_tags;
Queries
The query itself is run-of-the-mill. Here are the 4 standard techniques:
Select rows which are not present in other table
NOT IN is - to quote myself:
Only good for small sets without NULL values
Your use case exactly: all involved columns NOT NULL and your list of excluded items is very short. Your original query is a hot contender.
NOT EXISTS and LEFT JOIN / IS NULL are always hot contenders. Both have been suggested in other answers. LEFT JOIN has to be an actual LEFT [OUTER] JOIN, though.
EXCEPT ALL would be shortest, but often not as fast.
1. NOT IN
SELECT document_id
FROM documents d
WHERE document_id NOT IN (
SELECT document_id -- no need for column alias, only value is relevant
FROM documents_tags
WHERE tag_id IN (12345, 7654)
);
2. NOT EXISTS
SELECT document_id
FROM documents d
WHERE NOT EXISTS (
SELECT 1
FROM documents_tags
WHERE document_id = d.document_id
AND tag_id IN (12345, 7654)
);
3. LEFT JOIN / IS NULL
SELECT d.document_id
FROM documents d
LEFT JOIN documents_tags dt ON dt.document_id = d.document_id
AND dt.tag_id IN (12345, 7654)
WHERE dt.document_id IS NULL;
4. EXCEPT ALL
SELECT document_id
FROM documents
EXCEPT ALL -- ALL, to keep duplicate rows and make it faster
SELECT document_id
FROM documents_tags
WHERE tag_id IN (12345, 7654);
Benchmark
I ran a quick benchmark on my old laptop with 4 GB RAM and Postgres 9.5.3 to put my theories to the test:
Test setup
SET random_page_cost = 1.1;
SET work_mem = '128MB';
CREATE SCHEMA temp;
SET search_path = temp, public;
CREATE TABLE documents (
document_id serial PRIMARY KEY,
title text
);
-- CREATE TABLE tags ( ... -- actually irrelevant for this query
CREATE TABLE documents_tags (
document_id integer REFERENCES documents,
tag_id integer -- REFERENCES tags -- irrelevant for test
-- no PK yet, to test seq scan
-- it's also faster to create the PK after filling the big table
);
INSERT INTO documents (title)
SELECT 'some dummy title ' || g
FROM generate_series(1, 300000) g;
INSERT INTO documents_tags(document_id, tag_id)
SELECT i.*
FROM documents d
CROSS JOIN LATERAL (
SELECT DISTINCT d.document_id, ceil(random() * 50000)::int
FROM generate_series (1,15)) i;
ALTER TABLE documents_tags ADD PRIMARY KEY (document_id, tag_id); -- your current index
ANALYZE documents_tags;
ANALYZE documents;
Note that rows in documents_tags are physically clustered by document_id due to the way I filled the table - which is likely your current situation as well.
Test
3 test runs with each of the 4 queries, best of 5 every time to exclude caching effects.
Test 1: With documents_tags_pkey like you have it. Index and physical order of rows are bad for our query.
Test 2: Recreate the PK on (tag_id, document_id) like suggested.
Test 3: CLUSTER on new PK.
Execution time of EXPLAIN ANALYZE in ms:
time in ms | Test 1 | Test 2 | Test 3
1. NOT IN | 654 | 70 | 71 -- winner!
2. NOT EXISTS | 684 | 103 | 97
3. LEFT JOIN | 685 | 98 | 99
4. EXCEPT ALL | 833 | 255 | 250
Conclusions
Key element is the right index with leading tag_id - for queries involving few tag_id and many document_id.
To be precise, it's not important that there are more distinct document_id than tag_id. This could be the other way round as well. Btree indexes basically perform the same with any order of columns. It's the fact that the most selective predicate in your query filters on tag_id. And that's faster on the leading index column(s).
The winning query for few tag_id to exclude is your original with NOT IN.
NOT EXISTS and LEFT JOIN / IS NULL result in the same query plan. For more than a couple of dozen IDs, I expect these to scale better (see the array variant sketched after these conclusions).
In a read-only situation you'll see index-only scans exclusively, so the physical order of rows in the table becomes irrelevant. Hence, test 3 did not bring any more improvements.
If writes to the table happen and autovacuum can't keep up, you'll see (bitmap) index scans. Physical clustering is important for those.
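For a longer exclusion list, as mentioned in the conclusions, one option is to pass the IDs as an array instead of spelling out the IN list; a sketch with placeholder values:
SELECT document_id
FROM documents d
WHERE NOT EXISTS (
SELECT 1
FROM documents_tags
WHERE document_id = d.document_id
AND tag_id = ANY ('{12345,7654}'::int[]) -- or a bound array parameter
);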
Use an outer join with the tag condition in the join clause, keeping only the rows with no match, so that only documents where none of the specified tags match are returned:
select d.document_id
from documents d
left join documents_tags t on t.document_id = d.document_id
and t.tag_id in (12345, 7654)
where t.document_id is null

Optimize PostgreSQL query with ORDER BY and limit 1

I have the following PostgreSQL schema:
CREATE TABLE "User" ( -- "user" is a reserved word, so it must be quoted
ID INTEGER PRIMARY KEY
);
CREATE TABLE BOX (
ID INTEGER PRIMARY KEY
);
CREATE SEQUENCE seq_item;
CREATE TABLE Item (
ID INTEGER PRIMARY KEY DEFAULT nextval('seq_item'),
SENDER INTEGER REFERENCES "User"(id),
RECEIVER INTEGER REFERENCES "User"(id),
INFO TEXT,
BOX_ID INTEGER REFERENCES Box(id) NOT NULL,
ARRIVAL TIMESTAMP
);
Its main use case is a typical producer/consumer scenario. Different users may insert an item into the database, in a particular box, for a particular user, and each user can retrieve the topmost (that is, the oldest) item in a box that is addressed to her/him. It more or less mimics the functionality of a queue at the database level.
More precisely, the most common operations are the following:
INSERT INTO ITEM(SENDER, RECEIVER, INFO, BOX_ID, ARRIVAL)
VALUES (nsid, nrid, ncontent, nqid, ntime);
And retrieve commands based on a combination of either RECEIVER+SENDER or RECEIVER+BOX_ID:
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.BOX_ID=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
and
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.SENDER=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
The last two snippets are packed within a stored procedure.
I was wondering how to achieve best performance given this use case and knowing that the users will insert and retrieve somewhere between 50,000 and 500,000 items (however, the database is never expected to contain more than 100,000 items at a given point)?
EDIT
This is the EXPLAIN I get for the SELECT statements with no indexes:
Limit (cost=23.07..23.07 rows=1 width=35)
-> Sort (cost=23.07..25.07 rows=799 width=35)
Sort Key: ARRIVAL
-> Seq Scan on Item i (cost=0.00..19.07 rows=799 width=35)
Filter: (((RECEIVER = 1) OR (RECEIVER IS NULL)) AND (SENDER = 1))
The best EXPLAIN I get, based on my understanding, is when I put an index on the time column (CREATE INDEX ind ON Item(ARRIVAL);):
Limit (cost=0.42..2.88 rows=1 width=35)
-> Index Scan using ti on Item i (cost=0.42..5899.42 rows=2397 width=35)
Filter: (((receiver = 2) OR (RECEIVER IS NULL)) AND (SENDER = 2))
In all of the cases without an index on ARRIVAL I have to sort the table, which seems inefficient to me. If I try to combine an index on ARRIVAL and RECEIVER/SENDER I get the same plan, but slightly slower.
Is it correct to assume that a single index on ARRIVAL is the most efficient option?
Regarding indexes, the best approach is to create the index, test your query, and analyze the EXPLAIN plan. Sometimes you create an index and the planner doesn't even use it. You will know once you test it.
Primary keys get an index by default; for foreign key columns you need to create the index yourself:
Postgres and Indexes on Foreign Keys and Primary Keys
You may also consider creating composite indexes using the fields in your WHERE clauses, for example as sketched below.
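For the two retrieval patterns above, composite indexes along these lines are one possible starting point (the index names are made up; verify with EXPLAIN which ones the planner actually uses):
-- serves the RECEIVER + BOX_ID retrieval, already ordered by ARRIVAL
CREATE INDEX item_box_arrival_idx ON Item (BOX_ID, ARRIVAL);
-- serves the RECEIVER + SENDER retrieval, already ordered by ARRIVAL
CREATE INDEX item_sender_arrival_idx ON Item (SENDER, ARRIVAL);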
Note that even though indexes can improve SELECTs, they also have an impact on INSERTs and UPDATEs, because every index has to be maintained on each write.
But again, you have to test each change and see whether it improves your results.

Right index for timestamp field on Postgresql

Here is the table:
CREATE TABLE material
(
mid bigserial NOT NULL,
...
active_from timestamp without time zone,
....
CONSTRAINT material_pkey PRIMARY KEY (mid),
)
CREATE INDEX i_test_t_year
ON material
USING btree
(date_part('year'::text, active_from));
If I sort by the mid field:
select mid from material order by mid desc
"Index Only Scan Backward using material_pkey on material (cost=0.29..3573.20 rows=100927 width=8)"
But if I use active_from for sorting:
select * from material order by active_from desc
"Sort (cost=12067.29..12319.61 rows=100927 width=16)"
" Sort Key: active_from"
" -> Seq Scan on material (cost=0.00..1953.27 rows=100927 width=16)"
Maybe the index for active_from is wrong? How do I make the right one to lower the cost?
The index on date_part('year'::text, active_from) can't be used to sort by active_from; you know that sorting by that function and then by active_from gives the same order as simply sorting by active_from, but postgresql doesn't. If you create the following index:
CREATE INDEX i_test_t_year ON material (active_from);
then postgresql will be able to use it to answer the query:
Index Scan Backward using i_test_t_year on material (cost=0.15..74.70 rows=1770 width=16)
However, remember that postgresql will only use the index if it thinks it will be faster than doing a sequential scan then sorting, so creating the correct index doesn't guarantee that it will be used for this query.
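A quick way to check is to re-run the query from the question under EXPLAIN (whether the backward index scan shows up depends on the planner's cost estimates for your data):
EXPLAIN SELECT * FROM material ORDER BY active_from DESC;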
You need to fully understand how an index works, and I believe this will answer your question.
An index is essentially a lookup structure stored on its own, alongside the table. You look at the index, and it points to the rows to fetch.
If you look at your index, what you are storing is the year extracted from the "active_from" column (the result of date_part), not the timestamp itself.
So if you were to look at the index, it would look like a bunch of entries saying:
2015
2015
2014
2014
2013
etc.
They are stored as extracted year values, not as the full timestamp.
In your query you are ordering it DESC as a timestamp value.
So it just doesn't match the index, as you have stored it.
If you put the ORDER BY in your query as "order by date_part('year'::text, active_from)", then it could use the index you created.
So I suggest you just add the index on "active_from" without parsing the date at all.

What is the correct way to optimize and/or index this query?

I've got a table pings with about 15 million rows in it. I'm on postgres 9.2.4. The relevant columns it has are a foreign key monitor_id, a created_at timestamp, and a response_time that's an integer that represents milliseconds. Here is the exact structure:
Column | Type | Modifiers
-----------------+-----------------------------+----------------------------------------------------
id | integer | not null default nextval('pings_id_seq'::regclass)
url | character varying(255) |
monitor_id | integer |
response_status | integer |
response_time | integer |
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
response_body | text |
Indexes:
"pings_pkey" PRIMARY KEY, btree (id)
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
"index_pings_on_monitor_id" btree (monitor_id)
I want to query for all the response times that are not NULL (90% won't be NULL, about 10% will be NULL), that have a specific monitor_id, and that were created in the last month. I'm doing the query with ActiveRecord, but the end result looks something like this:
SELECT "pings"."response_time"
FROM "pings"
WHERE "pings"."monitor_id" = 3
AND (created_at > '2014-03-03 20:23:07.254281'
AND response_time IS NOT NULL)
It's a pretty basic query, but it takes about 2000ms to run, which seems rather slow. I'm assuming an index would make it faster, but all the indexes I've tried aren't working, which I'm assuming means I'm not indexing properly.
When I run EXPLAIN ANALYZE, this is what I get:
Bitmap Heap Scan on pings (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
Recheck Cond: (monitor_id = 3)
Rows Removed by Index Recheck: 11643313
Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
Rows Removed by Filter: 324834
-> Bitmap Index Scan on index_pings_on_monitor_id (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
Index Cond: (monitor_id = 3)
So there is an index on monitor_id that is being used towards the end, but nothing else. I've tried various permutations and orders of compound indexes using monitor_id, created_at, and response_time. I've tried ordering the index by created_at in descending order. I've tried a partial index with response_time IS NOT NULL.
Nothing I've tried makes the query any faster. How would you optimize and/or index it?
Sequence of columns
Create a partial multicolumn index with the right sequence of columns. You have one:
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
But the sequence of columns is not serving you well. Reverse it:
CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;
The rule of thumb here is: equality first, ranges later. More about that:
Multicolumn index and performance
As discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could utilize this index including NULL values in response_time, drop it. Else, keep it.
You can probably also drop both other existing indexes. More about the sequence of columns in btree indexes:
Working of indexes in PostgreSQL
Covering index
If all you need from the table is response_time, this can be much faster yet - if you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):
CREATE INDEX idx_pings_monitor_created
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL; -- maybe
Or you can even try this:
More radical partial index
Create a tiny helper function. Effectively a "global constant" in your db:
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$; -- One month in the past
Use it as condition in your index:
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL -- maybe
AND created_at > f_ping_event_horizon();
And your query looks like this now:
SELECT response_time
FROM pings
WHERE monitor_id = 3
AND response_time IS NOT NULL
AND created_at > '2014-03-03 20:23:07.254281'
AND created_at > f_ping_event_horizon();
Aside: I trimmed some noise.
The last condition seems logically redundant. Only include it if Postgres does not understand that it can use the index without it; it might be necessary. The actual timestamp in the condition must be bigger than the one in the function, but that's obviously the case according to your comments.
This way we cut all the irrelevant rows and make the index much smaller. The effect degrades slowly over time. Refit the event horizon and recreate indexes from time to time to get rid of the added weight. You could do this with a weekly cron job, for example.
When updating (recreating) the function, you need to recreate all indexes that use the function in any way. Best in the same transaction. Because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions. So we have to lie about it. More about that:
Does PostgreSQL support "accent insensitive" collations?
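A sketch of that maintenance step, all in one transaction (the new cutoff date is only illustrative; the index name follows the example above):
BEGIN;
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-03 0:0'::timestamp$$; -- move the cutoff forward
DROP INDEX idx_pings_monitor_created_response_time;
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL
AND created_at > f_ping_event_horizon();
COMMIT;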
Why the function at all? This way, all the queries using the index can remain unchanged.
With all of these changes the query should be faster by orders of magnitude now. A single, continuous index-only scan is all that's needed. Can you confirm that?
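One way to confirm, a sketch reusing the query from above (look for an Index Only Scan node and few heap fetches in the output):
EXPLAIN (ANALYZE, BUFFERS)
SELECT response_time
FROM pings
WHERE monitor_id = 3
AND response_time IS NOT NULL
AND created_at > '2014-03-03 20:23:07.254281'
AND created_at > f_ping_event_horizon();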