I'm using will_paginate to get the top 10-20 rows from a table, but I've found that the simple query it produces is scanning the entire table.
sqlite> explain query plan
SELECT "deals".* FROM "deals" ORDER BY created_at DESC LIMIT 10 OFFSET 0;
0|0|0|SCAN TABLE deals (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY
If I were using a WITH clause and indexes, I'm sure it would be different, but this is just displaying the newest posts on the top page of the site. I did find a post or two on here that suggested adding indexes anyway, but I don't see how an index can help with the table scan.
sqlite> explain query plan
SELECT deals.id FROM deals ORDER BY id DESC LIMIT 10 OFFSET 0;
0|0|0|SCAN TABLE deals USING INTEGER PRIMARY KEY (~1000000 rows)
It seems like a common use case, so how is it typically done efficiently?
The ORDER BY created_at DESC requires the database to search for the largest values in the entire table.
To speed up this search, you would need an index on the created_at column.
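For example, the index might look like this (a minimal sketch; the index name is my own):
CREATE INDEX idx_deals_created_at ON deals (created_at);
With that index in place, SQLite can read rows in created_at order straight from the index (scanning it backwards for DESC), so the USE TEMP B-TREE FOR ORDER BY step should disappear from the plan and only the 10 requested rows are touched.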
The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
On my real environment test1 contains only a handful of rows (< 100), with unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on a row-by-row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use an Index Scan Backward on a row-by-row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
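A sketch of such an index on the underlying table (the index name is my own):
CREATE INDEX test1_id_desc_idx ON test1 (id DESC NULLS LAST);
With it, the NULLS LAST query above can read the first row straight off the index instead of joining everything first.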
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test1.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly. Index sort order is irrelevant there, since an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
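You can simply drop it. A sketch, assuming the name Postgres auto-generates for CREATE INDEX ON test1 (id):
DROP INDEX test1_id_idx;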
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far from the actual use case to be meaningful.
In the test setup, each table has 100k rows; there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
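A sketch of those constraints (the constraint names are my own):
ALTER TABLE table2 ADD CONSTRAINT table2_joincol_uni UNIQUE (joincol);
ALTER TABLE table1 ADD CONSTRAINT table1_joincol_fk
   FOREIGN KEY (joincol) REFERENCES table2 (joincol);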
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and a good test case.
I tested it on Postgres 9.3; perhaps 13 can do it faster.
I used Occam's razor and excluded some possibilities:
The view (it is slow without the view, too)
The JOIN filtering out rows (unfortunately it doesn't in your test, though with longer md5 values of 5-6 characters it would)
Other basically equivalent SELECT statements (a subquery or EXISTS) do not solve your problem either
I managed to get an index-only scan, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX ON "test1" ("id");
is useless, because of the PK.
If you change this
CREATE INDEX ON "test1" ("joincol");
to this
CREATE INDEX ON test1 (joincol, id);
then the second query uses only indexes.
After you run this
REINDEX TABLE test1;
REINDEX TABLE test2;
VACUUM ANALYZE test1;
VACUUM ANALYZE test2;
you can achieve some performance gain, because you created the indexes before the inserts.
I think the reason is that the database has two aims.
The first aim is to optimize for just a few rows, so it runs a Nested Loop; you can force this with LIMIT x.
The second aim is to optimize for the whole table, running the query fast across all rows.
In this situation the Postgres optimizer does not notice that a simple MAX could run with a Nested Loop. Or perhaps Postgres cannot push a limit into an aggregate clause (which runs over the whole partial select filtered by the query).
And this is very expensive. But you have the possibility to write other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions can help you, too.
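For example, the window-function variant could be phrased like this (a sketch; whether the planner produces a fast plan for it is another question):
SELECT id
FROM (
   SELECT id, row_number() OVER (ORDER BY id DESC) AS rn
   FROM   testview
) sub
WHERE rn = 1;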
Say I have a table with a thousand users and 50 million user_actions. A few users have more than a million actions but most have thousands.
CREATE TABLE users (id serial PRIMARY KEY, name text);
CREATE TABLE user_actions (id serial PRIMARY KEY, user_id int REFERENCES users (id), created_at timestamptz);
CREATE INDEX index_user_actions_on_user_id ON user_actions (user_id);
Querying user_actions by user_id is fast, using the index.
SELECT *
FROM user_actions
WHERE user_id = ?
LIMIT 1
But I'd like to know the last action by a user.
SELECT *
FROM user_actions
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 1
This query throws out the index and does a table scan, backwards, until it finds an action by that user. That's not a problem for users who have been active recently, but it's too slow for users who haven't.
Is there a way to tune this index so postgres keeps track of the last action by each user? (For bonus points the last N actions!)
Or, suggested alternate strategies? I suppose a materialized view of a window function will do the trick.
Create an index on (user_id, created_at)
This will allow PostgreSQL to do an index scan to locate the first record.
This is one of the cases where multi-column indexes make a big difference.
Note we put user_id first because that allows us to efficiently select the sub-portion of the index we are interested in; from there it is just a quick traversal to get the most recent created_at date, provided there aren't a lot of dead rows in the area.
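A minimal sketch of that index (the name is my own; Postgres can scan it backwards, so a plain ascending index serves ORDER BY created_at DESC here):
CREATE INDEX index_user_actions_on_user_id_and_created_at
   ON user_actions (user_id, created_at);
The query from the question can then fetch the most recent action for a user straight from the index:
SELECT * FROM user_actions WHERE user_id = ? ORDER BY created_at DESC LIMIT 1;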
I have a schema that looks like this:
create table image_tags (
image_tag_id serial primary key,
image_id int not null
);
create index on image_tags(image_id);
When I execute a query with two columns in the ORDER BY, it is ridiculously slow (e.g., select * from image_tags order by image_id desc, image_tag_id desc limit 10;). If I drop either one of those columns from the sort (it doesn't matter which), it is super fast.
I used EXPLAIN on both queries, but it didn't help me understand why two columns in the ORDER BY clause were so slow; it just showed me how much slower using two columns was.
For order by image_id desc, image_tag_id desc sorting to be optimized via indexes you need to have this index:
create index image_tags_id_tag on image_tags(image_id, image_tag_id);
Only a composite index (with a few exceptions, I presume, but not in this case) helps the optimizer use it to determine the order straight away.
Try indexing:
create index on image_tags(image_id, image_tag_id);
You only have an index on one of the columns involved in the query you want to execute; for better speed you should create a two-column index such as
create index on image_tags(image_id, image_tag_id);
Why does this query take a long time (30+ seconds) to run?
A:
SELECT TOP 10 * FROM [Workflow] ORDER BY ID DESC
But this query is fast (0 seconds):
B:
SELECT TOP 10 * FROM [Workflow] ORDER BY ReadTime DESC
And this is fast (0 seconds), too:
C:
SELECT TOP 10 * FROM [Workflow] WHERE SubId = '120611250634'
I get why B and C are fast: each has an index it can use (the execution plan screenshots are omitted here). But I don't get why A takes so long when we have ID as part of the primary key.
Edit: the estimated execution plans using ID and ReadTime were attached as screenshots (omitted here).
Well, your primary key is for both ID (ASC) and ReadTime (ASC). The order is not important when you're only having a single column index, but it does matter when you have more columns in the index (a composite key).
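For reference, a composite clustered PK like that would be declared roughly as follows (a sketch; the constraint name is my own):
ALTER TABLE [Workflow] ADD CONSTRAINT [PK_Workflow]
   PRIMARY KEY CLUSTERED (ID ASC, ReadTime ASC);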
Composite clustered keys are not really made for ordering. I'd expect that using
SELECT TOP 10 * FROM [Workflow] ORDER BY ID ASC
will be rather fast, and the best would be
SELECT TOP 10 * FROM [Workflow] ORDER BY ID, ReadTime
Reversing the order is a tricky operation on a composite key.
So in effect, when you order by ReadTime, you have an index ready for that, and that index also knows the exact key of the row involved (both its Id and ReadTime - another good reason to keep the clustered index very narrow). It can look up all the columns rather easily. However, when you order by Id, you don't have an exact fit of an index. The server doesn't trivially know how many rows there are for a given Id, which means the top gets a bit trickier than you'd guess. In effect, your clustered index turns into a waste of space and performance (as far as those sample queries are concerned).
Seeing just the tiny part of your database, I'd say having a clustered index on Id and ReadTime is a bad idea. Why do you do that?
It looks like ID isn't a PK by itself, but along with ReadTime (based on your third screenshot).
Therefore the index is built on the (ID,ReadTime) pair, and this index isn't used by your query.
Try adding an index on ID only.
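A sketch of such an index (the name is my own; SQL Server can also scan it backwards to serve ORDER BY ID DESC):
CREATE NONCLUSTERED INDEX IX_Workflow_ID ON [Workflow] (ID);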
I'm trying to fetch the most recent row in a table. I have a simple timestamp created_at which is indexed. When I query ORDER BY created_at DESC LIMIT 1, it takes far longer than I think it should (about 50ms on my machine on 36k rows).
EXPLAIN-ing claims that it uses a backward index scan, but I confirmed that changing the index to (created_at DESC) does not change the cost reported by the query planner for a simple index scan.
How can I optimize this use case?
Running postgresql 9.2.4.
Edit:
# EXPLAIN SELECT * FROM articles ORDER BY created_at DESC LIMIT 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.58 rows=1 width=1752)
-> Index Scan Backward using index_articles_on_created_at on articles (cost=0.00..20667.37 rows=35696 width=1752)
(2 rows)
Assuming we are dealing with a big table, a partial index might help:
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > '2013-09-15 0:0'::timestamp;
As you already found out: descending or ascending hardly matters here. Postgres can scan backwards at almost the same speed (exceptions apply with multi-column indices).
Query to use this index:
SELECT * FROM tbl
WHERE created_at > '2013-09-15 0:0'::timestamp -- matches index
ORDER BY created_at DESC
LIMIT 1;
The point here is to make the index much smaller, so it should be easier to cache and maintain.
You need to pick a timestamp that is guaranteed to be smaller than the most recent one.
You should recreate the index from time to time to cut off old data.
The condition needs to be IMMUTABLE.
So the one-time effect deteriorates over time. The specific problem is the hard coded condition:
WHERE created_at > '2013-09-15 0:0'::timestamp
Automate
You could update the index and your queries manually from time to time. Or you automate it with the help of a function like this one:
CREATE OR REPLACE FUNCTION f_min_ts()
RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2013-09-15 0:0'::timestamp$$;
Index:
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > f_min_ts();
Query:
SELECT * FROM tbl
WHERE created_at > f_min_ts()
ORDER BY created_at DESC
LIMIT 1;
Automate recreation with a cron job or some trigger-based event. Your queries can stay the same now. But after changing the function you need to recreate all indexes that use it. Just drop and create each one.
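The recreation step might look like this (a sketch; the new cut-off date is just an example):
DROP INDEX tbl_created_recently_idx;
CREATE OR REPLACE FUNCTION f_min_ts()
   RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2014-03-01 0:0'::timestamp$$;
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > f_min_ts();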
First ..
... test whether you are actually hitting the bottleneck with this.
Try whether a simple DROP INDEX ... ; CREATE INDEX ... does the job. If so, your index might have been bloated. Your autovacuum settings may be off.
Or try VACUUM FULL ANALYZE to get your whole table plus indices in pristine condition and check again.
Other options include the usual general performance tuning and covering indexes, depending on what you actually retrieve from the table.
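For instance, if you only ever retrieve a couple of columns, a covering index can enable index-only scans (available since Postgres 9.2). A sketch, assuming hypothetically that you only need id:
CREATE INDEX articles_created_at_id_idx ON articles (created_at DESC, id);
SELECT id FROM articles ORDER BY created_at DESC LIMIT 1;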