Optimize PostgreSQL query with ORDER BY and limit 1 - sql

I have the following PostgreSQL schema:
CREATE TABLE User (
ID INTEGER PRIMARY KEY
);
CREATE TABLE BOX (
ID INTEGER PRIMARY KEY
);
CREATE SEQUENCE seq_item;
CREATE TABLE Item (
ID INTEGER PRIMARY KEY DEFAULT nextval('seq_item'),
SENDER INTEGER REFERENCES User(id),
RECEIVER INTEGER REFERENCES User(id),
INFO TEXT,
BOX_ID INTEGER REFERENCES Box(id) NOT NULL,
ARRIVAL TIMESTAMP
);
Its main use case is a typical producer/consumer scenario. Different users may insert an item into the database in a particular box for a particular user, and each user can retrieve the topmost (i.e., the oldest) item in a box that is addressed to her/him. It more or less mimics the functionality of a queue at the database level.
More precisely, the most common operations are the following:
INSERT INTO ITEM(SENDER, RECEIVER, INFO, BOX_ID, ARRIVAL)
VALUES (nsid, nrid, ncontent, nqid, ntime);
And retrieve commands based on a combination of either RECEIVER+SENDER or RECEIVER+BOX_ID:
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.BOX_ID=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
and
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.SENDER=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
The last two snippets are packed within a stored procedure.
I was wondering how to achieve the best performance given this use case, knowing that the users will insert and retrieve somewhere between 50,000 and 500,000 items (however, the database is never expected to contain more than 100,000 items at any given point).
EDIT
This is the EXPLAIN I get for the SELECT statements with no indexes:
Limit (cost=23.07..23.07 rows=1 width=35)
-> Sort (cost=23.07..25.07 rows=799 width=35)
Sort Key: ARRIVAL
-> Seq Scan on Item i (cost=0.00..19.07 rows=799 width=35)
Filter: (((RECEIVER = 1) OR (RECEIVER IS NULL)) AND (SENDER = 1))
The best EXPLAIN I get, based on my understanding, is when I put an index on the timestamp (CREATE INDEX ind ON Item(ARRIVAL);):
Limit (cost=0.42..2.88 rows=1 width=35)
-> Index Scan using ti on Item i (cost=0.42..5899.42 rows=2397 width=35)
Filter: (((receiver = 2) OR (RECEIVER IS NULL)) AND (SENDER = 2))
In all of the cases without an index on ARRIVAL I have to sort the table, which seems inefficient to me. If I try to combine an index on ARRIVAL with RECEIVER/SENDER I get the same plan, but slightly slower.
Is it correct to assume that a single index on ARRIVAL is the most efficient option?

Regarding indexes, the best way is to create the index, test your query, and analyze the EXPLAIN plan. Sometimes you create an index and the planner doesn't even use it; you will know once you test it.
The primary key gets an index by default, but you need to create the indexes on the foreign key columns yourself:
Postgres and Indexes on Foreign Keys and Primary Keys
You may also consider creating a composite index using the fields in your WHERE clauses.
Take note that even if an index improves SELECTs, it also has an impact on INSERTs/UPDATEs, because the index must be maintained on every write.
But again, you have to test each change and see if it improves your results.
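For the two retrieval patterns in the question, a reasonable starting point would be one composite index per pattern, with ARRIVAL last so the oldest matching row can be read without a separate sort. This is a sketch to test with EXPLAIN, not a guaranteed win; index names are illustrative:

```sql
-- Composite indexes matching the two WHERE clauses;
-- the (RECEIVER = ? OR RECEIVER IS NULL) part is still applied as a filter:
CREATE INDEX idx_item_box_arrival ON Item (BOX_ID, ARRIVAL);
CREATE INDEX idx_item_sender_arrival ON Item (SENDER, ARRIVAL);

-- If several consumers pop items concurrently, the SELECT + DELETE pair
-- can also be collapsed into one atomic statement (PostgreSQL 9.5+):
DELETE FROM Item
WHERE id = (
    SELECT id FROM Item
    WHERE (RECEIVER = $1 OR RECEIVER IS NULL) AND BOX_ID = $2
    ORDER BY ARRIVAL
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
```

The SKIP LOCKED form is the usual way to avoid two consumers grabbing the same row, but whether it helps here depends on how the stored procedure is called; measure both.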

Related

SQLite `explain query plan` shows not every step?

This is some pseudo SQL in which the 'problem' is easily replicated:
create table Child (
childId text primary key,
some_int int not null
);
create table Person (
personId text primary key,
childId text,
foreign key (childId) references Child (childId) on delete cascade
);
create index Person_childId on Person (childId);
explain query plan select count(1)
from Person
left outer join Child on Child.childId = Person.childId
where Person.childId is null or Child.some_int = 0;
The result of the query plan is this:
SCAN Person USING COVERING INDEX Person_childId
SEARCH Child USING INDEX sqlite_autoindex_Child_1 (childId=?) LEFT-JOIN
This looks great, right? But I am curious whether this is the 'full' plan, because some_int does not have an index, yet the query plan does not show that; I don't see the filtering anywhere. The database must filter on this field, right?
When I filter on the some_int field in a separate query, it shows a SCAN, exactly like I thought I would see in the previous query plan, because there is no index:
explain query plan select * from Child where some_int = 0;
Gives:
SCAN Child
Now my questions:
Why isn't there SCAN Child shown in the first query plan?
Why is there a SCAN on Person and not a SEARCH?
Is the first query plan 'quick' or do I still need to add an index?
You should take a look at this page. It explains the sqlite query planner in depth and you can find answers to all your questions.
Note that filtering conditions like WHERE some_int=0 are not displayed in the query plan because they don't affect the plan but only the result set.
In brief:
Why isn't there SCAN Child shown in the first query plan?
Because, due to the LEFT JOIN, sqlite needs to SCAN Person and, for every row of Person, use the index on ChildId to find the corresponding records in Child.
Why is there a SCAN on Person and not a SEARCH?
A SCAN means reading all rows of a table, in the order in which they are stored. A SEARCH is a lookup of a single value in the table, using an index to find the rowid and then using the rowid to get to that row of the table, without needing to scan the whole table to find it.
Since your query needs to read every Person.childId, it does a full SCAN.
Is the first query plan 'quick' or do I still need to add an index?
Your query is already using all the indexes it could use, so it's already as fast as you could get it.
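If you want to see the some_int comparison explicitly, one option (a sketch using the question's own schema) is to drop down from EXPLAIN QUERY PLAN to plain EXPLAIN, which dumps SQLite's low-level bytecode program; the filter appears there as compare/jump opcodes even though the high-level plan omits it:

```sql
EXPLAIN
SELECT count(1)
FROM Person
LEFT OUTER JOIN Child ON Child.childId = Person.childId
WHERE Person.childId IS NULL OR Child.some_int = 0;
-- The output lists VDBE opcodes; look for the ones that read Child.some_int
-- and compare it against 0.
```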

Indexing not working with column using range operations in oracle

I have created an index on the timestamp column of my table, but when I query it and check the explain plan in Oracle, it is doing a full table scan rather than a range scan.
Below is the DDL script for the table
CREATE TABLE EVENT (
event_id VARCHAR2(100) NOT NULL,
status VARCHAR2(50) NOT NULL,
timestamp NUMBER NOT NULL,
action VARCHAR2(50) NOT NULL
);
ALTER TABLE EVENT ADD CONSTRAINT PK_EVENT PRIMARY KEY ( event_id ) ;
CREATE INDEX IX_EVENT$timestamp ON EVENT (timestamp);
Below is the explain plan query used to get the explain plan -
EXPLAIN PLAN SET STATEMENT_ID = 'test3' for select * from EVENT where timestamp between 1620741600000 and 1621900800000 and status = 'CANC';
SELECT * FROM PLAN_TABLE WHERE STATEMENT_ID = 'test3';
Here is the explain plan that oracle returned -
I am not sure why the index is not working here, rather it is still doing the full table scan even after creating the index on the timestamp column.
Can someone please help me with this.
Gordon is correct. You need an index like this to speed up the query you showed us (note it needs a new name, since IX_EVENT$timestamp already exists):
CREATE INDEX IX_EVENT$status_timestamp ON EVENT (status, timestamp);
Why? Your query requires an equality match on status and then a range scan on timestamp. Without the possibility of using the index for the equality match, Oracle's optimizer seems to have decided it's cheaper to scan the table than the index.
Why did it decide that?
Who knows? Hundreds of programmers have been working on the optimizer for many decades.
Who cares? Just use the right index for the query.
The optimizer is cost based. So conceptually the optimizer will evaluate all the available plans, estimate the cost, and pick the one that has the lowest estimated cost. The costs are estimated based on statistics. The index statistics are automatically collected when an index is built. However your table statistics may not reflect real life.
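If stale statistics are the suspect, refreshing them before re-checking the plan is cheap to try (the schema name below is illustrative):

```sql
-- Gather fresh optimizer statistics for the table so cost estimates
-- reflect the actual data distribution:
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'MYSCHEMA', tabname => 'EVENT');
```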
An Active SQL Monitor report will help you diagnose the issue.

What is the best way to optimize sql query that hits/reads too many shared blocks?

I have the following table in my postgres db (version 11):
CREATE TABLE instance
(
id bigserial,
period DATERANGE NOT NULL,
status TEXT NOT NULL, -- (wait, active, inactive, outdated)
position BIGINT NOT NULL,
CONSTRAINT instance_pk PRIMARY KEY (id),
CONSTRAINT instance_period_check CHECK (NOT isempty(period))
);
I need to change status of instances ordered by position by batches size of 1000 from java code:
List<Instance> instances;
Long position = null;
do {
instances = dao.getInstancesBeforePeriod(fromStatus, position);
if (instances.isEmpty()) {
break;
}
processBatch(toStatus, instances);
position = instances.get(instances.size() - 1).getPosition();
} while (true);
dao.getInstancesBeforePeriod(fromStatus, position) if position == null calls the query:
SELECT id, status, period, position
FROM instance
WHERE status = :status
AND upper(period) < now()
ORDER BY position
LIMIT 1000;
if position != null calls the query:
SELECT id, status, period, position
FROM instance
WHERE status = :status
AND upper(period) < now()
AND position > :position
ORDER BY position
LIMIT 1000;
But the first query hits/reads too many shared blocks so the query fails with timeout exception.
How can I solve the problem?
What if I'll add an index on instance table:
create index concurrently instance_index_status_upper_period_position
on instance(status, upper(period), position)
But I want the index to keep instances ordered by position. Is that possible? Should I change the first query by adding position > 0 to the WHERE clause to make use of such an index?
The explain analyze result for the first query:
I'd appreciate any ideas. Thanks! :)
The main problem of your query is that the selectivity of the predicate (status = :status AND upper(period) < now() AND position > :position) is probably bad.
What percentage of the rows does this predicate select (on average)?
Note that the LIMIT clause is irrelevant for the index optimisation.
In other words, that means the decomposition of the predicate into access predicate and filtering predicate cannot be done in an efficient way. Maybe (status, upper(period)) can be used as access and (position) as filtering predicate, or maybe (status, position) can be used as access and (upper(period)) is used as filtering. You should try both combinations since one may require less I/O or less index scanning than the other.
create index ix1 on instance(status, upper(period), position);
create index ix2 on instance(status, position, upper(period));
You have ix1 already.
Now, if the average selectivity is above 5.0% I wouldn't set my expectations too high. A full scan can be more effective.
One final trick that could help a bit is to make the index a "covering index", that is, to include all selected columns in it as well. However, I doubt that it will make much of a difference, since the query will hit the heap at most 1000 times. In other words, enhance your indexes to:
create index ix3 on instance(status, upper(period), position, period, id);
create index ix4 on instance(status, position, upper(period), period, id);
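Since you are on PostgreSQL 11, the same covering effect can also be expressed with an INCLUDE clause, which carries the extra columns in the index without making them part of the sort key (a sketch; compare both forms with EXPLAIN):

```sql
-- PostgreSQL 11+: payload columns go in INCLUDE instead of the key
create index ix5 on instance (status, upper(period), position) include (period, id);
```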

Right index for timestamp field on Postgresql

here is the table
CREATE TABLE material
(
mid bigserial NOT NULL,
...
active_from timestamp without time zone,
....
CONSTRAINT material_pkey PRIMARY KEY (mid),
)
CREATE INDEX i_test_t_year
ON material
USING btree
(date_part('year'::text, active_from));
if I made sorting by mid field
select mid from material order by mid desc
"Index Only Scan Backward using material_pkey on material (cost=0.29..3573.20 rows=100927 width=8)"
but if I use active_from for sorting
select * from material order by active_from desc
"Sort (cost=12067.29..12319.61 rows=100927 width=16)"
" Sort Key: active_from"
" -> Seq Scan on material (cost=0.00..1953.27 rows=100927 width=16)"
Is my index for active_from wrong? How do I create the right one to lower the cost?
The index on date_part('year'::text, active_from) can't be used to sort by active_from; you know that sorting by that function and then by active_from gives the same order as simply sorting by active_from but postgresql doesn't. If you create the following index:
CREATE INDEX i_test_t_year ON material (active_from);
then postgresql will be able to use it to answer the query:
Index Scan Backward using i_test_t_year on material (cost=0.15..74.70 rows=1770 width=16)
However, remember that postgresql will only use the index if it thinks it will be faster than doing a sequential scan then sorting, so creating the correct index doesn't guarantee that it will be used for this query.
You need to fully understand what an index is, and I believe this will answer your question.
An index is literally a lookup structure stored alongside the table: you look at the index, and it points to the rows to fetch.
If you look at your index, what you are storing is only the year extracted from the "active_from" column (the 'year'::text is just the name of the field being extracted; the stored value is the year itself).
So if you were to look at the index, it would look like a bunch of entries saying:
2015
2015
2014
2014
2013
etc.
Only the year is stored, not the full timestamp.
In your query you are ordering DESC by the full timestamp value.
So the query just doesn't match the index as you have stored it.
If you put the ORDER BY in your query as "order by date_part('year'::text, active_from)" then the planner could use the index you put there.
So I suggest you just add the index on "active_from" without extracting the year at all.

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that the criterion > 1000 on a unique column is non-selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). For an index to be considered helpful, the optimizer must conclude that < 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the nonclustered index that was scanned?
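If the optimizer keeps choosing the nonclustered index, the hint would look like this, assuming the clustered primary key is named PK_WorkFlow as in the DDL above:

```sql
-- Force the clustered PK index for the range scan on ID
SELECT TOP 100 ID
FROM Workflow WITH (INDEX(PK_WorkFlow))
WHERE ID > 1000
ORDER BY ID;
```

Treat the hint as a diagnostic first: if the hinted plan is in fact much faster, the underlying cause is usually statistics or selectivity estimates, which are better fixed directly.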