sqlite3: selecting min and max together is much slower than selecting them separately

sqlite> explain query plan select max(utc_time) from RequestLog;
0|0|0|SEARCH TABLE RequestLog USING COVERING INDEX key (~1 rows) # very fast
sqlite> explain query plan select min(utc_time) from RequestLog;
0|0|0|SEARCH TABLE RequestLog USING COVERING INDEX key (~1 rows) # very fast
sqlite> explain query plan select min(utc_time), max(utc_time) from RequestLog;
0|0|0|SCAN TABLE RequestLog (~8768261 rows) # will be very very slow
When I select min and max separately, it works perfectly. However, SQLite 'forgets' the index when I select min and max together, for some reason. Is there any configuration I can apply (I already ran ANALYZE; it didn't help)? Or is there an explanation for this behavior?
EDIT1
sqlite> .schema
CREATE TABLE FixLog(
    app_id text, __key__id INTEGER,
    secret text, trace_code text, url text,
    action text, facebook_id text, ip text,
    tw_time datetime, time datetime,
    tag text, to_url text,
    from_url text, referer text, weight integer,
    Unique(app_id, __key__id)
);
CREATE INDEX key4 on FixLog(action);
CREATE INDEX time on FixLog(time desc);
CREATE INDEX tw_time on FixLog(tw_time desc);
sqlite> explain query plan select min(time) from FixLog;
0|0|0|SEARCH TABLE FixLog USING COVERING INDEX time (~1 rows)
sqlite> explain query plan select max(time) from FixLog;
0|0|0|SEARCH TABLE FixLog USING COVERING INDEX time (~1 rows)
sqlite> explain query plan select max(time), min(time) from FixLog;
0|0|0|SCAN TABLE FixLog (~1000000 rows)

This is a known quirk of the sqlite query optimizer, as explained here: http://www.sqlite.org/optoverview.html#minmax:
Queries of the following forms will be optimized to run in logarithmic time assuming appropriate indices exist:
SELECT MIN(x) FROM table;
SELECT MAX(x) FROM table;
In order for these optimizations to occur, they must appear in exactly the form shown above - changing only the name of the table and column. It is not permissible to add a WHERE clause or do any arithmetic on the result. The result set must contain a single column. The column in the MIN or MAX function must be an indexed column.
UPDATE (2017/06/23): The docs have since been updated to say that a query containing a single MAX or MIN might be satisfied by an index lookup (allowing for things like arithmetic on the result); however, they still preclude having more than one such aggregate in a single query (so MIN, MAX together will still be slow):
Queries that contain a single MIN() or MAX() aggregate function whose argument is the left-most column of an index might be satisfied by doing a single index lookup rather than by scanning the entire table. Examples:
SELECT MIN(x) FROM table;
SELECT MAX(x)+1 FROM table;
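A common workaround (a sketch, untested here) is to split the two aggregates into scalar subqueries, so that each one matches the optimizable form on its own:
SELECT (SELECT MIN(time) FROM FixLog) AS min_time,
       (SELECT MAX(time) FROM FixLog) AS max_time;
Each scalar subquery then matches the single-MIN/MAX pattern above, so both should use the covering index and the combined query stays logarithmic.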

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
    id BIGSERIAL PRIMARY KEY,
    joincol VARCHAR
);
CREATE TABLE test2 (
    joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
    SELECT test1.id,
           test1.joincol AS t1charcol,
           test2.joincol AS t2charcol
    FROM test1, test2
    WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100 ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a row joins successfully, and returns immediately.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely" before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
On my real environment, test1 contains only a handful of rows (< 100), having unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on a row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use an Index Scan Backward on a row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
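A sketch of such an index on the underlying table (the index name is illustrative):
CREATE INDEX test1_id_desc_idx ON test1 (id DESC NULLS LAST);
With that in place, the ORDER BY id DESC NULLS LAST LIMIT 1 variant can get the fast plan too.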
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly; regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON test1 (id);
The PK on test1.id is implemented with a unique index on the column, which already covers everything the additional index might do for you.
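Assuming you let Postgres generate the default name when creating it, dropping it would look like this (a sketch; check the actual name with \d test1):
DROP INDEX test1_id_idx;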
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from actual use case to be meaningful.
In the test setup, each table has 100k rows; there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has ~10M rows in test2 and < 100 rows in test1, every value in test2.joincol has a match in test1.joincol, both are defined NOT NULL, and test1.joincol is unique. A classic one-to-many relationship. There should be a UNIQUE constraint on test1.joincol and an FK constraint test2.joincol --> test1.joincol.
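A sketch of those constraints, with illustrative names:
ALTER TABLE test1 ADD CONSTRAINT test1_joincol_uni UNIQUE (joincol);
ALTER TABLE test2 ADD CONSTRAINT test2_joincol_fkey
    FOREIGN KEY (joincol) REFERENCES test1 (joincol);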
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and a good test case.
I tested it on Postgres 9.3; perhaps 13 can do it faster.
I applied Occam's razor and excluded some possibilities:
The view (it is just as slow without the view)
The JOIN filtering out rows (unfortunately it doesn't in your test, though with longer md5 values, 5-6 characters, it does)
Other basically equivalent SELECT statements (a subquery or EXISTS), which do not solve your problem
I managed to get index-only access, but because the tables are no bigger than the indexes, that was not the solution.
I think
CREATE INDEX ON test1 (id);
is useless, because of the PK.
If you change this
CREATE INDEX ON test1 (joincol);
to this
CREATE INDEX ON test1 (joincol, id);
then the second query uses just the indexes.
After you run
REINDEX TABLE test1;
REINDEX TABLE test2;
VACUUM ANALYZE test1;
VACUUM ANALYZE test2;
you gain some performance, because you created the indexes before the inserts.
I think the reason is that the database has two optimization goals.
The first goal is to optimize for just a few rows, so it runs a Nested Loop; you can force this with LIMIT x.
The second goal is to optimize for the whole table: run the query fast across all rows.
In this situation the Postgres optimizer didn't notice that a simple MAX can be run with a NESTED LOOP. Or perhaps Postgres cannot push a LIMIT into an aggregate clause (it can only run over the whole, partially filtered, select).
And this is very expensive. But you have the possibility to write other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions can help you too, as in the sketch below.
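For instance, a hypothetical sketch with first_value(); whether the planner finds a faster plan for it would need testing:
SELECT first_value(id) OVER (ORDER BY id DESC) AS max_id
FROM testview
LIMIT 1;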

Why does using the MAX function in a query cause a PostgreSQL performance issue?

I have a table with three columns: time_stamp, device_id and status, where status is of type jsonb. The time_stamp and device_id columns are indexed. I need to grab the latest value of status with id 1.3.6.1.4.1.34094.1.1.1.1.1 which is not null.
You can find the query execution times of the following commands, with and without MAX, below.
Query with MAX:
SELECT DISTINCT MAX(time_stamp) FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
Query without MAX:
SELECT DISTINCT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
The first query takes about 3 sec and the second one just 3 msec, with two different plans. I think both queries should have the same plan. Why does it not use the index when it calculates MAX? How can I improve the running time of the first query?
PS: I use Postgres 9.6 (dockerized).
Also this is table definition.
-- Table: device.status_events
-- DROP TABLE device.status_events;
CREATE TABLE device.status_events
(
    time_stamp timestamp with time zone NOT NULL,
    device_id bigint,
    status jsonb,
    is_active boolean DEFAULT true,
    CONSTRAINT status_events_device_id_fkey FOREIGN KEY (device_id)
        REFERENCES device.devices (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
    OIDS=FALSE
);
ALTER TABLE device.status_events
OWNER TO monitoring;
-- Index: device.status_events__time_stamp
-- DROP INDEX device.status_events__time_stamp;
CREATE INDEX status_events__time_stamp
ON device.status_events
USING btree
(time_stamp);
The index you show us cannot produce the first plan you show us. With that index, the plan would have to be applying a filter for the jsonb column, which it isn't. So the index must be a partial index, with the filter being applied at the index level so that it is not needed in the plan.
PostgreSQL is using an index for the max query; it just isn't the index you want it to use.
All of your device_id=7 rows must have low timestamps, but PostgreSQL doesn't know this. It thinks that by walking down the timestamps index, it will quickly find a device_id=7 row and then be done. But instead it needs to walk a large chunk of the index before finding such a row.
You can force it away from the "wrong" index by changing the aggregate expression to something like:
MAX(time_stamp + interval '0')
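Applied to the first query, that could look like the following sketch; the no-op interval arithmetic merely stops the planner from matching the bare time_stamp index:
SELECT MAX(time_stamp + interval '0')
FROM device.status_events
WHERE (device_id = 7) AND
      (status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');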
Or you could instead build a more tailored index, which the planner will choose instead of the falsely attractive one:
create index on device.status_events (device_id , time_stamp)
where status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}';
I believe this should generate a better plan
SELECT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}')
ORDER BY time_stamp DESC
LIMIT 1
Let me know how that works for you.

Is it helpful to create a multi-column index if the columns are already indexed on their own?

Let's say your table has three columns:
time (integer)
name (varchar)
other_column (varchar)
and you have two indexes:
CREATE INDEX index_time ON my_table (time);
CREATE INDEX index_name ON my_table (name);
In this case, does it make any difference if I create a new index based on both time and name? i.e.:
CREATE INDEX index_name_and_time ON my_table (name,time);
In regard to overall performance, the three indexes may be overkill and have a detrimental effect when inserting, as there are then three indexes to maintain along with the extra memory/space utilisation.
However, the first factor would be to ascertain whether the indexes would actually be utilised, which depends upon what queries are to be run.
From a brief play with the following code, which you could use as the basis to explore more fully (EXPLAIN QUERY PLAN your_query being a tool to use):-
DROP TABLE IF EXISTS my_table;
DROP INDEX IF EXISTS index_time;
DROP INDEX IF EXISTS index_name;
DROP INDEX IF EXISTS index_name_and_time;
CREATE TABLE IF NOT EXISTS my_table (time INTEGER, name TEXT, other TEXT);
CREATE INDEX IF NOT EXISTS index_time ON my_table (time); -- INDEX 1
-- CREATE INDEX IF NOT EXISTS index_name ON my_table (name); -- INDEX 2
-- CREATE INDEX index_name_and_time ON my_table (name,time); -- INDEX 3
EXPLAIN QUERY PLAN
SELECT * FROM my_table; -- QUERY 1
-- EXPLAIN QUERY PLAN
-- SELECT time, name, other FROM my_table -- QUERY 2
-- EXPLAIN QUERY PLAN
-- SELECT time, name, other FROM my_table ORDER BY time, name; -- QUERY 3
-- EXPLAIN QUERY PLAN
-- SELECT time, name, other FROM my_table ORDER BY name, time; -- QUERY 4
The following results can be obtained :-
First two Queries, no advantage, just disadvantage.
Having no indexes through to having all 3 makes no difference to the first 2 queries (basically the same). None use any of the indexes when 0,1,2 or 3 indexes are available. They use SCAN TABLE my_table
The 3rd Query
Without any indexes then SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With just the first index SCAN TABLE my_table USING INDEX index_time and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY.
With the 1st and 2nd SCAN TABLE my_table USING INDEX index_time and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY.
With just the 2nd SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With all 3 SCAN TABLE my_table USING INDEX index_time and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With just the 3rd SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
The 4th query
Without any SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With 1 SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With 1 and 2 SCAN TABLE my_table USING INDEX index_name and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With 2 SCAN TABLE my_table USING INDEX index_name and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With 1,2 and 3 SCAN TABLE my_table USING INDEX index_name_and_time
With just 3 SCAN TABLE my_table USING INDEX index_name_and_time
Of course this is not factoring in timings, as the tables are empty. The code above could easily be adapted to include data and thus have timings applied (a sketch follows below). Note you may also want to consider effects other than running queries, such as insertions and deletions, which would alter the indexes.
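For example, a minimal sketch (the values are arbitrary) that fills the table with 100k rows via a recursive CTE, after which the EXPLAIN QUERY PLAN statements can be re-run and timed:
WITH RECURSIVE cnt(x) AS (
    SELECT 1
    UNION ALL
    SELECT x + 1 FROM cnt WHERE x < 100000
)
INSERT INTO my_table (time, name, other)
SELECT x, 'name' || (x % 1000), 'other' || x FROM cnt;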
The Answer - It depends.
So, at least from an index-utilisation point of view, it's quite clear that whether an index is useful depends upon the queries used.
The third index, (name, time), makes the separate (name) index redundant, since the composite index can also serve lookups on name alone.
You should probably drop the (name) index and keep just (name, time) and (time) -- if those are the indexes that you think you need (see the DDL sketch below).
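In DDL terms, that suggestion is simply (a sketch against the indexes defined above):
DROP INDEX index_name; -- covered by index_name_and_time (name, time)
-- keep index_time (time) and index_name_and_time (name, time)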

Oracle: Full text search with condition

I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But let's say we have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and the following two columns among others, A and B, both numeric. Let's say there are 500 different values of A and 2000 different values of B, and each row is unique.
Now let's consider select ... where A = x and B = y
With separate indexes on A and B, as far as I can tell, the database does an index search on B, which returns 500 different rows, and then does a join/scan on those rows. Either way, at least 500 rows have to be looked at (unless the database gets lucky and finds the required row early).
An index on (A,B) is much more effective: it finds the one row in a single index search.
Putting separate indexes on group_id and the text, I feel, leaves the query planner with only these options:
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).
Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similarly to a multi-column index. This seems to be the option (4) you're looking for:
begin
ctx_ddl.create_index_set('my_table_index_set');
ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/
create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
parameters('index set my_table_index_set');
select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't guarantee this isn't an inappropriate use of these indexes, as @NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.
I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.
Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name Null Type
-------- -------- ---------
ID NOT NULL NUMBER(4)
GROUP_ID NUMBER(4)
TEXT CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID COUNT PCT
---------------------- ---------------------- ----------------------
1 1 1
2 2 1
3 4 3
4 8 6
5 16 12
6 32 24
7 64 47
8 9 7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
    select T.pkcol
    from T
    where group_id = 43
)
If group_id has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select *
from T
inner join
(
    select T.pkcol
    from T
    where group_id = 43
) MyGroup
on T.pkcol = MyGroup.pkcol
where contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then apply the contains predicate only to the group 43 rows.

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous: the clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
    Id int IDENTITY,
    ReadTime datetime,
    -- ... other columns,
    CONSTRAINT PK_WorkFlow
        PRIMARY KEY CLUSTERED (Id)
)

CREATE INDEX idx_LastModifiedTime
    ON WorkFlow (LastModifiedTime)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that the criterion > 1000 on a unique column is not selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). For an index to be considered helpful, the optimizer must conclude that < 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews); a sketch follows below. What is the structure of the non-clustered index that was scanned?
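For example, a sketch of such a hint, assuming the clustered PK index is named PK_WorkFlow as in the DDL above:
SELECT TOP 100 ID
FROM Workflow WITH (INDEX(PK_WorkFlow))
WHERE ID > 1000
ORDER BY ID;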