Extremely slow distinct query on indexed column - sql

In a Postgres database, I am querying distinct values of MY_DATE in a large table with 300 million rows. There are about 400 of them and the column MY_DATE is indexed.
Select distinct MY_DATE from MY_TABLE;
The query runs for 22 min.
The same query on my Oracle DB, with the exact same data set and the same index definition, runs in 11 seconds.
The query plan shows that the query is using the index:
EXPLAIN Select distinct MY_DATE from MY_TABLE LIMIT 200;
gives:
QUERY PLAN
Limit  (cost=0.57..7171644.14 rows=200 width=8)
  ->  Unique  (cost=0.57..15419034.24 rows=430 width=8)
        ->  Index Only Scan using idx_obsdate on my_table  (cost=0.57..14672064.14 rows=298788038 width=8)
When I limit the results, the query can become much faster. E.g.
Select distinct MY_DATE from MY_TABLE LIMIT 5;
runs in sub-seconds.
but:
Select distinct MY_DATE from MY_TABLE LIMIT 50;
already takes minutes. Time seems to increase exponentially with the LIMIT clause.
I expect the Postgres query to run in seconds, as my OracleDB does.
20 minutes for an index scan - even for a large table - seems way off the mark.
Any suggestions what causes the issue and what I can do?

distinct values ... 300 million rows ... about 400 of them ... column ... indexed.
There are much faster techniques for this. Emulating a loose index scan (a.k.a. skip scan), and assuming my_date is defined NOT NULL (or we can ignore NULL values):
WITH RECURSIVE cte AS (
   SELECT min(my_date) AS my_date
   FROM   my_table
   UNION ALL
   SELECT (SELECT my_date
           FROM   my_table
           WHERE  my_date > cte.my_date
           ORDER  BY my_date
           LIMIT  1)
   FROM   cte
   WHERE  my_date IS NOT NULL
   )
TABLE cte;
Related:
Optimize GROUP BY query to retrieve latest record per user
Using the index you mentioned it should finish in milliseconds.
Oracle DB ... 11 seconds.
Because Oracle has native index skip scans and Postgres does not. There are ongoing efforts to implement similar functionality in Postgres 12.
Currently (Postgres 11), while the index is used to good effect, even in an index-only scan, Postgres cannot skip ahead and has to read index tuples in sequence. Without LIMIT, the complete index has to be scanned. Hence we see in your EXPLAIN output:
Index Only Scan ... rows=298788038
The suggested new query achieves the same with reading 400 index tuples (one per distinct value). Big difference.
With LIMIT (and no ORDER BY!) like you tested, Postgres stops as soon as enough rows are retrieved. Increasing the limit has a roughly linear effect, but since the number of rows per distinct value can vary, so does the added cost per step.
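As a usage sketch (same assumptions as above: my_date is NOT NULL, or NULLs can be ignored), the same CTE can also deliver the number of distinct dates cheaply by replacing the final TABLE cte with an aggregate:

WITH RECURSIVE cte AS (
   SELECT min(my_date) AS my_date
   FROM   my_table
   UNION ALL
   SELECT (SELECT my_date
           FROM   my_table
           WHERE  my_date > cte.my_date
           ORDER  BY my_date
           LIMIT  1)
   FROM   cte
   WHERE  my_date IS NOT NULL
   )
SELECT count(my_date) AS distinct_dates   -- count(my_date) ignores the trailing NULL row the recursion produces
FROM   cte;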

Related

Fast query in PostgreSQL

I have a very large database (~1TB), so running even a very simple query can take a very long time. E.g. for:
EXPLAIN select count(*) from users;
the estimated cost is 44661683.87 disk page fetches, making it very expensive to execute.
When I try to put a limit on the query like:
EXPLAIN select count(*) from users limit 10;
the cost of executing the query remains the same, i.e. 44661683.87 disk page fetches.
So (1) is it possible to execute a query on a subset of data and then extrapolate to the rest of the table? The approximate row count can be found quickly using something like:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'users';
Moreover, (2) is it possible to select a randomly distributed subset of rows?
is it possible to execute a query on subset of data and then extrapolate to the rest of the table
You could use the tablesample option:
select count(*) * 10
from the_table tablesample system (10);
tablesample system (10) will only scan 10 percent of the blocks of the table, which should be quite fast. If you multiply the resulting row count by 10 you'll have an approximation(!) of the total number of rows. The smaller the sample size, the faster this will be - but also the less accurate.
The accuracy of the number depends on how much free space your table has, because the 10% (or whatever sample size you choose) is based on the total number of blocks in the table. If there are many free (or half-free) blocks, the number will be less reliable.
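If free space makes SYSTEM sampling too unreliable for your table, a row-level sample is an alternative worth testing (a sketch; BERNOULLI visits every block, so it is slower, but the sample is taken per row rather than per block):

select count(*) * 100
from the_table tablesample bernoulli (1);   -- samples roughly 1 percent of the rows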
select count(*) ... is an aggregation query with no GROUP BY. It returns 1 row, so the LIMIT has no impact.
You seem to want:
select count(*)
from (select u.*
      from users u
      limit 10
     ) t;
As for your second question, Postgres introduced tablesample in version 9.5. You can investigate that.
If you have a primary key index on the users table (or an index on another column), you can get it to use that index for an index-only scan, which should result in a much better execution plan. But, strangely, it won't work with COUNT directly, so you can do a SELECT DISTINCT in a subquery and then COUNT in an outer query to force it to use the index:
EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT id FROM users) u;

select first rows in oracle without full table scan

I'm trying to perform the following query in Oracle.
select * from (select rownum r, account from fooTable) where r<5001;
It selects the 1st 5000 rows. The problem is that fooTable has a lot of data in it (35 million+ rows), and this is really slowing down the query. According to the query analyzer it's performing a full table scan.
My question is, is there a way to speed up this statement? Since I'm only fetching the 1st N rows, is the full table scan necessary?
mj
I have found the /*+ FIRST_ROWS(n) */ hint to be very helpful in cases like this (such as for limiting pagination results). You replace n with whatever value you want.
select /*+ FIRST_ROWS(5000) */ account
from   fooTable
where  rownum <= 5000;
You still need the rownum predicate to limit rows, but the hint lets the optimizer know you only need a lazy fetch of n rows.
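If you happen to be on Oracle 12c or later (the question doesn't say), the row-limiting clause expresses the same intent without the hint or the rownum predicate; a sketch:

select account
from   fooTable
fetch first 5000 rows only;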

Indexed ORDER BY with LIMIT 1

I'm trying to fetch the most recent row in a table. I have a simple timestamp created_at which is indexed. When I query ORDER BY created_at DESC LIMIT 1, it takes far longer than I think it should (about 50 ms on my machine, on 36k rows).
EXPLAIN-ing claims that it uses a backwards index scan, but I confirmed that changing the index to (created_at DESC) does not change the cost in the query planner for a simple index scan.
How can I optimize this use case?
Running postgresql 9.2.4.
Edit:
# EXPLAIN SELECT * FROM articles ORDER BY created_at DESC LIMIT 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..0.58 rows=1 width=1752)
  ->  Index Scan Backward using index_articles_on_created_at on articles  (cost=0.00..20667.37 rows=35696 width=1752)
(2 rows)
Assuming we are dealing with a big table, a partial index might help:
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > '2013-09-15 0:0'::timestamp;
As you already found out: descending or ascending hardly matters here. Postgres can scan backwards at almost the same speed (exceptions apply with multi-column indices).
Query to use this index:
SELECT * FROM tbl
WHERE created_at > '2013-09-15 0:0'::timestamp -- matches index
ORDER BY created_at DESC
LIMIT 1;
The point here is to make the index much smaller, so it should be easier to cache and maintain.
You need to pick a timestamp that is guaranteed to be smaller than the most recent one.
You should recreate the index from time to time to cut off old data.
The condition needs to be IMMUTABLE.
So the one-time effect deteriorates over time. The specific problem is the hard coded condition:
WHERE created_at > '2013-09-15 0:0'::timestamp
Automate
You could update the index and your queries manually from time to time. Or you automate it with the help of a function like this one:
CREATE OR REPLACE FUNCTION f_min_ts()
  RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2013-09-15 0:0'::timestamp$$;
Index:
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > f_min_ts();
Query:
SELECT * FROM tbl
WHERE created_at > f_min_ts()
ORDER BY created_at DESC
LIMIT 1;
Automate recreation with a cron job or some trigger-based event. Your queries can stay the same now. But you need to recreate all indexes that use this function after changing it. Just drop and create each one.
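A minimal sketch of such a maintenance script (the new cutoff date is only an example):

BEGIN;
DROP INDEX IF EXISTS tbl_created_recently_idx;
CREATE OR REPLACE FUNCTION f_min_ts()
  RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2013-12-01 0:0'::timestamp$$;          -- move the cutoff forward
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE  created_at > f_min_ts();
COMMIT;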
First ...
... test whether you are actually hitting the bottleneck with this.
Try whether a simple DROP INDEX ... ; CREATE INDEX ... does the job. If so, your index may have been bloated and your autovacuum settings may be off.
Or try VACUUM FULL ANALYZE to get your whole table plus indices in pristine condition and check again.
Other options include the usual general performance tuning and covering indexes, depending on what you actually retrieve from the table.
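For example (a sketch; id and title are made-up column names), on 9.2 a "covering" index is simply a multicolumn index that contains every column the query reads, which can enable an index-only scan (the INCLUDE clause only arrives in Postgres 11):

CREATE INDEX articles_covering_idx ON articles (created_at DESC, id, title);

SELECT id, title
FROM   articles
ORDER  BY created_at DESC
LIMIT  1;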

SQLite: COUNT slow on big tables

I'm having a performance problem in SQLite with a SELECT COUNT(*) on large tables.
As I didn't yet receive a usable answer and I did some further testing, I edited my question to incorporate my new findings.
I have 2 tables:
CREATE TABLE Table1 (
Key INTEGER NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL,
CONSTRAINT PK_Table1 PRIMARY KEY (Key ASC))
CREATE TABLE Table2 (
Key INTEGER NOT NULL,
Key2 INTEGER NOT NULL,
... a few other fields ...,
CONSTRAINT PK_Table2 PRIMARY KEY (Key ASC, Key2 ASC))
Table1 has around 8 million records and Table2 has around 51 million records, and the database file is over 5 GB.
Table1 has 2 more indexes:
CREATE INDEX IDX_Table1_Status ON Table1 (Status ASC, Key ASC)
CREATE INDEX IDX_Table1_Selection ON Table1 (Selection ASC, Key ASC)
"Status" is required field, but has only 6 distinct values, "Selection" is not required and has only around 1.5 million values different from null and only around 600k distinct values.
I did some tests on both tables, you can see the timings below, and I added the "explain query plan" for each request (QP). I placed the database file on an USB-memorystick so i could remove it after each test and get reliable results without interference of the disk cache. Some requests are faster on USB (I suppose due to lack of seektime), but some are slower (table scans).
SELECT COUNT(*) FROM Table1
Time: 105 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
Time: 153 sec
QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Key = 5123456
Time: 5 ms
QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 16 sec
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
SELECT * FROM Table1 WHERE Selection = 'SomeValue' AND Key > 5123456 LIMIT 1
Time: 9 ms
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Selection (Selection=?) (~3 rows)
As you can see, the counts are very slow, but normal selects are fast (except for the second one, on Status, which took 16 seconds).
The same goes for Table2:
SELECT COUNT(*) FROM Table2
Time: 528 sec
QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~1000000 rows)
SELECT COUNT(Key) FROM Table2
Time: 249 sec
QP: SCAN TABLE Table2 (~1000000 rows)
SELECT * FROM Table2 WHERE Key = 5123456 AND Key2 = 0
Time: 7 ms
QP: SEARCH TABLE Table2 USING INDEX sqlite_autoindex_Table2_1 (Key=? AND Key2=?) (~1 rows)
Why is SQLite not using the automatically created index on the primary key of Table1?
And why, when it does use the auto-index on Table2, does it still take a lot of time?
I created the same tables with the same content and indexes on SQL Server 2008 R2 and there the counts are nearly instantaneous.
One of the comments below suggested executing ANALYZE on the database. I did and it took 11 minutes to complete.
After that, I ran some of the tests again:
SELECT COUNT(*) FROM Table1
Time: 104 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~7848023 rows)
SELECT COUNT(Key) FROM Table1
Time: 151 sec
QP: SCAN TABLE Table1 (~7848023 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 5 ms
QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid>?) (~196200 rows)
SELECT COUNT(*) FROM Table2
Time: 529 sec
QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~51152542 rows)
SELECT COUNT(Key) FROM Table2
Time: 249 sec
QP: SCAN TABLE Table2 (~51152542 rows)
As you can see, the queries took the same time (except that the query plan now shows the real number of rows); only the slower select is now also fast.
Next, I created an extra index on the Key field of Table1, which should correspond to the auto-index. I did this on the original database, without the ANALYZE data. It took over 23 minutes to create this index (remember, this is on a USB stick).
CREATE INDEX IDX_Table1_Key ON Table1 (Key ASC)
Then I ran the tests again:
SELECT COUNT(*) FROM Table1
Time: 4 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Key(~1000000 rows)
SELECT COUNT(Key) FROM Table1
Time: 167 sec
QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 17 sec
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
As you can see, the index helped with the count(*), but not with the count(Key).
Finally, I created the table using a column constraint instead of a table constraint:
CREATE TABLE Table1 (
Key INTEGER PRIMARY KEY ASC NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL)
Then I ran the tests again:
SELECT COUNT(*) FROM Table1
Time: 6 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
Time: 28 sec
QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 10 sec
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
Although the query plans are the same, the times are a lot better. Why is this?
The problem is that ALTER TABLE does not allow converting an existing table, and I have a lot of existing databases which I cannot convert to this form. Besides, using a column constraint instead of a table constraint won't work for Table2.
Does anyone have any idea what I am doing wrong and how to solve this problem?
I used System.Data.SQLite version 1.0.74.0 to create the tables and to run the tests I used SQLiteSpy 1.9.1.
Thanks,
Marc
If you haven't DELETEd any records, doing:
SELECT MAX(_ROWID_) FROM "table" LIMIT 1;
will avoid the full-table scan.
Note that _ROWID_ is a SQLite identifier.
From http://old.nabble.com/count(*)-slow-td869876.html
SQLite always does a full table scan for count(*). It does not keep meta information on tables to speed this process up.
Not keeping meta information is a deliberate design decision. If each table stored a count (or better, each node of the B-tree stored a count) then much more updating would have to occur on every INSERT or DELETE. This would slow down INSERT and DELETE, even in the common case where count(*) speed is unimportant.
If you really need a fast COUNT, then you can create a trigger on INSERT and DELETE that updates a running count in a separate table, then query that separate table to find the latest count.
Of course, it's not worth keeping a full row count if you need COUNTs dependent on WHERE clauses (e.g. WHERE field1 > 0 AND field2 < 1000000000).
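A minimal sketch of the trigger approach quoted above (the bookkeeping table and trigger names are made up):

CREATE TABLE row_counts (tbl TEXT PRIMARY KEY, cnt INTEGER NOT NULL);
INSERT INTO row_counts VALUES ('Table1', (SELECT COUNT(*) FROM Table1));   -- one slow seed count

CREATE TRIGGER trg_table1_ins AFTER INSERT ON Table1
BEGIN
  UPDATE row_counts SET cnt = cnt + 1 WHERE tbl = 'Table1';
END;

CREATE TRIGGER trg_table1_del AFTER DELETE ON Table1
BEGIN
  UPDATE row_counts SET cnt = cnt - 1 WHERE tbl = 'Table1';
END;

-- fast from now on:
SELECT cnt FROM row_counts WHERE tbl = 'Table1';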
This may not help much, but you can run the ANALYZE command to rebuild statistics for the entire database. Then run your query again and see if it is any faster.
Do not count the stars, count the records! Or in other words, never issue
SELECT COUNT(*) FROM tablename;
use
SELECT COUNT(ROWID) FROM tablename;
Call EXPLAIN QUERY PLAN for both to see the difference. Make sure you have an index in place containing all columns mentioned in the WHERE clause.
On the matter of the column constraint, SQLite maps columns that are declared to be INTEGER PRIMARY KEY to the internal row id (which in turn admits a number of internal optimizations). Theoretically, it could do the same for a separately-declared primary key constraint, but it appears not to do so in practice, at least with the version of SQLite in use. (System.Data.SQLite 1.0.74.0 corresponds to core SQLite 3.7.7.1. You might want to try re-checking your figures with 1.0.79.0; you shouldn't need to change your database to do that, just the library.)
The output for the fast queries all start with the text "QP: SEARCH". Whilst those for the slow queries start with text "QP: SCAN", which suggests that sqlite is performing a scan of the entire table in order to generate the count.
Googling for "sqlite table scan count" finds the following, which suggests that using a full table scan to retrieve a count is just the way sqlite works, and is therefore probably unavoidable.
As a workaround, and given that Status has only a handful of distinct values, I wondered if you could get a count quickly using a query like the following?
select 1 from Table1 where status=1
union
select 1 from Table1 where status=2
...
then count the rows in the result. This is clearly ugly, but it might work if it persuades sqlite to run the query as a search rather than a scan. The idea of returning "1" each time is to avoid the overhead of returning real data.
Here's a potential workaround to improve the query performance. From the context, it sounds like your query takes about a minute and a half to run.
Assuming you have a date_created column (or can add one), run a query in the background each day at midnight (say at 00:05am) and persist the value somewhere along with the last_updated date it was calculated (I'll come back to that in a bit).
Then, running against your date_created column (with an index), you can avoid a full table scan by doing a query like SELECT COUNT(*) FROM TABLE WHERE date_created > '[TODAY] 00:00:05'.
Add the count value from that query to your persisted value, and you have a reasonably fast count that's generally accurate.
The only catch is that from 12:05am to 12:07am (the duration during which your total count query is running) you have a race condition, which you can handle by checking the last_updated value of your full-table-scan count. If it's > 24 hours old, your incremental count query needs to pull a full day's count plus the time elapsed today. If it's < 24 hours old, your incremental count query only needs to pull a partial day's count (just the time elapsed today).
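A rough sketch of that scheme (table and column names are made up, and it assumes timestamps stored as ISO-8601 text so SQLite's date functions compare correctly):

CREATE TABLE count_snapshot (total INTEGER NOT NULL, last_updated TEXT NOT NULL);
INSERT INTO count_snapshot VALUES (0, datetime('now'));

-- nightly job, e.g. at 00:05:
UPDATE count_snapshot
SET    total = (SELECT COUNT(*) FROM Table1),
       last_updated = datetime('now');

-- fast approximate count during the day:
SELECT total + (SELECT COUNT(*) FROM Table1 WHERE date_created >= date('now'))
FROM   count_snapshot;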
I had the same problem; in my situation the VACUUM command helped. After running it on the database, COUNT(*) speed increased nearly 100 times. However, the command itself needs some minutes on my database (20 million records). I solved this by running VACUUM when my software exits, after the main window is destroyed, so the delay doesn't bother the user.

Oracle: Full text search with condition

I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But let's say we have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and, among others, the following two columns, A and B, both numeric. Let's say there are 500 different values of A and 2000 different values of B, and each row is unique.
Now let's consider select ... where A = x and B = y
Separate indexes on A and B will, as far as I can tell, do an index search on B, which returns 500 different rows, and then do a join/scan on these rows. In any case, at least 500 rows have to be looked at (barring the database getting lucky and finding the required row early).
An index on (A,B), on the other hand, is much more effective: it finds the one row in a single index search.
Putting separate indexes on group_id and the text, I feel, leaves the query planner with only the following options:
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).
Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similarly to a multi-column index. This seems to be the option (4) you're looking for:
begin
  ctx_ddl.create_index_set('my_table_index_set');
  ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/

create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
  parameters('index set my_table_index_set');

select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0;
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't guarantee this isn't an inappropriate use of these indexes, like #NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.
I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.
Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name      Null      Type
--------  --------  ---------
ID        NOT NULL  NUMBER(4)
GROUP_ID            NUMBER(4)
TEXT                CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID  COUNT  PCT
--------  -----  ---
       1      1    1
       2      2    1
       3      4    3
       4      8    6
       5     16   12
       6     32   24
       7     64   47
       8      9    7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, if you are retrieving no more than a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
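In case you want to inspect those plans yourself, the standard tooling applies (nothing here is specific to this example):

EXPLAIN PLAN FOR SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

EXPLAIN PLAN FOR SELECT * FROM my_table WHERE group_id = 7;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);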
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
select T.pkcol
from T
where group = 43
)
If group has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select *
from   T
inner join (
       select T.pkcol
       from   T
       where  group = 43
       ) MyGroup
  on   T.pkcol = MyGroup.pkcol
where  contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then apply the contains predicate only to the group 43 rows.