Postgres ignoring clustered index on date query

I have a large table against which I regularly run queries that filter on date_att > date '2001-01-01'. I'm trying to speed these queries up by clustering the table on date_att, but when I run them through EXPLAIN ANALYZE it still chooses to sequentially scan the table, even for queries as simple as SELECT date_att FROM table WHERE date_att > date '2001-01-01'. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index, but since the table is clustered by that attribute, shouldn't it be able to binary search very quickly to the point where date_att > '2001-01-01' and return all results after that? The query takes just as long as it did without the clustering.

It seems like you are confusing two concepts:
PostgreSQL clustering of a table
Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:
Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered.
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
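For reference, a minimal sketch of what that one-time operation looks like (the table and index names here are made up):
-- One-time physical reordering of the heap by date_att; ANALYZE then
-- refreshes planner statistics, including the physical-order correlation.
CREATE INDEX my_table_date_idx ON my_table (date_att);
CLUSTER my_table USING my_table_date_idx;
ANALYZE my_table;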
Clustering potentially (often) improves query speed for range queries because the selected rows happen to be stored physically close together in the heap table. But there is nothing that guarantees this order is preserved! Consequently the optimizer cannot assume that it holds.
E.g. if you insert a new row that fulfills your where clause, it might be inserted at any place in the table — e.g. where the rows for 1990 are stored. Hence, this assumption doesn't hold true:
but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that?
This brings us to the other concept you mentioned:
Clustered Indexes
This is something completely different and not supported by PostgreSQL at all, but it is available in many other databases (SQL Server, MySQL with InnoDB, and Oracle, where it is called an 'Index Organized Table').
In that case, the table data itself is stored in an index structure — there is no separate heap structure! Because it is an index, the order is also maintained on each insert/update/delete. Hence your assumption would hold true, and indeed I'd expect the databases mentioned above to behave as you expect (given that the date column is the clustering key!).
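To illustrate the contrast, here is a sketch in SQL Server syntax, where the clustered index defines the physical order of the table itself (table and column names are made up):
-- Rows live inside the index B-tree in date_att order, so a range
-- predicate can seek to the first match and read forward from there.
CREATE TABLE my_table (
    date_att date NOT NULL,
    payload  varchar(100) NULL
);
CREATE CLUSTERED INDEX cx_my_table_date ON my_table (date_att);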
Hope that clarifies it.

Related

PostgreSQL Index Isn't Used on Query

I have a PostgreSQL database with a table, holding millions of rows, that tracks the movement and fuel level of equipment.
To make queries faster, I created an index on the time column with this command:
create index gpsapi__index__time on gpsapi (time);
When I tried to run a simple query with EXPLAIN ANALYZE, like this:
EXPLAIN ANALYZE
SELECT *
FROM gpsapi g
WHERE g.time >= NOW() - '1 months'::interval;
it doesn't show that the query uses the index I created.
Do you know how to solve this? Thanks!
If you read the execution plan closely, you'll see that Postgres is telling you that out of about 6 million records, 5.5 million matched (> 90%). Based on its statistics, Postgres likely realized that the query would return a large percentage of the total records in the table, and that it would be faster to forgo the index and scan the entire table.
The concept to understand here is that, while the index you defined does let Postgres throw away non-matching records very quickly, it actually increases the time needed to look up the values for SELECT *. The reason is that, upon hitting a leaf node in the index, Postgres must then seek back to the heap table to fetch the remaining column values. When a query returns most of the table, it is faster to just scan the table directly.
That said, there is nothing inherently wrong with your index. If your query used a narrower range, or searched for a specific timestamp, such that the expected result set was sufficiently small, then Postgres would likely have used the index.
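To illustrate, here is a narrower variant of the same query; assuming only a small fraction of the rows fall within the last day, the planner would likely choose the index for this one:
EXPLAIN ANALYZE
SELECT *
FROM gpsapi g
WHERE g.time >= NOW() - '1 day'::interval;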

optimize query with column in where clause

I have an SQL query which fetches the first N rows of a table that is designed as a low-level queue.
select top N * from my_table where status = 0 order by date asc
The intention behind this query is as follows:
First, this question is intended to be database agnostic, as my implementation will support SQL Server, Oracle, DB2 and Sybase. The SQL syntax above of "top N" is just an example.
The table can contain millions of rows.
N is a relatively small number in comparison, e.g. 100.
status is 0 when the row is in the queue. Later it is changed to 1 to indicate that the row is being processed. After processing, the row is deleted. So it is expected that at least 90% of the rows in the table will have status 0.
rows in the table should be fetched according to their date, hence the order by clause.
What is the optimal index to make this query run fastest?
I initially thought the index should be on (date, status), but I am not sure about it anymore. Since the status column will contain mostly zeros, is there any added value in including it? Would it be sufficient to index by (date) alone?
Or maybe it should be (status, date)?
I don't think there is an efficient solution that will be RDBMS independent. For example, Oracle has bitmap indexes, SQL Server has filtered (partial) indexes, and I see no reason not to use them just because, for instance, MySQL or SQLite has nothing similar. Also, historically SQL Server implements clustered tables (IOTs in the Oracle world) much better than Oracle does, so having a clustered index on the date column may work perfectly for SQL Server, but not for Oracle.
I'd rather change the approach a bit. Even though roughly 90% of the rows satisfy the status=0 condition, you could refactor the schema and add a new, slim table (or materialized view) that holds only the records and columns the queue query needs. The number of new programmable objects required to keep that table up to date and merge data with the original table is relatively small, even if the RDBMS doesn't support materialized views directly. Also, if it's possible to redesign the underlying logic so rows are never updated, only inserted or deleted, that will help avoid lock contention, and as a result the whole system will perform better.
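A minimal sketch of that side-table idea, in SQL Server syntax to match the question's "top N" example (table and column names are made up; triggers or application code would have to keep it in sync with the main table):
-- Slim companion table holding only queued rows:
CREATE TABLE my_table_queue (
    id   int      NOT NULL PRIMARY KEY,  -- key of the row in the main table
    date datetime NOT NULL
);
CREATE INDEX ix_queue_date ON my_table_queue (date);

-- The hot query no longer needs to filter on status at all:
SELECT TOP 100 id, date
FROM my_table_queue
ORDER BY date ASC;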
Have a clustered index on date and a non-clustered index on status.
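If you keep everything in one table, a composite index matching both the filter and the sort is the portable baseline, with an engine-specific refinement where available (index names are made up):
-- Portable: equality column first, then the ORDER BY column.
CREATE INDEX ix_my_table_status_date ON my_table (status, date);

-- SQL Server refinement: a filtered index storing only queued rows.
-- CREATE INDEX ix_my_table_queue ON my_table (date) WHERE status = 0;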

How do I optimize this query?

I have a very specific query. I tried lots of approaches, but I couldn't reach the performance I want.
SELECT *
FROM
items
WHERE
user_id=1
AND
(item_start < 20000 AND item_end > 30000)
I created an index on user_id, item_start, item_end.
This didn't work, so I dropped all the indexes and created new ones:
user_id, (item_start, item_end)
also this didn't work.
(user_id, item_start and item_end are int)
edit: database is MySQL 5.1.44, engine is InnoDB
UPDATE: per your comment below, you need all the columns in the query (hence your SELECT *). If that's the case, you have a few options to maximize query performance:
create (or change) your clustered index to be on user_id, item_start, item_end. This will ensure that as few rows as possible are examined for each query. Per my original answer below, this approach may speed up this particular query but may slow down others, so you'll need to be careful.
if it's not practical to change your clustered index, you can create a non-clustered index on user_id, item_start, item_end and any other columns your query needs. This will slow down inserts somewhat, and will roughly double the storage required for your table, but it will speed up this particular query.
There are always other ways to increase performance (e.g. by reducing the size of each row) but the primary way is to decrease the number of rows which must be accessed and to increase the % of rows which are accessed sequentially rather than randomly. The indexing suggestions above do both.
ORIGINAL ANSWER BELOW:
Without knowing the exact schema or query plan, the main performance problem with this query is that SELECT * forces a lookup back to your clustered index for every matching row. If there are large numbers of matching rows for a particular user ID, and if your clustered index's first column is not user_id, then this will likely be a very inefficient operation, because your disk will be fetching lots of randomly distributed rows from the clustered index.
In other words, even though filtering the rows you want is fast (because of your index), actually fetching the data is slower.
If, however, your clustered index is ordered by user_id, item_start, item_end, then that should speed things up. Note that this is not a panacea: if you have other queries which depend on a different ordering, or if you're inserting rows in a different order, you could end up slowing down those queries.
A less drastic solution would be to create a covering index which contains only the columns you need (also ordered by user_id, item_start, item_end, with the other columns you need added after). Then change your query to pull back only the columns you need instead of using SELECT *.
If you post more info about the DBMS brand and version and the schema of your table, we can help with more details.
Do you need to SELECT *?
If not, you can create an index on user_id, item_start, item_end with the fields you need in the SELECT part as included columns. This is all assuming you're using Microsoft SQL Server 2005+.
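Since the question's edit mentions MySQL 5.1 with InnoDB, here is a rough sketch of the two options above in MySQL terms. InnoDB clusters rows by primary key, so changing the primary key is the InnoDB way to "change your clustered index"; the item_id column is an assumption, since an InnoDB primary key must be unique:
-- Option 1: lead the primary key with user_id so each user's rows are
-- stored physically together (assumes item_id exists to keep the key unique).
ALTER TABLE items
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (user_id, item_start, item_end, item_id);

-- Option 2: a secondary index that covers the query if you select only
-- these three columns instead of SELECT *.
CREATE INDEX ix_items_user_range ON items (user_id, item_start, item_end);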

Does an index on a unique field in a table allow a select count(*) to happen instantly? If not, why not?

I know just enough about SQL tuning to get myself in trouble. Today I was running EXPLAIN PLAN on a query and noticed it was not using indexes when I thought it probably should. Well, I kept running EXPLAIN on simpler and simpler (and, to my mind, more indexable) queries, until I ran EXPLAIN on
select count(*) from table_name
I thought for sure this would return instantly and that the explain would show use of an index, as we have many indexes on this table, including an index on the row_id column, which is unique. Yet the explain plan showed a FULL table scan, and it took several seconds to complete. (We have 3 million rows in this table).
Why would Oracle be doing a full table scan to count the rows in this table? I would like to think that since Oracle is already indexing unique fields, and has to track every insert and update on that table, it would be caching the row count somewhere. Even if it isn't, wouldn't it be faster to scan the entire index than to scan the entire table?
I have two theories. Theory one is that I am imagining how indexes work incorrectly. Theory two is that some setting or parameter somewhere in our Oracle setup is interfering with Oracle's ability to optimize queries (we are on Oracle 9i). Can anyone enlighten me?
Oracle does not cache COUNT(*).
MySQL with MyISAM does (it can afford to), because MyISAM is transactionless and the same COUNT(*) is visible to everyone.
Oracle is transactional, and a row deleted in another transaction may still be visible to your transaction.
Oracle has to scan the row, see that it's deleted, visit the UNDO, make sure the row is still in place from your transaction's point of view, and add it to the count.
Indexing a UNIQUE value differs from indexing a non-UNIQUE one only logically.
In fact, you can create a UNIQUE constraint over a column with a non-unique index defined, and the index will be used to enforce the constraint.
If a column is marked NOT NULL, then an INDEX FAST FULL SCAN over this column can be used for the COUNT.
It's a special access method, used when the index order is not important: it does not traverse the B-Tree, but instead just reads the index pages sequentially.
Since an index has fewer pages than the table itself, the COUNT can be faster with an INDEX_FFS than with a FULL (table scan).
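A sketch of what that looks like in Oracle syntax (the hint only suggests the access path; once statistics justify it, the optimizer picks it on its own):
-- Ask the optimizer to count via an index fast full scan
-- rather than a full table scan:
SELECT /*+ INDEX_FFS(t) */ COUNT(*)
FROM table_name t;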
It is certainly possible for Oracle to satisfy such a query with an index (specifically with an INDEX FAST FULL SCAN).
In order for the optimizer to choose that path, at least two things have to be true:
Oracle has to be certain that every row in the table is represented in the index -- basically, that there are no NULL entries that would be missing from the index. If you have a primary key this should be guaranteed.
Oracle has to calculate the cost of the index scan as lower than the cost of a table scan. I don't think it is safe to assume that an index scan is always cheaper.
Possibly, gathering statistics on the table would change the behavior.
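For example, a minimal statistics refresh (assuming the table lives in your own schema):
-- Gather fresh optimizer statistics so the cost comparison is realistic:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_NAME');
END;
/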
Expanding a little on the "transactions" reason. When a database supports transactions, at any point in time there might be records in different states, even in a "deleted" state. If a transaction fails, the states are rolled back.
A full table scan is done so that the current "version" of each record can be accessed for that point in time.
MySQL's MyISAM doesn't have this problem, since it uses table locking rather than the record locking required for transactions, and it caches the record count, so the count is always returned instantly. InnoDB under MySQL works the same way as Oracle, but returns an estimate.
You may be able to get a quicker count by counting the distinct values of the primary key; then only the index would be accessed.

Indexing nulls for fast searching on DB2

It's my understanding that nulls are not indexable in DB2. Assume we have a huge table (Sales) with a date column (sold_on) which normally holds a date but is occasionally (10% of the time) null.
Furthermore, let's assume that it's a legacy application that we can't change, so those nulls are staying there and mean something (let's say sales that were returned).
We can make the following query fast by putting an index on the sold_on and total columns
Select * from Sales
where
Sales.sold_on between date1 and date2
and Sales.total = 9.99
But an index won't make this query any faster:
Select * from Sales
where
Sales.sold_on is null
and Sales.total = 9.99
Because the indexing is done on the value.
Can I index nulls? Maybe by changing the index type, or by indexing an indicator column?
Where did you get the impression that DB2 doesn't index NULLs? I can't find anything in the documentation or in articles supporting that claim. And I just ran a query against a large table using an IS NULL restriction on an indexed column containing a small fraction of NULLs; DB2 certainly used the index (verified by an EXPLAIN, and by observing that the database responded instantly instead of spending time on a table scan).
So: I claim that DB2 has no problem with NULLs in non-primary key indexes.
But as others have written: your data may be composed in a way where DB2 concludes that using the index will not be quicker. Or the database's statistics aren't up to date for the involved table(s).
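If stale statistics are the suspect, refreshing them is cheap to try; a sketch as a DB2 command-line operation, with the schema name assumed:
-- Recollect table and index statistics, including value distribution:
RUNSTATS ON TABLE myschema.sales WITH DISTRIBUTION AND INDEXES ALL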
I'm no DB2 expert, but if 10% of your values are null, I don't think an index on that column alone will ever help this query. 10% is too many rows to bother using an index for -- it'll just do a table scan. If you were talking about 2-3%, I think it would actually use the index.
Think about how many records fit on a page/block -- say 20. The reason to use an index is to avoid fetching pages you don't need. The odds that a given page contains 0 null records are (90%)^20, or about 12%. Those aren't good odds -- you're going to need 88% of your pages fetched anyway, so using the index isn't very helpful.
If, however, your select clause only included a few columns (and not *) -- say just salesid -- you could probably get it to use an index on (sold_on, salesid), as the read of the data page wouldn't be needed: all the data would be in the index.
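A sketch of that idea, using the column names from the question (salesid is assumed to be the key column):
-- The index alone answers the query, so no data pages are fetched:
CREATE INDEX ix_sales_sold_on_salesid ON Sales (sold_on, salesid);

SELECT salesid
FROM Sales
WHERE sold_on IS NULL;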
The rule of thumb is that an index is useful for predicates that match up to about 15% of the records... so an index might be useful here.
If DB2 won't index nulls, then I would suggest adding a boolean field, IsSold, and set it to true whenever the sold_on date gets set (this could be done in a trigger).
That's not the nicest solution, but it might be what you need.
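A minimal sketch of that workaround in DB2 syntax (the column, trigger, and index names are made up, and a similar trigger would be needed for UPDATEs of sold_on):
-- Indicator column that a regular index can discriminate on:
ALTER TABLE Sales ADD COLUMN is_sold SMALLINT NOT NULL DEFAULT 0;

-- Keep it in sync as rows are inserted:
CREATE TRIGGER sales_is_sold_ins
  NO CASCADE BEFORE INSERT ON Sales
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
  SET n.is_sold = CASE WHEN n.sold_on IS NULL THEN 0 ELSE 1 END;

-- The "returned sales" query can then use an ordinary index:
CREATE INDEX ix_sales_is_sold_total ON Sales (is_sold, total);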
Troels is correct; even rows with a SOLD_ON value of NULL will benefit from an index on that column. If you're doing ranged searches on SOLD_ON, you may benefit even more by creating a clustered index that begins with SOLD_ON. In this particular example, it may not require much additional overhead to maintain the clustering order based on SOLD_ON, since newer rows added will most likely have a newer SOLD_ON date.