PostgreSQL Index Isn't Used on Query - sql

So I have a PostgreSQL database with a table that tracks the movement and fuel level of equipment, with millions of rows.
To make queries faster, I created an index on the time column with this command:
create index gpsapi__index__time on gpsapi (time);
When I try to run a simple query with EXPLAIN ANALYZE, like this:
EXPLAIN ANALYZE
SELECT *
FROM gpsapi g
WHERE g.time >= NOW() - '1 months'::interval;
it doesn't show that the query uses the index I created
Do you know how to solve this? Thanks!

If you read the execution plan closely, you'll see that Postgres is telling you that out of about 6 million records, 5.5 million matched (> 90%). Based on its statistics, Postgres likely realized that it would be returning a large percentage of the total records in the table, and that it would be faster to just forgo the index and scan the entire table.
The concept to understand here is that, while the index you defined does potentially let Postgres throw away non-matching records very quickly, it actually increases the time needed to look up the values in SELECT *. The reason for this is that, upon hitting a leaf node in the index, Postgres must then seek back to the heap table to fetch the remaining column values. If your query returns most of the table, it is faster to just scan the table directly.
That being said, there is nothing inherently wrong with your index. If your query used a narrower range, or searched for a specific timestamp, such that the expected result set were sufficiently small, then Postgres likely would have used the index.
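For example (a hypothetical narrower query against the same table; the exact plan will depend on your data and statistics), a one-day window should be selective enough for the planner to use the index:
EXPLAIN ANALYZE
SELECT *
FROM gpsapi g
WHERE g.time >= NOW() - '1 day'::interval;
-- with only a small fraction of rows matching, the plan should show an
-- Index Scan (or Bitmap Index Scan) on gpsapi__index__time instead of a Seq Scan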

Related

PostgreSQL EXPLAIN: How do I see a plan AS IF certain tables had millions of rows?

This is a question about PostgreSQL's EXPLAIN command. This command shows you how the optimizer will execute your SQL based on the data in your tables. We are not in prod yet, so all of our tables have ~100 rows or less. Is there a way to get EXPLAIN to tell me what the explain plan would look like if certain tables had millions of rows instead of tens of rows?
I could generate the data somehow, but then I'd have to clear it and wait for it to be created. If that's the only way, I'll accept that as an answer, though.
I don't think so. PostgreSQL collects statistics about each table that the optimizer uses to choose the best plan. These statistics are not simply about how many rows a table contains; they depend on the values/data in it too.
From the postgres documentation:
the query planner needs to estimate the number of rows retrieved by a
query in order to make good choices of query plans.
What does that mean? Suppose we have an indexed column called foo, without a unique constraint, and you have the following simple query:
SELECT * FROM test_table WHERE foo = 5
PostgreSQL will have to choose between different scan types:
sequential scan
index scan
bitmap scan
It will choose the type of scan based on how many rows it expects the query to retrieve. How does it know how many rows will be retrieved before running the query? From the statistics that it collects. These statistics are based on the VALUES/DATA inside your table. Suppose you have a table with 1 million rows and 90% of them have foo = 5. PostgreSQL may well know that, because it has collected statistics about the distribution of your data. So it will choose a sequential scan, because according to its cost model that scan is the cheapest one.
In the end, it would not be enough to just generate data; you should generate values that represent the reality of the data you will have in the future.
You can build your database with indexes already in place (based on the queries you will run) so that you start with good performance in production. If that turns out not to be enough, you will have to tune your indexes after you go into production.
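As a rough sketch of the idea (the table, index, and data distribution here are made up for illustration), you can generate skewed test data with generate_series, ANALYZE it, and watch the planner pick different scans for common vs. rare values:
CREATE TABLE test_table (foo integer);
CREATE INDEX test_table_foo_idx ON test_table (foo);
-- roughly 90% of rows get foo = 5, the rest get scattered values
INSERT INTO test_table
SELECT CASE WHEN random() < 0.9 THEN 5 ELSE (random() * 1000)::int END
FROM generate_series(1, 1000000);
ANALYZE test_table;  -- refresh the statistics the planner relies on
EXPLAIN SELECT * FROM test_table WHERE foo = 5;    -- likely a sequential scan
EXPLAIN SELECT * FROM test_table WHERE foo = 123;  -- likely an index/bitmap scan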

Postgres ignoring clustered index on date query

I have a large table against which I regularly run queries like SELECT date_att FROM table WHERE date_att > date '2001-01-01'. I'm trying to increase the speed of these queries by clustering the table on date_att, but when I run them through EXPLAIN ANALYZE, Postgres still chooses to sequentially scan the table, even on queries that simple. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index, but since the table is clustered by that attribute, shouldn't it be able to quickly binary-search through the table to the point where date_att > '2001-01-01' and return all results after that? This query still takes as much time as it did without the clustering.
It seems like you are confusing two concepts:
PostgreSQL clustering of a table
Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:
Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered.
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
Clustering potentially (often) improves query speed for range queries, but only because the selected rows happen to be stored near each other in the heap table after the CLUSTER operation. There is nothing that guarantees this order! Consequently the optimizer cannot assume that it holds.
E.g. if you insert a new row that fulfills your WHERE clause, it might be inserted at any place in the table, for example where the rows for 1990 are stored. Hence, this assumption doesn't hold true:
but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that?
This brings us to the other concept you mentioned:
Clustered Indexes
This is something completely different, not supported by PostgreSQL at all but by many other databases (SQL Server, MySQL with InnoDB, and also Oracle, where it is called an 'Index Organized Table').
In that case, the table data itself is stored in an index structure; there is no separate heap structure! As it is an index, the order is also maintained on each insert/update/delete. Hence your assumption would hold true, and indeed I'd expect the above-mentioned databases to behave as you would expect (given that the date column is the clustering key!).
Hope that clarifies it.
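For reference, a minimal sketch of the PostgreSQL side (the table and index names are assumptions):
-- one-time physical reordering of the heap to match the index order
CREATE INDEX my_table_date_att_idx ON my_table (date_att);
CLUSTER my_table USING my_table_date_att_idx;
ANALYZE my_table;
-- later inserts/updates are NOT kept in this order; to restore it,
-- re-run CLUSTER (it remembers the index from the previous run)
CLUSTER my_table;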

Optimize query on timestamp column in Postgresql 8.x

Let's suppose you have an orders table; this table contains a timestamp column indicating the creation time of the orders. A normal query would be to obtain the orders between two dates. Does anybody know how to optimize this query? Creating an index on the timestamp column has no effect, as shown by EXPLAIN ANALYZE.
Usually indexes are used, but only if the table is properly analyzed (VACUUM ANALYZE or just ANALYZE), and if the table size is large enough that index scans are faster than sequential scans.
An index should work. I suspect it's not working for you because your table is either tiny (PostgreSQL almost never uses indices for tiny tables), or you haven't run ANALYZE on it.
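A minimal sketch of that advice (the orders table and created_at column are assumptions based on the question):
CREATE INDEX orders_created_at_idx ON orders (created_at);
ANALYZE orders;  -- make sure the planner has fresh statistics
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE created_at BETWEEN '2009-01-01' AND '2009-01-31';
-- on a sufficiently large table with a selective range, the plan should
-- show an index or bitmap scan on orders_created_at_idx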

Will I save any time on an INDEX that SELECTs only once?

Using DBD::SQLite with SQLite3:
If I am going to run a SELECT only once, should I CREATE an INDEX first and then run the SELECT, or just run the SELECT without an INDEX? Which is faster?
If it needs to be specified: the column to be indexed is an INTEGER that is either undef or 1, just these 2 possibilities.
Building an index takes longer than just doing a table scan. So, if your single query — which you're only running once — is just a table scan, adding an index will be slower.
However, if your single query is not just a table scan, adding the index may be faster. For example, without an index the database may implement a join as repeated table scans, one for each joined row; in that case building the index first would probably be faster.
I'd say to benchmark it, but that sounds silly for a one-off query that you're only ever going to run once.
If you are considering an index on a column that only has two possible values, it's not worth the effort, as the index will give very little improvement. Indexes are useful on columns that have a high degree of uniqueness and are frequently queried for a certain value or range. On the other hand, indexes make inserting and updating slower, so in this case you should skip it.
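If you want to see what SQLite would actually do in either case (the table and column names here are hypothetical), EXPLAIN QUERY PLAN shows whether an index would be used at all:
CREATE TABLE items (flag INTEGER);  -- values are undef/NULL or 1
-- without an index: a full table scan
EXPLAIN QUERY PLAN SELECT * FROM items WHERE flag = 1;
-- typically reports something like: SCAN items
CREATE INDEX items_flag_idx ON items (flag);
-- with the index: SQLite may search the index instead
EXPLAIN QUERY PLAN SELECT * FROM items WHERE flag = 1;
-- typically reports something like: SEARCH items USING INDEX items_flag_idx (flag=?)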

Does an index on a unique field in a table allow a select count(*) to happen instantly? If not why not?

I know just enough about SQL tuning to get myself in trouble. Today I was doing EXPLAIN plan on a query and I noticed it was not using indexes when I thought it probably should. Well, I kept doing EXPLAIN on simpler and simpler (and more indexable in my mind) queries, until I did EXPLAIN on
select count(*) from table_name
I thought for sure this would return instantly and that the explain would show use of an index, as we have many indexes on this table, including an index on the row_id column, which is unique. Yet the explain plan showed a FULL table scan, and it took several seconds to complete. (We have 3 million rows in this table).
Why would Oracle be doing a full table scan to count the rows in this table? I would like to think that since Oracle is indexing unique fields already, and having to track every insert and update on that table, it would be caching the row count somewhere. Even if it's not, wouldn't it be faster to scan the entire index than to scan the entire table?
I have two theories. Theory one is that I am imagining how indexes work incorrectly. Theory two is that some setting or parameter somewhere in our Oracle setup is messing with Oracle's ability to optimize queries (we are on Oracle 9i). Can anyone enlighten me?
Oracle does not cache COUNT(*).
MySQL with MyISAM does (it can afford to), because MyISAM is transactionless and the same COUNT(*) is visible to everyone.
Oracle is transactional, and a row deleted in another transaction is still visible to your transaction.
Oracle has to scan it, see that it's deleted, visit the UNDO, make sure the row is still in place from your transaction's point of view, and add it to the count.
Indexing a UNIQUE value differs from indexing a non-UNIQUE one only logically.
In fact, you can create a UNIQUE constraint over a column with a non-unique index defined, and the index will be used to enforce the constraint.
If a column is marked as NOT NULL, then an INDEX FAST FULL SCAN over this column's index can be used for the COUNT.
It's a special access method, used for cases when the index order is not important. It does not traverse the B-Tree, but instead just reads the pages sequentially.
Since an index has fewer pages than the table itself, the COUNT can be faster with an INDEX_FFS than with a FULL TABLE SCAN.
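A minimal sketch of how you might check this in Oracle (the table, index, and column names are made up; the hint only suggests the access path, it does not force it):
-- assuming row_id is the NOT NULL primary key with index table_name_pk
SELECT /*+ INDEX_FFS(t table_name_pk) */ COUNT(*)
FROM table_name t;
-- refreshing statistics often changes what the optimizer picks on its own
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'TABLE_NAME');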
It is certainly possible for Oracle to satisfy such a query with an index (specifically with an INDEX FAST FULL SCAN).
In order for the optimizer to choose that path, at least two things have to be true:
Oracle has to be certain that every row in the table is represented in the index -- basically, that there are no NULL entries that would be missing from the index. If you have a primary key this should be guaranteed.
Oracle has to calculate the cost of the index scan as lower than the cost of a table scan. I don't think it is necessarily true that an index scan is always cheaper.
Possibly, gathering statistics on the table would change the behavior.
Expanding a little on the "transactions" reason. When a database supports transactions, at any point in time there might be records in different states, even in a "deleted" state. If a transaction fails, the states are rolled back.
A full table scan is done so that the current "version" of each record can be accessed for that point in time.
MySQL MyISAM doesn't have this problem, since it uses table locking instead of the record locking required for transactions, and it caches the record count, so COUNT(*) is always returned instantly. InnoDB under MySQL works the same way as Oracle, but returns an "estimate".
You may be able to get a quicker query by counting the distinct values of the primary key; then only the index would need to be accessed.
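That suggestion would look something like this (row_id as the primary key is taken from the question); whether it actually beats a plain COUNT(*) depends on what the optimizer does with each:
SELECT COUNT(DISTINCT row_id) FROM table_name;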