PostgreSQL query without WHERE, only ORDER BY and LIMIT, doesn't use index

I have a table that contains an 'id' column of type BIGSERIAL. I also have an index for this one column (sort order descending, BTREE, unique).
I often need to retrieve the last 10, 20, 30 entries from a table of millions of entries, like this:
SELECT * FROM table ORDER BY id DESC LIMIT 10
I would have thought it's a pretty clear case: there's an index for this particular field, the sort order matches, and I only need 10 entries out of millions in the whole table, so this query should definitely use an index scan.
But it doesn't; it does a sequential scan over the whole table.
I tried to dig deeper but didn't find anything unusual. The Postgres doc at https://www.postgresql.org/docs/9.6/static/indexes-ordering.html says:
An important special case is ORDER BY in combination with LIMIT n: an explicit sort will have to process all the data to identify the first n rows, but if there is an index matching the ORDER BY, the first n rows can be retrieved directly, without scanning the remainder at all.
But it still doesn't work. Does anybody have any pointers for me? Maybe I'm just not seeing the forest for the trees anymore... :-(
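(Which plan the planner actually picks can be confirmed with EXPLAIN; a minimal check, keeping the placeholder table name from the query above:)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table ORDER BY id DESC LIMIT 10;
-- a Seq Scan plus Sort node here, instead of an Index Scan, confirms the problem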

Ok, saying it out loud and trying to gather more information to put into my question apparently made me see the forest again; I found the actual problem. Further down in the doc I mentioned above is this sentence:
An index stored in ascending order with nulls first can satisfy either ORDER BY x ASC NULLS FIRST or ORDER BY x DESC NULLS LAST depending on which direction it is scanned in.
This was the problem: I specified the sort order in the index, but I ignored NULLS FIRST vs. NULLS LAST.
For descending order, the Postgres default is NULLS FIRST if you don't mention it explicitly in your query. So what Postgres had to satisfy was the combination ORDER BY id DESC NULLS FIRST, which wasn't covered by my index (which had evidently been created with NULLS LAST). The combination of both the sort order and the NULLS placement is what matters.
The 2 possible solutions:
Either mention NULLS FIRST/LAST accordingly in the query so that it matches the index
...or change the index to NULLS FIRST (which is what I did)
Now Postgres is doing a proper index scan and only touches 10 elements during the query, not all of them.
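A minimal sketch of both fixes (the index name is hypothetical; table is the placeholder name from above):
-- option 1: make the query match the existing DESC NULLS LAST index:
SELECT * FROM table ORDER BY id DESC NULLS LAST LIMIT 10;
-- option 2: recreate the index so it matches the query's implicit DESC NULLS FIRST:
DROP INDEX table_id_idx;
CREATE UNIQUE INDEX table_id_idx ON table (id DESC NULLS FIRST);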

If you need to get the last 10 entries in the table, you can use this:
SELECT *
FROM table
WHERE id > (SELECT MAX(id) FROM table) - 10
ORDER BY id DESC
And similarly for 20 and 30 entries.
It doesn't look as clean, but it works fast as long as you have an index on the 'id' column. (Note that id > MAX(id) - 10 spans exactly the last 10 id values; since a BIGSERIAL can leave gaps in the sequence, fewer than 10 rows may come back.)

Related

Strange behavior when doing WHERE and ORDER BY queries in Postgres

Background: a large table (50M+ rows); every column used in the query is indexed.
when I do a query like this:
select * from table where A=? order by id DESC limit 10;
In the statement, A and id are both indexed.
Now confusing things happen:
the more rows the WHERE clause returns, the less time the whole query takes
the fewer rows the WHERE clause returns, the more time it takes
I have a guess here: Postgres does the ORDER BY first, and then the WHERE, so it costs more time to find 10 rows in the ordered index when the target rowset is small (like finding 10 particular grains of sand on a beach); in the opposite case, if the target rowset is large, it's easy to find the first 10.
Is that right? Or is there some other reason for this?
Final question: How to optimize this situation?
It can either use the index on A to apply the selectivity, then sort on "id" and apply the limit. Or it can read the rows already in order using the index on "id", then filter out the ones that don't meet the A condition until it finds 10 that do. It will choose whichever it thinks is faster, and sometimes it makes the wrong choice.
If you had a multi-column index on (A, id), it could use that one index to do both things at the same time: get the selectivity on A and still fetch the rows already in order by "id".
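A minimal sketch of that index (the index name is hypothetical; A, id and the parameter placeholder are from the question, with my_table standing in for the real table name):
CREATE INDEX my_table_a_id_idx ON my_table (A, id DESC);
-- the planner can now descend straight to A = ? and read the matching rows
-- already in id DESC order, stopping after 10:
SELECT * FROM my_table WHERE A = ? ORDER BY id DESC LIMIT 10;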
Do you know pgAdmin? With "explain verbose" before your statement, you can check how the query is executed (meaning the order of the operators). Usually the filter happens first, and only afterwards the sorting...

hsqldb: do I have to ORDER BY to ensure consistent selection order?

I have a table containing samples. The inserted samples are already naturally ordered by the timestamp.
My question is this - when I SELECT from the table do I have to use the ORDER BY clause to ensure the fetched samples are ordered by the timestamp?
Rows in a relational database are NOT sorted (picture them as balls in a basket: which one is the "first"?).
The only way (really, the only one) to get a consistently sorted result is to use ORDER BY.
You cannot rely on side effects of joins, GROUP BY, UNION, index retrieval or similar operators. They will never guarantee an order. The DBMS is free to return the rows in whatever order it thinks is the fastest unless you specify an ORDER BY.
If an HSQLDB table T has a column C as primary key, or has any index on that column,
SELECT * FROM T ORDER BY C
will return ordered rows without extra ORDER BY processing.
If there is a condition on the select, which uses an index on a different column, you can still force the use of the index for ORDER BY:
SELECT * FROM T WHERE <some condition> ORDER BY C USING INDEX
But in this case, you should only use USING INDEX if most of the rows of the table will be returned. Otherwise it is better to let the engine use the other index to reduce the table scan time.
USING INDEX is ignored if there is no index to use for ORDER BY.
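A self-contained sketch of the two cases (table, column and index names are hypothetical; USING INDEX, as described above, is an HSQLDB extension):
CREATE TABLE samples (id BIGINT PRIMARY KEY, ts TIMESTAMP, val DOUBLE);
CREATE INDEX samples_ts_idx ON samples (ts);
CREATE INDEX samples_val_idx ON samples (val);
-- ordered rows come straight off the ts index, with no extra sort step:
SELECT * FROM samples ORDER BY ts;
-- force the ts index for ordering even though the WHERE could use the val index:
SELECT * FROM samples WHERE val > 0 ORDER BY ts USING INDEX;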

Two questions on PostgreSQL performance

1) What is the best way to implement paging in PostgreSQL?
Assume we need to implement paging. The simplest query is select * from MY_TABLE order by date_field DESC limit 10 offset 20. As far as I understand, we have 2 problems here: if the dates have duplicate values, every run of this query may return different results, and the larger the offset value, the longer the query runs. We have to provide an additional column, date_field_index:
--date_field--date_field_index--
12-01-2012 1
12-01-2012 2
14-01-2012 1
16-01-2012 1
--------------------------------
Now we can write something like
create index MY_INDEX on MY_TABLE (date_field, date_field_index);
select * from MY_TABLE where date_field <= last_page_date and not (date_field_index >= last_page_date_index and date_field = last_page_date) order by date_field DESC, date_field_index DESC limit 20;
...thus using the WHERE clause and the corresponding index instead of OFFSET. OK, now the questions:
1) is this the best way to improve the initial query?
2) how can we populate that date_field_index field? Do we have to provide a trigger for this?
3) We should not use row_number() in Postgres because it does not use indexes and is thus very slow. Is that correct?
2) Why does column order in a concatenated index not affect the performance of the query?
My measurements show that when searching using a concatenated index (an index consisting of 2 or more columns) there is no difference whether we place the most selective column first or last. Why? If we place the most selective column first, we run through a shorter range of the found rows, which should have an impact on performance. Am I right?
Use the primary key as the tie-breaker instead of the date_field_index column. Otherwise, explain why that is not an option.
order by date_field DESC, "primary_key_column(s)" DESC
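A keyset-pagination sketch of that suggestion (assuming the primary key column is named id; following the question's convention, last_page_date and last_page_id stand for the values from the last row of the previous page):
create index MY_KEYSET_INDEX on MY_TABLE (date_field DESC, id DESC);
-- next page: everything strictly after the last row of the previous page:
select * from MY_TABLE
where (date_field, id) < (last_page_date, last_page_id)
order by date_field DESC, id DESC
limit 10;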
The combined index with the most selective column first is the best performer, but it will not be used if:
the matched rows are more than a few percent of the table
there aren't enough rows to make it worthwhile
the range of dates is not small enough
What is the output of EXPLAIN my_query?

how to reverse mysql table

I need to display what the table contains from the freshest data to the oldest. Something like this doesn't work:
SELECT * FROM table ORDER BY DESC;. I know it's because after ORDER BY there should be the name of a column. But I want to just reverse the normal order (by normal I mean from the oldest to the freshest data). How do I do this?
In your query, DESC stands for descending; the reverse is ascending, or:
SELECT * FROM table ORDER BY column ASC;
By the way, if you do not specify a column, what you call "normal order" is really arbitrary; no ordering is guaranteed unless you specify one.
The "normal order" is not always from oldest to freshest, since some records may be deleted and then these are replaced with the new ones. It means that the "natural order" may appear to be somewhat "random" with the freshest items being in the middle of the dataset.
You need to add a column for an insertion date or an incrementing key.
You can't rely on the physical storage pattern to give you a correct ordering. According to the MySQL documentation, there is no guarantee that rows returned will be in any particular order.
"...the result rows are displayed in no
particular order. It is often easier
to examine query output when the rows
are sorted in some meaningful way. To
sort a result, use an ORDER BY clause."
http://dev.mysql.com/doc/refman/5.0/en/sorting-rows.html
You could create a new field of type TIMESTAMP with a default value of CURRENT_TIMESTAMP, then ORDER BY on that field.
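A sketch of that (table and column names are hypothetical):
ALTER TABLE my_table ADD COLUMN created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
-- note: existing rows all receive the same timestamp when the column is added;
-- only rows inserted afterwards are distinguished by it
SELECT * FROM my_table ORDER BY created_at DESC;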
The physical ordering of the records in a table is not guaranteed to match the sequence in which they were created. To sort that way, you will need to find or create a field you can sort on: a 'create date' field, or perhaps an id value that increases as new records are added (like an order id or something).
If you have auto-incremented IDs, then maybe:
order by id desc ? :)

Slow MySQL retrieval of the row with the largest value in indexed column

I have a SQL table readings something like:
id int
client_id int
device_id int
unique index(client_id, device_id)
I do not understand why the following query is so slow:
SELECT client_id FROM `readings` WHERE device_id = 10 ORDER BY client_id DESC LIMIT 1
My understanding of the index is that MySQL keeps an ordered list (one property of a B-tree) of each row in the table, sorted first by client_id and then by device_id. When I execute an EXPLAIN on this query, it says that it will use the index, but that it will need to look at every row. This makes sense, since in the worst case there may be only one row with device_id = 10, and that may also be the row with the smallest client_id, and thus at the end of its search. However, in practice this is not true: my table has ~10 million rows, and rows with device_id = 10 are spread fairly evenly throughout. Why then doesn't MySQL start at the end of the index, scan until it finds the first row with device_id = 10, then stop and return that value? It does not seem possible that this is what is happening, since the query takes ~30 seconds to execute.
Is it that my unique key is somehow implemented as a hash, and thus not accessible in list form? phpMyAdmin tells me that it is implemented as a B-tree, which makes me think it should be able to do the scan I mentioned above and quit at the first instance.
Where is my error and how can I make this query execute more quickly?
Thanks
Try switching the column order in the index:
unique index(device_id, client_id)
Since you are filtering on device_id, you would want that to be the first column in the index.
First, I'm assuming that you have good statistics for this table. If not, you'll want to ANALYZE the table to make sure the optimizer can figure out the best option.
Here's another approach you could try that might work better. It could be that MySQL is not understanding your intent well enough to optimize correctly:
SELECT MAX(client_id) from readings where device_id = 10
Otherwise you could modify the index to be by device_id first, then client_id, or you could add another index on just device_id.
You have a compound index on (client_id, device_id); these columns are (more or less) concatenated for the purpose of indexing, and the index will only be considered if you filter on the first of the columns. Your query is filtering on device_id, which is the last of them; you could provide a separate index on that column, or swap the columns around in the index.
Also, check the output of EXPLAIN on your queries.
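A sketch of those suggestions (the index names are hypothetical):
-- either add a separate index on device_id alone:
CREATE INDEX readings_device_idx ON readings (device_id);
-- or add a unique index with the columns swapped, so device_id leads:
CREATE UNIQUE INDEX readings_device_client_idx ON readings (device_id, client_id);
-- then re-check the plan; with device_id leading, the ORDER BY ... LIMIT 1
-- becomes a short reverse scan of the device_id = 10 range:
EXPLAIN SELECT client_id FROM readings WHERE device_id = 10 ORDER BY client_id DESC LIMIT 1;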