Two questions on PostgreSQL performance

1) What is the best way to implement paging in PostgreSQL?
Assume we need to implement paging. The simplest query is select * from MY_TABLE order by date_field DESC limit 10 offset 20. As far as I understand, we have two problems here: if the dates contain duplicate values, every run of this query may return different results, and the larger the offset value is, the longer the query runs. So we have to provide an additional column, date_field_index:
--date_field--date_field_index--
12-01-2012 1
12-01-2012 2
14-01-2012 1
16-01-2012 1
--------------------------------
Now we can write something like
create index MY_INDEX on MY_TABLE (date_field, date_field_index);
select * from MY_TABLE where date_field <= last_page_date and not (date_field_index >= last_page_date_index and date_field = last_page_date) order by date_field DESC, date_field_index DESC limit 20;
...thus using the where clause and the corresponding index instead of offset. OK, now the questions:
1) Is this the best way to improve the initial query?
2) How can we populate that date_field_index field? Do we have to provide a trigger for this?
3) We should not use the row_number() window function in Postgres because it does not use indexes and is therefore very slow. Is that correct?
2) Why does column order in a concatenated index not affect query performance?
My measurements show that when searching with a concatenated index (an index consisting of 2 or more columns) there is no difference whether we place the most selective column first or last. Why? If we place the most selective column first, we run through a shorter range of the found rows, which should have an impact on performance. Am I right?

Use the primary key to break ties instead of the date_field_index column. Otherwise explain why that is not an option.
order by date_field DESC, "primary_key_column(s)" DESC
The combined index with the most unique column first is the best performer, but it will not be used if:
the distinct values are more than a few percent of the table
there aren't enough rows to make it worth it
the range of dates is not small enough
What is the output of explain my_query?
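A minimal sketch of that keyset approach, assuming the table has a primary key column id (last_page_date and last_page_id below are placeholders for the values of the last row on the previous page, not names from the original question):
-- plain composite index; Postgres can scan a btree backwards to produce the DESC, DESC order
create index my_table_date_id_idx on MY_TABLE (date_field, id);
-- next page: the row-value comparison seeks directly past the last row already shown,
-- so no offset is needed and ties on date_field are broken by id
select *
from MY_TABLE
where (date_field, id) < (last_page_date, last_page_id)
order by date_field DESC, id DESC
limit 10;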

Related

Fastest PostgreSQL query for field length?

If I need to find the maximum length of a few fields stored as numeric (i.e. variable-length numbers) in PostgreSQL, so my team can build a fixed-width file layout, and the length isn't in the metadata, is there a faster way to get that info than either
select field
from table
where field is not null
order by field desc
limit 1;
or
select max(field)
from table;
?
The tables these fields are in have tens of millions of rows, so these queries are taking quite a while. I'm a decent PostgreSQL user, but optimizing for efficiency has never been my strong suit - I don't usually work with such large datasets. Any help is appreciated, even if this is a dumb question!
Your queries look fine. The where clause is not needed in the first query; it can be written as:
select myfield from mytable order by myfield desc nulls last limit 1;
Then, for performance, consider the following index:
create index myidx on mytable(myfield desc nulls last);
Actually Postgres should be able to read the index backwards, so this should be just as good:
create index myidx on mytable(myfield);
With any of these indexes in place, the database should be able to execute the whole query by looking at the index only, which should be very efficient.
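A minimal sketch putting that together, using the mytable/myfield placeholders from above (the exact plan you get depends on your data and Postgres version):
-- plain btree index; Postgres can also read it backwards for the DESC query
create index myidx on mytable (myfield);
-- either form should now be answered from one end of the index
select max(myfield) from mytable;
select myfield from mytable order by myfield desc nulls last limit 1;
-- verify that the planner chooses an index (or index-only) scan
explain select max(myfield) from mytable;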

Index scan for multicolumn comparison - non-uniform index column ordering

This question is closely related to Enforcing index scan for multicolumn comparison
The solution there is perfect, but it seems to work only if all index columns have the same ordering. This question is different because column b is descending here, and that fact prevents using the row syntax to solve the same problem. This is why I'm looking for another solution.
Suppose an index is built on 3 columns (a ASC, b DESC, c ASC). I want Postgres to:
find key [a=10, b=20, c=30] in that B-tree,
scan next 10 entries and return them.
If the index has only one column the solution is obvious:
select * from table1 where a >= 10 order by a limit 10
But if there are more columns the solution becomes much more complex. For 3 columns:
select * from table1
where a > 10 or (a = 10 and (b < 20 or b = 20 and c >= 30))
order by a, b DESC, c
limit 10;
How can I tell Postgres that I want this operation?
And can I be sure that even for those complex queries on 2+ columns the optimizer will always understand that it should perform a range scan? Why?
PostgreSQL implements tuple comparison very thoroughly (unlike the partial implementations found in Oracle, DB2, SQL Server, etc.). You can write your condition using "tuple inequality", as in:
select *
from table1
where (a, -b, c) >= (10, -20, 30)
order by a, -b, c
limit 10
Please note that since the second column is in descending order, you must "invert" its value during the comparison. That's why it's expressed as -b and also -20. This can be tricky for non-numeric columns such as dates, varchars, LOBs, etc.
Finally, the use of an index is still possible with the -b column value if you create an ad-hoc index, such as:
create index ix1 on table1 (a, (-b), c);
However, you can never force PostgreSQL to use an index. SQL is a declarative language, not an imperative one. You can entice it to do it by keeping table stats up to date, and also by selecting a small number of rows. If your LIMIT is too big, PostgreSQL may be inclined to use a full table scan instead.
Strictly speaking, your index on (a ASC, b DESC, c ASC) can still be used, but only based on the leading expression a. See:
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
Its usefulness is limited and Postgres will only use it if the predicate on a alone is selective enough (less than roughly 5% of all rows have a >= 10). (Or possibly to profit from index-only scans where possible.) But all index tuples qualifying on a alone have to be read, and you will see a Filter step in the query plan to discard non-qualifying rows - both adding additional cost. An index on just (a) typically does a better job as it's smaller and cheaper to maintain.
I have tried and failed in the past to make full use of an index with non-uniform sort order (ASC | DESC) like you display for ROW value comparison. I am pretty certain it's not possible. Think about it: Postgres compares whole row values, which can either be greater or smaller, but not both at the same time.
There are workarounds for data types with a defined negator (like - for numeric data types). See the solution provided by "The Impaler"! The trick is to invert the values and wrap them in an expression index to get uniform sort order for all index expressions after all - which is currently the only way to tap into the full potential of row comparison. Be sure to make both the WHERE condition and the ORDER BY match the special index.
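Putting the two answers together, a minimal sketch with the same table1/a/b/c names as above (whether the index is actually chosen still depends on statistics and the LIMIT, as noted earlier):
-- expression index with uniform ascending order after negating b
create index ix1 on table1 (a, (-b), c);
-- both the row comparison and the ORDER BY must use the same -b expression,
-- otherwise the planner cannot match them against the index
select *
from table1
where (a, -b, c) >= (10, -20, 30)
order by a, -b, c
limit 10;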

PostgreSQL Query without WHERE only ORDER BY and LIMIT doesn't use index

I have a table that contains an 'id' column of type BIGSERIAL. I also have an index for this one column (sort order descending, BTREE, unique).
I often need to retrieve the last 10, 20, 30 entries from a table of millions of entries, like this:
SELECT * FROM table ORDER BY id DESC LIMIT 10
I would have thought it's a pretty clear case: there's an index for this particular field, the sort order matches, and I need only 10 entries compared to millions in the whole table - this query should definitely use an index scan.
But it doesn't; it does a sequential scan over the whole table.
I tried to dig deeper but didn't find anything unusual. The Postgres doc at https://www.postgresql.org/docs/9.6/static/indexes-ordering.html says:
An important special case is ORDER BY in combination with LIMIT n: an
explicit sort will have to process all the data to identify the first
n rows, but if there is an index matching the ORDER BY, the first n
rows can be retrieved directly, without scanning the remainder at all.
But it still doesn't work. Does anybody have any pointers for me? Maybe I'm just not seeing the forest for the trees anymore... :-(
OK, saying it out loud and trying to gather more information to put into my question apparently made me see the forest again, and I found the actual problem. Further down in the doc I mentioned above is this sentence:
An index stored in ascending order with nulls first can satisfy either
ORDER BY x ASC NULLS FIRST or ORDER BY x DESC NULLS LAST depending on
which direction it is scanned in.
This was the problem. I specified the sort order in the index but I ignored the NULLS FIRST vs. LAST.
For descending order, the Postgres default is NULLS FIRST if you don't mention it explicitly in your query. So what Postgres found was the combination ORDER BY ... DESC NULLS FIRST, which wasn't covered by my index. The combination of both sort order and NULLS placement is what matters.
The 2 possible solutions:
Either mention NULLS FIRST/LAST accordingly in the query so that it matches the index
...or change the index to NULLS FIRST (which is what I did)
Now Postgres is doing a proper index scan and only touches 10 elements during the query, not all of them.
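A minimal sketch of both options, using big_table/id as placeholder names; the point is that the query's sort direction and NULLS placement together must match what the index can deliver:
-- index as originally created: DESC NULLS LAST (not the default for DESC)
create unique index big_table_id_idx on big_table (id desc nulls last);
-- does NOT match: ORDER BY id DESC implies NULLS FIRST, so this index cannot deliver the requested order
select * from big_table order by id desc limit 10;
-- option 1: make the query match the existing index
select * from big_table order by id desc nulls last limit 10;
-- option 2: recreate the index with the default NULLS placement for DESC
drop index big_table_id_idx;
create unique index big_table_id_idx on big_table (id desc);  -- implies nulls first
select * from big_table order by id desc limit 10;            -- now matches the index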
If you need to get the last 10 entries in a table you can use this:
SELECT *
FROM table
WHERE id >= (SELECT MAX(id) FROM table) - 10
ORDER BY id DESC
And similarly for 20 and 30 entries.
This looks less clear, but it works fast as long as you have an index on the 'id' column.

db2 10.5 multi-column index explanation

This is my first time working with indexes in a database, and so far I've learned that if you have a multi-column index such as index('col1', 'col2', 'col3'), and you run a query that uses where col2='col2' and col3='col3', that index would not be used.
I also learned that if a column has very low selectivity, indexing it is useless.
However, from my tests, it seems neither of the above is true at all. Can someone explain this?
I have a table with more than 16 million records. Let's say claimID is the primary key; then there's a historynumber column that has only 3 distinct values (1, 2, 3), and finally a storeNumber column that has about 1 million distinct values.
I have an index on claimID alone, another index (historynumber, claimID), another index (historynumber, storeNumber), and finally index (storeNumber, historynumber).
My guess was that if I do:
select * from my_table where claimId='123456' and historynumber = 1
would be much faster than
select * from my_table where historynumber = 1 and claimId = '123456'
However, the two have exactly the same performance (instant). So I thought the primary key index can work with any column order. Therefore, I tried the same thing but on historynumber and storeNumber instead. The result is exactly the same. Then I started trying columns that have no indexes, and of course the result is the same as well.
Finally, I do a
select * from my_table where historynumber = 1
and the query takes so long I had to cancel it.
So my conclusion is that the column order in the where clause is completely irrelevant, and so is the column order in the index definition, since it seems the database is smart enough to tell which column has the highest selectivity.
Could someone give me an example that could prove otherwise?
Index explanation is a huge topic.
Don't worry about the sequence of different attributes in the SQL - it has no effect whether you specify
...where claimId='123456' and historynumber = 1
or the other way round. Each SQL statement is checked and optimized by the optimizer. To verify how the data gets accessed you can run an EXPLAIN. Check the documentation for more details.
For your other problem
select * from my_table where historynumber = 1
with an index of (storeNumber, historynumber).
Have you ever tried to look up the name of a caller (knowing only the telephone number) in a telephone book?
Well, it is pretty much the same for an index - so the column order when creating the index matters!
There are techniques which could help - e.g. an index jump scan - but there is no guarantee.
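A small sketch of that leading-column rule using the indexes and names from the question (the exact plans of course depend on the optimizer and statistics):
-- index (storeNumber, historynumber): storeNumber is the leading column, so a predicate
-- on historynumber alone cannot seek into it (like searching a phone book by first name)
select * from my_table where historynumber = 1;
-- index (historynumber, storeNumber) has historynumber first, so this query can seek on
-- both columns; 123456 is just a placeholder value
select * from my_table where historynumber = 1 and storeNumber = 123456;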
Check out following sites to learn a little bit more about DB2 indexes:
http://db2commerce.com/2013/09/19/db2-luw-basics-indexes/
http://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys

Does a column of integers contain a value

What is the fastest way, performance-wise, to check whether an integer column contains a specific value?
I have a table with 10 million rows in PostgreSQL 8.4. I need to do at least 10,000 checks per second.
Currently I am running the query SELECT id FROM table WHERE id = my_value and then checking whether the DataReader has rows. But it is quite slow. Is there any way to speed this up without loading the whole column into memory?
You can select COUNT instead:
SELECT COUNT(*) FROM table WHERE id = my_value
It will return just one integer value - the number of rows matching your select condition.
You need two things.
As Marcin pointed out, you want to use COUNT(*) if all you need is to know how many. You also need an index on that column. The index will have the answer pretty much right at hand. Without the index, PostgreSQL would still have to go through the entire table to count that one number.
CREATE INDEX id_idx ON table (id ASC NULLS LAST);
Something of the sort should get you there. Whether it is enough to run the query 10,000/sec. will depend on your hardware...
If you use where id = X then all rows matching X will be returned. Suppose 1000 values match X; then 1000 rows will be returned.
Now, if you only want to check whether the value appears at least once, then after you matched the first value there is no need to process the other 999. Even if you count the values you are still going through all of them.
What I would do in this case is this:
SELECT 1 FROM table
WHERE id = my_value
LIMIT 1
Note that I'm not even returning the id itself. So if you get one record then the value is there.
Of course, in order to improve this query, make sure you have an index on the id column.
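For reference, a minimal sketch of the same existence check wrapped in EXISTS, which also stops at the first matching row and returns a single boolean (mytable stands in for the table name, since table itself is a reserved word; id and my_value are the question's placeholders):
-- assumes the index on the id column suggested above
SELECT EXISTS (SELECT 1 FROM mytable WHERE id = my_value);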