Erratic indexed query performance in PostgreSQL - sql

Need help regarding performance of a query in PostgreSQL. It seems to relate to the indexes.
This query:
Filters according to type
Orders by timestamp, ascending:
SELECT * FROM the_table WHERE type = 'some_type' ORDER BY timestamp LIMIT 20
The Indexes:
CREATE INDEX the_table_timestamp_index ON the_table(timestamp);
CREATE INDEX the_table_type_index ON the_table(type);
The values of the type field are only ever one of about 11 different strings.
The problem is that the query seems to execute in O(log n) time, taking only a few milliseconds most times except for some values of type which take on the order of several minutes to run.
In these example queries, the first takes only a few milliseconds to run while the second takes over 30 minutes:
SELECT * FROM the_table WHERE type = 'goq' ORDER BY timestamp LIMIT 20
SELECT * FROM the_table WHERE type = 'csp' ORDER BY timestamp LIMIT 20
I suspect, with about 90% certainty, that the indexes we have are not the right ones. I think, after reading this similar question about index performance, that most likely what we need is a composite index, over type and timestamp.
The query plans that I have run are here:
Expected performance, type-specific index (i.e. new index with the type = 'csq' in the WHERE clause).
Slowest, problematic case, indexes as described above.
Fast case, same indexes as above.
Thanks very much for your help! Any pointers will be really appreciated!

The indexes can be used either for the where clause or the order by clause. With the index thetable(type, timestamp), then the same index can be used for both.
My guess is that Postgres is deciding which index to use based on statistics it gathers. When it uses the index for the where and then attempts a sort, you get really bad performance.
This is just a guess, but it is worth creating the above index to see if that fixes the performance problems.

The explain outputs all use the timestamp index. That is probably because the cardinality of the type column is too low so a scan on an index on that column is as expensive as a table scan.
The composite index to be created should be:
create index comp_index on the_table ("timestamp", type)
In that order.

Related

Fastest PostgreSQL query for field length?

If I need to find the maximum length of a few fields stored as numeric (i.e. variable length number) in postgresql so my team can build a fixed width file layout and the length isn't in the metadata, is there a faster way to get that info than either
select field
from table
where field is not null
order by field desc
limit 1;
or
select max(field)
from table;
?
The tables these fields are in have tens of millions of rows so these queries are taking quite a while. I'm a decent postgresql user, but optimizing for efficiency has never been my strong suite - I don't usually work with such large datasets. Any help is appreciated, even if this is a dumb question!
Your queries look fine. The where clause is not needed in the first query, that can be written as:
select myfield from mytable order by myfield desc nulls last limit 1;
Then, for performance, consider the following index:
create index myidx on mytable(myfield desc nulls last);
Actually Postgres should be able to read the index backwards, so this should be just as good:
create index myidx on mytable(myfield);
With any of these indexes in place, the database should be able to execute the whole query by looking at the index only, which should be very efficient.

Index scan for multicolumn comparison - non-uniform index column ordering

This question is closely related to Enforcing index scan for multicolumn comparison
The solution there is perfect, but seems to works only if all index columns have same ordering. This question is different because column b is desc here, and this fact stops from using row-syntax to solve the same problem. This is why I'm looking for another solution.
Suppose index is built for 3 columns (a asc, b DESC, c asc), I want Postgres to:
find key [a=10, b=20, c=30] in that B-tree,
scan next 10 entries and return them.
If the index has only one column the solution is obvious:
select * from table1 where a >= 10 order by a limit 10
But if there are more columns the solution becomes much more complex. For 3 columns:
select * from table1
where a > 10 or (a = 10 and (b < 20 or b = 20 and c <= 30))
order by a, b DESC, c
limit 10;
How can I tell Postgres that I want this operation?
And can I be sure that even for those complex queries for 2+ columns the optimizer will always understand that he should perform range-scan? Why?
PostgreSQL implements tuples very thoroughly, (unlike half implementations found in Oracle, DB2, SQL Server, etc.). You can write your condition using "tuples inequality", as in:
select *
from table1
where (a, -b, c) >= (10, -20, 30)
order by a, -b, c
limit 10
Please note that since the second column is in descending order, you must "invert" its value during the comparison. That's why it's expressed as -b and also, -20. This can be tricky for non-numeric columns such as dates, varchars, LOBs, etc.
Finally, the use of an index is still possible with the -b column value if you create an ad-hoc index, such as:
create index ix1 on table1 (a, (-b), c);
However, you can never force PostgreSQL to use an index. SQL is a declarative language, not an imperative one. You can entice it to do it by keeping table stats up to date, and also by selecting a small number of rows. If your LIMIT is too big, PostgreSQL may be inclined to use a full table scan instead.
Strictly speaking, your index on (a ASC, b DESC, c ASC) can still be used, but only based on the leading expression a. See:
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
It's usefulness is limited and Postgres will only use it if the predicate on a alone is selective enough (less than roughly 5% of all rows have a >= 10). (Or possibly to profit from an index-only scans where possible.) But all index tuples qualifying on only a have to be read and you will see a FILTER step in the query plan to discard non-qualifying rows - both adding additional cost. An index on just (a) typically does a better job as it's smaller and cheaper to maintain.
I have tried and failed in the past to make full use of an index with non-uniform sort order (ASC | DESC) like you display for ROW value comparison. I am pretty certain it's not possible. Think of it: Postgres compares whole row values, which can either be greater or smaller, but not both at the same time.
There are workarounds for datatypes with a defined negator (like - for numeric data types). See the solution provided by "The Impaler"! The trick is to invert values and wrap it in an expression index to get uniform sort order for all index expressions after all - which is currently the only way to tap into the full potential of row comparison. Be sure to make both WHERE conditions and ORDER BY match the special index.

Oracle sql statement on very large table

I relative new to sql and I have a statement which takes forever to run.
SELECT
sum(a.amountcur)
FROM
custtrans a
WHERE
a.transdate <= '2013-12-31';
I's a large table but the statemnt takes about 6 minutes!
Any ideas why?
Your select, as you post it, will read 99% of the whole table (2013-12-31 is just a week ago, and i assume most entries are before that date and only very few after). If your table has many large columns (like varchar2(4000)), all that data will be read as well when oracle scans the table. So you might read several KB each row just to get the 30 bytes you need for amountcur and transdate.
If you have this scenario. create a combined index on transdate and amountcur:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
With the combined index, oracle can read the index to fulfill your query and doesn't have to touch the main table at all, which might result in considerably less data that needs to be read from disk.
Make sure the table has an index on transdate.
create index custtrans_idx on custtrans (transdate);
Also if this field is defined as a date in the table then do
SELECT sum(a.amountcur)
FROM custtrans a
WHERE a.transdate <= to_date('2013-12-31', 'yyyy-mm-dd');
If the table is really large, the query has to scan every row with transdate below given.
Even if you have an index on transdate and it helps to stop the scan early (which it may not), when the number of matching rows is very high, it would take considerable time to scan them all and sum the values.
To speed things up, you could calculate partial sums, e.g. for each passed month, assuming that your data is historical and past does not change. Then you'd only need to scan custtrans only for 1-2 months, then quickly scan the table with monthly sums, and add the results.
Try to create an index only on column amountcur:
CREATE INDEX myindex ON custtrans(amountcur)
In this case Oracle will read most probably only the Index (Index Full Scan), nothing else.
Correction, as mentioned in comment. It must be a composite Index:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
But maybe it is a bit useless to create an index just for a single select statement.
One option is to create an index on the column used in the where clause (this is useful if you want to retrieve only 10-15% rows by using indexed column).
Another option is to partition your table if it has millions of rows. In this case also if you try to retrieve 70-80% data, it wont help.
The best option is first to analyze your requirements and then make a choice.
Whenever you deal with date functions it's better to use to_date() function. Do not rely on implicit data type conversion.

Getting RID Lookup instead of Table Scan?

SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical, the only difference being that for the second query the RID Lookup in SQL Fiddle, is a Table Scan in production :(
I would like to understand possibilities that would cause this behavior? What kinds of things would you look at to help determine the cause of it using a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details though, and I can experiment with that more, but would like to have a understanding of what is going on now before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path that will be chose by the optimiser, you would need to look at possible indexes that could be added/modified to make an index access more efficient, but this should be done with care as it can adversely affect other queries as well as possibly degrading insert performance.
The other important activity you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency that is appropriate to the rate of change of the frequency distribution in the table data
If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint - if you know the index that your plan should be using which contains the ID and ForeignID columns then stick that in your query as a hint and force SQL optimiser to use the index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);

Why is my index not automatically used?

I have a table
Archive(VarId SMALLINT, Timestamp DATETIME, Value FLOAT)
VarId is not unique. The table contains measurements. I have a clustered index on Timestamp. Now i have the requirement of finding a measurement for a specific VarId before a specific date. So I do:
SELECT TOP(1) *
FROM Archive
WHERE VarId = 135
AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;
If there is no such measurement this query searches the whole table. So I introduced another index on (VarId, Timestamp).
My problem is: SQL Server doesn't seem to care about it, the query still takes forever. When I explicitly state 'WITH (INDEX = <id>)' it works as it should. What can I do so SQL Server uses my index automatically?
I'm using SQL Server 2005.
There are different possibilities with this.
I'll try help you to isolate them:
It could be SQL Server is favouring your Clustered Index (very likely it's the Primary Key) over your newly created index. One way to solve this is to have a NonClustered Primary Key and cluster the index on the other two fields (varid and timestamp). That is, if you don't want varid and timestamp to be the PK.
Also, looking at the (estimated) execution plan might help.
But I believe #1 only works nicely if those 2 fields are the most commonly used (queried) index. To find out if this is the case, it would be good to analyse which index users are most likely use (from http://sqlblog.com/blogs/louis_davidson/archive/2007/07/22/sys-dm-db-index-usage-stats.aspx):
select
ObjectName = object_schema_name(indexes.object_id) + '.' + object_name(indexes.object_id),
indexes.name,
case when is_unique = 1 then 'UNIQUE ' else '' end + indexes.type_desc,
ddius.user_seeks,
ddius.user_scans,
ddius.user_lookups,
ddius.user_updates
from
sys.indexes
left join sys.dm_db_index_usage_stats ddius on (
indexes.object_id = ddius.object_id
and indexes.index_id = ddius.index_id
and ddius.database_id = db_id()
)
WHERE
object_schema_name(indexes.object_id) != 'sys' -- exclude sys objects
AND object_name(indexes.object_id) LIKE 'Archive'
order by
ddius.user_seeks + ddius.user_scans + ddius.user_lookups
desc
Good luck
My guess is that your index design is the issue. You have a CLUSTERED index on a DATETIME field and I suspect that it is not unique data, much like VarId, and hence you did not declare it as UNIQUE. Because it is not unique there is a hidden, 4-byte "uniqueifier" field (so that each row can by physically unique regardless of you not giving it unique data) and the rows with the same DATETIME value are essentially random within the group of same DATETIME values (so even narrowing down a time still requires scanning through that grouping). You also have a NONCLUSTERED index on VarId, Timestamp. NONCLUSTERED indexes include the data from the CLUSTERED index so internally your NONCLUSTERED index is really: VarId, Timestamp, Timestamp (from the CLUSTERED index). So you could have left off the Timestamp column in the NONCLUSTERED index and it would have all been the same to the optimizer, but in a sense it would have been better as it would be a smaller index.
So your physical layout is based on a date while the VarId values are spread across those dates. Hence VarId = 135 can be spread very far apart in terms of data pages. Yes, your non-clustered index does group them together, but the optimizer is probably looking at the fact that you are wanting all fields (the "SELECT *" part) and the Timestamp < '2012-06-01 14:21:00' condition in addition to that seems to get most of what you need as opposed to finding a few rows and doing a bookmark lookup to get the "Value" field to fulfill the "SELECT *". Quite possibly if you do just "SELECT TOP(1) VarId, Timestamp" it would more likely use your NONCLUSTERED index without needing the "INDEX =" hint.
Another issue affecting performance overall could be that the ORDER BY is requesting the Timestamp in DESC order and if you have the CLUSTERED index in ASC order then it would be the opposite direction of what you are looking for (at least in this query). Of course, in that case then it might be ok to have Timestamp in the NONCLUSTERED index if it was in DESC order.
My advice is to rethink the CLUSTERED index. Judging on just this query alone (other queries/uses might alter the recommendation), try dropping the NONCLUSTERED index and recreate the CLUSTERED index with the Timestamp field first, in DESC order, and also with the VarId so it can be delcared UNIQUE. So:
CREATE UNIQUE CLUSTERED INDEX [UIX_Archive_Timestamp_VarId]
ON Archive (Timestamp DESC, VarId ASC)
This, of course, assumes that the Timestamp and VarId combination is unique. If not, then still try this without the UNIQUE keyword.
Update:
To pull all of this info and advice together:
When designing indexes you need to consider the distribution of the data and the use-cases for interacting with it. More often than not there is A LOT to consider and several different approaches will appear good in theory. You need to try a few approaches, profile/test them, and see which works best in reality. There is no "always do this" approach without knowing all aspects of what you are doing and what else is going on and what else is planned to use and/or modify this table which I suspect has not been presented in the original question.
So to start the journey, you are ordering records by date and are looking at ranges of dates AND dates naturally occur in order so putting Timestamp first benefits more of what you are doing and has less fragmentation, especially if defined as DESC in the CREATE. Having an NC index on just VarId at that point will then be fine, even if spread out, for looking at a set of rows for a particular VarId. So maybe start there (change order of direction of CLUSTERED index and remove Timestamp from the NC index). See how those changes compare to the existing structure. Then try moving the VarId field into the CLUSTERED index and remove the NC index. You say that the combination is also not unique but does increase the predictability of the ordering of the rows. See how that works. Does this table ever get updated? If not and if the Value field along with Timestamp and VarId would be unique, then try adding that to the CLUSTERED index and be sure to create with the UNIQUE keyword. See how these different approaches work by looking at the Actual Execution Plan and use SET STATISTICS IO ON before running the query and see how the logical reads between the different approaches compare.
Hope this helps :)
You might need to analyze your table to collect statistics, so the optimizer can determine whether to use the index or not.