Why does using the MAX function in a query cause a PostgreSQL performance issue?

I have a table with three columns, time_stamp, device_id and status, where status is of type json. The time_stamp and device_id columns are indexed. I need to grab the latest status whose value for id 1.3.6.1.4.1.34094.1.1.1.1.1 is not null.
The queries with and without MAX are below, followed by their execution times.
Query with MAX:
SELECT DISTINCT MAX(time_stamp) FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
Query without MAX:
SELECT DISTINCT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
The first query takes about 3 seconds and the second one takes just 3 milliseconds, with two different plans. I expected both queries to have the same query plan. Why doesn't it use the index when it calculates MAX? How can I improve the running time of the first query?
PS: I use PostgreSQL 9.6 (dockerized version).
This is the table definition:
-- Table: device.status_events
-- DROP TABLE device.status_events;
CREATE TABLE device.status_events
(
time_stamp timestamp with time zone NOT NULL,
device_id bigint,
status jsonb,
is_active boolean DEFAULT true,
CONSTRAINT status_events_device_id_fkey FOREIGN KEY (device_id)
REFERENCES device.devices (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE device.status_events
OWNER TO monitoring;
-- Index: device.status_events__time_stamp
-- DROP INDEX device.status_events__time_stamp;
CREATE INDEX status_events__time_stamp
ON device.status_events
USING btree
(time_stamp);

The index you show us cannot produce the first plan you show us. With that index, the plan would have to be applying a filter for the jsonb column, which it isn't. So the index must be a partial index, with the filter being applied at the index level so that it is not needed in the plan.
PostgreSQL is using an index for the MAX query; it just isn't the index you want it to use.
All of your rows with device_id = 7 must have low timestamps, but PostgreSQL doesn't know this. It assumes that by walking the time_stamp index backwards from the highest value, it will quickly find a row with device_id = 7 and then be done. Instead, it has to walk a large chunk of the index before finding such a row.
You can force it away from the "wrong" index by changing the aggregate expression to something like:
MAX(time_stamp + interval '0')
Or you could instead build a more tailored index, which the planner will choose instead of the falsely attractive one:
create index on device.status_events (device_id , time_stamp)
where status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}';
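
As a rough illustration of the tailored-index idea, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The integer time_stamp, the has_value flag (standing in for the jsonb test) and the index name status_events__device_ts are assumptions made for the example. SQLite's planner is not PostgreSQL's, so the printed plans only hint at what to look for in EXPLAIN on the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE status_events (
        time_stamp INTEGER NOT NULL,  -- assumption: integer stand-in for timestamptz
        device_id  INTEGER,
        has_value  INTEGER            -- assumption: stand-in for the jsonb condition
    );
    -- single-column index, like status_events__time_stamp in the question
    CREATE INDEX status_events__time_stamp ON status_events (time_stamp);
    -- tailored partial index: equality column first, then the ordered column
    CREATE INDEX status_events__device_ts
        ON status_events (device_id, time_stamp)
        WHERE has_value = 1;
""")

# Device 7 only has old rows; every later row belongs to devices 1..5.
rows = [(ts, 7 if ts < 100 else 1 + ts % 5, 1) for ts in range(10_000)]
conn.executemany("INSERT INTO status_events VALUES (?, ?, ?)", rows)

max_sql = (
    "SELECT MAX(time_stamp) FROM status_events "
    "WHERE device_id = 7 AND has_value = 1"
)
limit_sql = (
    "SELECT time_stamp FROM status_events "
    "WHERE device_id = 7 AND has_value = 1 "
    "ORDER BY time_stamp DESC LIMIT 1"
)

max_form = conn.execute(max_sql).fetchone()[0]
limit_form = conn.execute(limit_sql).fetchone()[0]
print(max_form, limit_form)

# Show which index SQLite picks; the wording differs between SQLite versions.
for sql in (max_sql, limit_sql):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(row[-1])
```

Both phrasings return the same timestamp. With the composite index in place, the lookup can go straight to the device_id = 7 entries instead of scanning timestamps that belong to other devices.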

I believe this should generate a better plan:
SELECT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}')
ORDER BY time_stamp DESC
LIMIT 1
Let me know how that works for you.

Related

Create a unique index on a non-unique column

Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to [quickly] query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but it seems this adds extra overhead if the number of distinct values is substantially less than the number of rows the index covers. In my case, about 1 in 30 days is distinct.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.

An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT day FROM data ORDER BY 1 LIMIT 1
)
UNION ALL
SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
FROM cte c
WHERE c.day IS NOT NULL -- exit condition
)
SELECT day FROM cte;
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index
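
The loose index scan can be tried out without a PostgreSQL server. The following is a minimal sketch using Python's built-in sqlite3 module, with made-up sample dates and a hypothetical index name data_day_idx. SQLite does not accept the parenthesized first SELECT, so this variant (an adaptation, not the query above verbatim) seeds the recursion with MIN(day). It also filters out the NULL row that the recursion emits as its last step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, day TEXT, val NUMERIC)")
conn.execute("CREATE INDEX data_day_idx ON data (day)")  # plain index on day

# Assumed sample data: 3 distinct days with 30 rows each.
days = ["2024-01-01", "2024-01-02", "2024-01-03"]
conn.executemany(
    "INSERT INTO data (day, val) VALUES (?, ?)",
    [(d, v) for d in days for v in range(30)],
)

loose_scan = """
WITH RECURSIVE cte(day) AS (
    SELECT MIN(day) FROM data              -- smallest day seeds the recursion
    UNION ALL
    SELECT (SELECT day FROM data
            WHERE day > c.day
            ORDER BY day LIMIT 1)          -- next larger day, one index probe
    FROM cte c
    WHERE c.day IS NOT NULL                -- exit condition
)
SELECT day FROM cte WHERE day IS NOT NULL  -- drop the trailing NULL row
"""
distinct_days = [row[0] for row in conn.execute(loose_scan)]
print(distinct_days)
```

Each recursion step does one index probe for the next larger day, so the work grows with the number of distinct days rather than with the number of rows.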

Db2 max(DATE_FIELD) optimization

I encountered this SQL statement in our legacy software:
select max(DATE_FIELD) from schema.VERY_LARGE_TABLE.
Now, this in itself would not be a problem, but VERY_LARGE_TABLE contains over 20,000,000 records and execution of the statement sometimes exceeds specified timeout of 30 seconds. This now represents an issue as users do not see a correct date.
I have stored this date in cache, so that once the date is obtained it is not obtained any more that day. But this does not help with original issue.
I was wondering, is there a way to optimize a table or SQL statement to perform faster than it does now?
[Edit] This behaviour only occurs until the table is in the DB2 server's cache. After that, the query runs in approximately 2 seconds.
[Edit2] The platform is IBM DB2 9.7.6 for LUW.
[Edit3] DDL:
create table MYSCHEMA.VERY_LARGE_TABLE (
ID integer not null generated always as identity (start with 1,
increment by 1,
minvalue 1,
maxvalue 2147483647,
no cycle,
cache 20),
CODE_FIELD varchar(15) not null,
DATE_FIELD date default CURRENT DATE not null
)
in TS_LARGE;
create unique index MYSCHEMA.VERY_LARGE_TABLE_1 on MYSCHEMA.VERY_LARGE_TABLE (CODE_FIELD asc, DATE_FIELD asc);
create index MYSCHEMA.VERY_LARGE_TABLE_ARCHIVE on MYSCHEMA.VERY_LARGE_TABLE (DATE_FIELD asc, CODE_FIELD asc);
create index MYSCHEMA.INDX_VERY_LARGE_TABLE_ID on MYSCHEMA.VERY_LARGE_TABLE (ID asc);
alter table MYSCHEMA.VERY_LARGE_TABLE add constraint MYSCHEMA.PK_VERY_LARGE_TABLE primary key (CODE_FIELD, DATE_FIELD);

There are a couple of things you can do.
Create an index on DATE_FIELD.
Execute the select statement with WITH UR at the end of the statement, like
select max(DATE_FIELD) from schema.VERY_LARGE_TABLE with ur
This will speed up the process.
Try to increase the primary and secondary log file sizes, but get help from a DBA before doing this, as it will affect the database. If you are not sure, don't go for this option.
If there are bulk operations going on against the table, issue frequent commits.

As long as you have an index on DATE_FIELD DESCENDING only, try:
SELECT DATE_FIELD
FROM VERY_LARGE_TABLE
ORDER BY DATE_FIELD DESC
FETCH FIRST ROW ONLY
WITH UR

sqlite3 select min, max together is much slower than select them separately

sqlite> explain query plan select max(utc_time) from RequestLog;
0|0|0|SEARCH TABLE RequestLog USING COVERING INDEX key (~1 rows) # very fast
sqlite> explain query plan select min(utc_time) from RequestLog;
0|0|0|SEARCH TABLE RequestLog USING COVERING INDEX key (~1 rows) # very fast
sqlite> explain query plan select min(utc_time), max(utc_time) from RequestLog;
0|0|0|SCAN TABLE RequestLog (~8768261 rows) # will be very very slow
When I use min and max separately, it works perfectly. However, SQLite seems to 'forget' the index when I select min and max together. Is there any configuration I can change (I already ran ANALYZE; it didn't help)? Or is there an explanation for this behavior?
EDIT1
sqlite> .schema
CREATE TABLE FixLog(
app_id text, __key__id INTEGER,
secret text, trace_code text, url text,
action text,facebook_id text,ip text,
tw_time datetime,time datetime,
tag text,to_url text,
from_url text,referer text,weight integer,
Unique(app_id, __key__id)
);
CREATE INDEX key4 on FixLog(action);
CREATE INDEX time on FixLog(time desc);
CREATE INDEX tw_time on FixLog(tw_time desc);
sqlite> explain query plan select min(time) from FixLog;
0|0|0|SEARCH TABLE FixLog USING COVERING INDEX time (~1 rows)
sqlite> explain query plan select max(time) from FixLog;
0|0|0|SEARCH TABLE FixLog USING COVERING INDEX time (~1 rows)
sqlite> explain query plan select max(time), min(time) from FixLog;
0|0|0|SCAN TABLE FixLog (~1000000 rows)

This is a known quirk of the SQLite query optimizer, as explained here: http://www.sqlite.org/optoverview.html#minmax:
Queries of the following forms will be optimized to run in logarithmic time assuming appropriate indices exist:
SELECT MIN(x) FROM table;
SELECT MAX(x) FROM table;
In order for these optimizations to occur, they must appear in exactly the form shown above - changing only the name of the table and column. It is not permissible to add a WHERE clause or do any arithmetic on the result. The result set must contain a single column. The column in the MIN or MAX function must be an indexed column.
UPDATE (2017/06/23): Recently, this has been updated to say that a query containing a single MAX or MIN might be satisfied by an index lookup (allowing for things like arithmetic); however, they still preclude having more than one such aggregation operator in a single query (so MIN,MAX will still be slow):
Queries that contain a single MIN() or MAX() aggregate function whose argument is the left-most column of an index might be satisfied by doing a single index lookup rather than by scanning the entire table. Examples:
SELECT MIN(x) FROM table;
SELECT MAX(x)+1 FROM table;
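
Since each single-aggregate query is eligible for the index lookup on its own, one common workaround (not mentioned in the documentation quoted above, so treat it as an assumption to verify on your SQLite version) is to wrap MIN and MAX in separate scalar subqueries. Here is a minimal sketch with Python's built-in sqlite3 module. The index name idx_utc_time and the sample data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RequestLog (id INTEGER PRIMARY KEY, utc_time INTEGER)")
conn.execute("CREATE INDEX idx_utc_time ON RequestLog (utc_time)")  # hypothetical name
conn.executemany(
    "INSERT INTO RequestLog (utc_time) VALUES (?)",
    [(t,) for t in range(1000, 2000)],
)

# Two aggregates in one SELECT: not covered by the min/max optimization.
combined = "SELECT MIN(utc_time), MAX(utc_time) FROM RequestLog"
# One aggregate per scalar subquery: each can be answered on its own.
split = (
    "SELECT (SELECT MIN(utc_time) FROM RequestLog), "
    "(SELECT MAX(utc_time) FROM RequestLog)"
)

combined_result = conn.execute(combined).fetchone()
split_result = conn.execute(split).fetchone()
print(combined_result, split_result)

# Print the plans for comparison; the exact text depends on the SQLite version.
for label, sql in (("combined", combined), ("split", split)):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(label, row[-1])
```

Both statements return the same pair of values. The printed plans show whether the split form uses index searches on your build.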

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?

Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?

Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that specifying criteria > 1000 on a unique column is non-selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). In order for an index to be considered helpful, the optimizer must conclude that < 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the non-clustered index that was scanned?
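
The selectivity figure is easy to check by hand. The sketch below assumes the Ids form the contiguous sequence 1 to 38,000,000 (seed 1, no gaps), which the question does not confirm.

```python
total_rows = 38_000_000  # row count given in the question
threshold = 1000         # the predicate is ID > 1000

# Assumption: Ids are exactly 1..total_rows, so `threshold` rows fail the predicate.
matching = total_rows - threshold
fraction = matching / total_rows

print(f"{fraction:.5%} of rows satisfy ID > {threshold}")
```

A predicate that keeps nearly every row gives the optimizer little reason to prefer a seek over a scan on its own.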

Efficient querying of multi-partition Postgres table

I've just restructured my database to use partitioning in Postgres 8.2. Now I have a problem with query performance:
SELECT *
FROM my_table
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100;
There are 45 million rows in the table. Prior to partitioning, this would use a reverse index scan and stop as soon as it hit the limit.
After partitioning (on time_stamp ranges), Postgres does a full index scan of the master table and the relevant partition and merges the results, sorts them, then applies the limit. This takes way too long.
I can fix it with:
SELECT * FROM (
SELECT *
FROM my_table_part_a
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
SELECT * FROM (
SELECT *
FROM my_table_part_b
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
... and so on ...
ORDER BY id DESC
LIMIT 100
This runs quickly. The partitions where the times-stamps are out-of-range aren't even included in the query plan.
My question is: Is there some hint or syntax I can use in Postgres 8.2 to prevent the query-planner from scanning the full table but still using simple syntax that only refers to the master table?
Basically, can I avoid the pain of dynamically building the big UNION query over each partition that happens to be currently defined?
EDIT: I have constraint_exclusion enabled (thanks @Vinko Vrsalovic)

Have you tried Constraint Exclusion (section 5.9.4 in the document you've linked to)
Constraint exclusion is a query optimization technique that improves performance for partitioned tables defined in the fashion described above. As an example:
SET constraint_exclusion = on;
SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';
Without constraint exclusion, the above query would scan each of the partitions of the measurement table. With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query's WHERE clause. When the planner can prove this, it excludes the partition from the query plan.
You can use the EXPLAIN command to show the difference between a plan with constraint_exclusion on and a plan with it off.

I had a similar problem that I was able to fix by casting the conditions in WHERE.
EG: (assuming the time_stamp column is timestamptz type)
WHERE time_stamp >= '2010-02-10'::timestamptz and time_stamp < '2010-02-11'::timestamptz
Also, make sure the CHECK condition on the table is defined the same way...
EG:
CHECK (time_stamp < '2010-02-10'::timestamptz)

I had the same problem and it boiled down to two reasons in my case:
I had an indexed column of type timestamp WITH time zone, while the partition constraint on that column used type timestamp WITHOUT time zone.
After fixing the constraints, an ANALYZE of all child tables was needed.
Edit: another bit of knowledge. It's important to remember that constraint exclusion (which allows PG to skip scanning some tables based on your partitioning criteria) doesn't work with, quote, "non-immutable function such as CURRENT_TIMESTAMP".
My queries used CURRENT_DATE, which was part of my problem.
