Slow LEFT JOIN on CTE with time intervals

Slow LEFT JOIN on CTE with time intervals - sql

I am trying to debug a query in PostgreSQL that I've built to bucket market data in time buckets in arbitrary time intervals. Here is my table definition:
CREATE TABLE historical_ohlcv (
exchange_symbol TEXT NOT NULL,
symbol_id TEXT NOT NULL,
kafka_key TEXT NOT NULL,
open NUMERIC,
high NUMERIC,
low NUMERIC,
close NUMERIC,
volume NUMERIC,
time_open TIMESTAMP WITH TIME ZONE NOT NULL,
time_close TIMESTAMP WITH TIME ZONE,
CONSTRAINT historical_ohlcv_pkey
PRIMARY KEY (exchange_symbol, symbol_id, time_open)
);
CREATE INDEX symbol_id_idx
ON historical_ohlcv (symbol_id);
CREATE INDEX open_close_symbol_id
ON historical_ohlcv (time_open, time_close, exchange_symbol, symbol_id);
CREATE INDEX time_open_idx
ON historical_ohlcv (time_open);
CREATE INDEX time_close_idx
ON historical_ohlcv (time_close);
The table has ~25m rows currently. My query as an example for 1 hour, but could be 5 mins, 10 mins, 2 days, etc.
EXPLAIN ANALYZE WITH vals AS (
SELECT
NOW() - '5 months' :: INTERVAL AS frame_start,
NOW() AS frame_end,
INTERVAL '1 hour' AS t_interval
)
, grid AS (
SELECT
start_time,
lead(start_time, 1)
OVER (
ORDER BY start_time ) AS end_time
FROM (
SELECT
generate_series(frame_start, frame_end,
t_interval) AS start_time,
frame_end
FROM vals
) AS x
)
SELECT max(high)
FROM grid g
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
GROUP BY start_time;
The WHERE clause could be any valid value in the table.
This technique was inspired by:
Best way to count records by arbitrary time intervals in Rails+Postgres.
The idea is to make a common table and left join your data with that to indicate which bucket stuff is in. This query is really slow! It's currently taking 15s. Based on the query planner, we have a really expensive nested loop:
QUERY PLAN
HashAggregate (cost=2758432.05..2758434.05 rows=200 width=40) (actual time=16023.713..16023.817 rows=542 loops=1)
Group Key: g.start_time
CTE vals
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
CTE grid
-> WindowAgg (cost=64.86..82.36 rows=1000 width=16) (actual time=2.986..9.594 rows=3625 loops=1)
-> Sort (cost=64.86..67.36 rows=1000 width=8) (actual time=2.981..4.014 rows=3625 loops=1)
Sort Key: x.start_time
Sort Method: quicksort Memory: 266kB
-> Subquery Scan on x (cost=0.00..15.03 rows=1000 width=8) (actual time=0.014..1.991 rows=3625 loops=1)
-> ProjectSet (cost=0.00..5.03 rows=1000 width=16) (actual time=0.013..1.048 rows=3625 loops=1)
-> CTE Scan on vals (cost=0.00..0.02 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1)
-> Nested Loop (cost=0.56..2694021.34 rows=12865667 width=14) (actual time=7051.730..16015.873 rows=31978 loops=1)
-> CTE Scan on grid g (cost=0.00..20.00 rows=1000 width=16) (actual time=2.988..11.635 rows=3625 loops=1)
-> Index Scan using historical_ohlcv_pkey on historical_ohlcv ohlcv (cost=0.56..2565.34 rows=12866 width=22) (actual time=3.712..4.413 rows=9 loops=3625)
Index Cond: ((exchange_symbol = 'BINANCE'::text) AND (symbol_id = 'ETHBTC'::text) AND (time_open >= g.start_time))
Filter: (time_close < g.end_time)
Rows Removed by Filter: 15502
Planning time: 0.568 ms
Execution time: 16023.979 ms
My guess is this line is doing a lot:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
But I'm not sure how to accomplish this in another way.
P.S. apologies if this belongs to dba.SE. I read the FAQ and this seemed too basic for that site, so I posted here.
Edits as requested:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1); returns 107.632
For exchange_symbol, there are 3 unique values, for symbol_id there are ~400
PostgreSQL version: PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit.
The table will be growing about ~1m records a day, so not exactly read-only. All this stuff is done locally and I will try to move to RDS or to help manage hardware issues.
Related: if I wanted to add other aggregates, specifically 'first in the bucket', 'last in the bucket', min, sum, would my indexing strategy change?

Correctness first: I suspect a bug in your query:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
Unlike my referenced answer, you join on a time interval: (time_open, time_close]. The way you do it excludes rows in the table where the interval crosses bucket borders. Only intervals fully contained in a single bucket count. I don't think that's intended?
A simple fix would be to decide bucket membership based on time_open (or time_close) alone. If you want to keep working with both, you have to define exactly how to deal with intervals overlapping with multiple buckets.
Also, you are looking for max(high) per bucket, which is different in nature from count(*) in my referenced answer.
And your buckets are simple intervals per hour?
Then we can radically simplify. Working with just time_open:
SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
FROM historical_ohlcv
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
AND time_open >= now() - interval '5 months' -- frame_start
AND time_open < now() -- frame_end
GROUP BY 1
ORDER BY 1;
Related:
Resample on time series data
It's hard to talk about further performance optimization while basics are unclear. And we'd need more information.
Are WHERE conditions variable?
How many distinct values in exchange_symbol and symbol_id?
Avg. row size? What do you get for:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);
Is the table read-only?
Assuming you always filter on exchange_symbol and symbol_id and values are variable, your table is read-only or autovacuum can keep up with the write load so we can hope for index-only scans, you would best have a multicolumn index on (exchange_symbol, symbol_id, time_open, high DESC) to support this query. Index columns in this order. Related:
Multicolumn index and performance
Depending on data distribution and other details a LEFT JOIN LATERAL solution might be another option. Related:
How to find an average of values for time intervals in postgres
Optimize GROUP BY query to retrieve latest record per user
Aside from all that, you EXPLAIN plan exhibits some very bad estimates:
https://explain.depesz.com/s/E5yI
Are you using a current version of Postgres? You may have to work on your server configuration - or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table. Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Aggressive Autovacuum on PostgreSQL

Related

Postgres / Postgis query optimization

I have a query in Postgres / Postgis which is based around finding the nearest points of a given point filtered by some other columns in the table.
The table consists of a bit over 10 million rows and the query looks like this:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1,2,3)
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
The geom column is indexed using GIST and col1 is also indexed.
When the WHERE clause finds many rows that are also near the point this is blazing fast using the geom index:
Limit (cost=0.42..10575.49 rows=1000 width=12) (actual time=0.150..6.742 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..2148612.35 rows=203177 width=12) (actual time=0.149..6.663 rows=1000 loops=1)
Order By: (geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1,2,3}'::double precision[]))
Rows Removed by Filter: 3348
Planning Time: 0.288 ms
Execution Time: 6.817 ms
The problem occurs when the WHERE clause does not find many rows that are close in distance to the given point. Example:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1) // 1 is very rare near the given point
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
This query runs much much slower:
Limit (cost=0.42..14487.97 rows=1000 width=12) (actual time=8443.514..10629.745 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..1962368.41 rows=135452 width=12) (actual time=8443.513..10629.553 rows=1000 loops=1)
Order By: (t.geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1}'::double precision[]))
Rows Removed by Filter: 5866030
Planning Time: 0.292 ms
Execution Time: 10629.906 ms
I created an index on round(col1) trying to speed up searches on col1 but postgres uses the geom index only which works great when there are many rows nearby that fit the criteria but not so great if there are few rows that match.
If I remove the LIMIT clause Postgres uses the index on col1, which works great when there are few resulting rows but is very slow when the result contains many rows, so I would like to keep the LIMIT clause.
Any suggestions on how I could optimize this query or create an index that handles this?
EDIT:
Thank you for all the suggestions and feedback!
I tried the tip from #JGH and restricted my query using st_dwithin as to not order the entire table before limiting.
...where st_dwithin(geom, searchpoint, 10000)
This greatly reduced the time of the slow query, down to a few milliseconds. Restricting the search to a constant distance works well in my application, so I will use this as the solution.

Order by by subquery result is ridiculously slow

i'm a little confused here.
Here is my (simplified) query:
SELECT *
from (SELECT documents.*,
(SELECT max(date)
FROM registrations
WHERE registrations.document_id = documents.id) AS register_date
FROM documents) AS dcmnts
ORDER BY register_date
LIMIT 20;
And here is my EXPLAIN ANALYSE results:
Limit (cost=46697025.51..46697025.56 rows=20 width=193) (actual time=80329.201..80329.206 rows=20 loops=1)
-> Sort (cost=46697025.51..46724804.61 rows=11111641 width=193) (actual time=80329.199..80329.202 rows=20 loops=1)
Sort Key: ((SubPlan 1))
Sort Method: top-N heapsort Memory: 29kB
-> Seq Scan on documents (cost=0.00..46401348.74 rows=11111641 width=193) (actual time=0.061..73275.304 rows=11114254 loops=1)
SubPlan 1
-> Aggregate (cost=3.95..4.05 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=11114254)
-> Index Scan using registrations_document_id_index on registrations (cost=0.43..3.95 rows=2 width=4) (actual time=0.004..0.004 rows=1 loops=11114254)
Index Cond: (document_id = documents.id)
Planning Time: 0.334 ms
Execution Time: 80329.287 ms
The query takes 1m 20s to execute, is there any way to optimize it? There are lots of rows in these tables (documents:11114642;registrations:13176070).
In actual full query I also have some more filters and it takes up to 4 seconds to execute, and it's still too slow. This subquery orderby seems to be the bottleneck here and I can't figure out the way to optimize it.
I tried to set indexes on date/document_id columns.

Don't use a scalar subquery:
SELECT documents.*,
reg.register_date
FROM documents
JOIN (
SELECT document_id, max(date) as register_date
FROM registrations
GROUP BY document_id
) reg on reg.document_id = documents.id;
ORDER BY register_date
LIMIT 20;

Try to unnest the query
SELECT documents.id,
documents.other_attr,
max(registrations.date) register_date
FROM documents
JOIN registrations ON registrations.document_id = documents.id
GROUP BY documents.id, documents.other_attr
ORDER BY 2
LIMIT 20
The query should be supported at least by an index on registrations(document_id, date):
create index idx_registrations_did_date
on registrations(document_id, date)

In actual full query I also have some more filters and it takes up to 4 seconds to execute, and it's still too slow.
Then ask about that query. What can we say about a query we can't see? Clearly this other query isn't just like this query, except for filtering stuff out after all the work is done, as then it couldn't be faster (other than due to cache hotness) than the one you showed us. It is doing something different, it has to optimized differently.
This subquery orderby seems to be the bottleneck here and I can't figure out the way to optimize it.
The timing for the sort node includes the time if all the work that preceded it, so the time of the actual sort is 80329.206 - 73275.304 = 7 seconds, a long time perhaps but a minority of the total time. (This interpretation is not very obvious from the output itself--it from experience.)
For the query you did show us, you can get it to be quite fast, but only probabilistically correct, by using a rather convoluted construction.
with t as (select date, document_id from registrations
order by date desc, document_id desc limit 200),
t2 as (select distinct on (document_id) document_id, date from t
order by document_id, date desc),
t3 as ( select document_id, date from t2 order by date desc limit 20)
SELECT documents.*,
t3.date as register_date
FROM documents join t3 on t3.document_id = documents.id;
order by register_date
It will be efficiently supported by:
create index on registrations (register_date, document_id);
create index on documents(id);
The idea here is that the 200 most recent registrations will have at least 20 distinct document_id among them. Of course there is no way to know for sure that that will be true, so you might have to increase 200 to 20000 (which should still be pretty fast, compared to what you are currently doing) or even more to be sure you get the right answer. This also assumes that every distinct document_id matches exactly one document.id.

postgresql - join between two large tables takes very long

I do have two rather large tables and I need to do a date range join between those. Unfortunately the query takes over 12 hours. I am using postgresql 10.5 running in docker with max. 5GB of ram and up to 12 CPU cores available.
Basically in the left table I do have an Equipment ID and a list of date ranges (from = Timestamp, to = ValidUntil). I then want to join the right table, which has measurements (sensor data) for all of the equipments, so that I only get the sensor data that lies within one of the date ranges (from the left table). Query:
select
A.*,
B."Timestamp" as "PressureTimestamp",
B."PropertyValue" as "Pressure"
from A
inner join B
on B."EquipmentId" = A."EquipmentId"
and B."Timestamp" >= A."Timestamp"
and B."Timestamp" < A."ValidUntil"
This query unfortunately is only utilizing one core, which might be the reason that it is running that slow. Is there a way to rewrite the query so it can be parallelized?
Indexes:
create index if not exists A_eq_timestamp_validUntil on public.A using btree ("EquipmentId", "Timestamp", "ValidUntil");
create index if not exists B_eq_timestamp on public.B using btree ("EquipmentId", "Timestamp");
Tables:
-- contains 332,000 rows
CREATE TABLE A (
"EquipmentId" bigint,
"Timestamp" timestamp without time zone,
"ValidUntil" timestamp without time zone
)
WITH ( OIDS = FALSE )
-- contains 70,000,000 rows
CREATE TABLE B
(
"EquipmentId" bigint,
"Timestamp" timestamp without time zone,
"PropertyValue" double precision
)
WITH ( OIDS = FALSE )
Execution plan (explain ... output):
Nested Loop (cost=176853.59..59023908.95 rows=941684055 width=48)
-> Bitmap Heap Scan on v2_pressure p (cost=176853.16..805789.35 rows=9448335 width=24)
Recheck Cond: ("EquipmentId" = 2956235)
-> Bitmap Index Scan on v2_pressure_eq (cost=0.00..174491.08 rows=9448335 width=0)
Index Cond: ("EquipmentId" = 2956235)"
-> Index Scan using v2_prs_eq_timestamp_validuntil on v2_prs prs (cost=0.42..5.16 rows=100 width=32)
Index Cond: (("EquipmentId" = 2956235) AND (p."Timestamp" >= "Timestamp") AND (p."Timestamp" < "ValidUntil"))
Update 1:
Fixed the indexes, according to comments, which improved performance a lot

Index correction is the first resort to fix slowness but it will only help to some extent. Given that your tables are big I would recommend to try Postgres Partition . It has some inbuilt support from postgres.
But you need to have some filter/partition criteria. I don't see any where clause in your query so can't suggest. May be you can try equipmentId. This can also help in achieving parallelism.

-- \i tmp.sql
CREATE TABLE A
( equipmentid bigint NOT NULL
, ztimestamp timestamp without time zone NOT NULL
, validuntil timestamp without time zone NOT NULL
, PRIMARY KEY (equipmentid,ztimestamp)
, UNIQUE (equipmentid,validuntil) -- mustbeunique, since the intervals dont overlap
) ;
-- contains 70,000,000 rows
CREATE TABLE B
( equipmentid bigint NOT NULL
, ztimestamp timestamp without time zone NOT NULL
, propertyvalue double precision
, PRIMARY KEY (equipmentid,ztimestamp)
) ;
INSERT INTO B(equipmentid,ztimestamp,propertyvalue)
SELECT i,t, random()
FROM generate_series(1,1000) i
CROSS JOIN generate_series('2018-09-01','2018-09-30','1day'::interval) t
;
INSERT INTO A(equipmentid,ztimestamp,validuntil)
SELECT equipmentid,ztimestamp, ztimestamp+ '7 days'::interval
FROM B
WHERE date_part('dow', ztimestamp) =0
;
ANALYZE A;
ANALYZE B;
EXPLAIN
SELECT
A.*,
B.ztimestamp AS pressuretimestamp,
B.propertyvalue AS pressure
FROM A
INNER JOIN B
ON B.equipmentid = A.equipmentid
AND B.ztimestamp >= A.ztimestamp
AND B.ztimestamp < A.validuntil
WHERE A.equipmentid=333 -- I added this, the plan in the question also has a r estriction on Id
;
And the resulting plan:
SET
ANALYZE
ANALYZE
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.34..21.26 rows=17 width=40)
-> Index Scan using a_equipmentid_validuntil_key on a (cost=0.17..4.34 rows=5 width=24)
Index Cond: (equipmentid = 333)
-> Index Scan using b_pkey on b (cost=0.17..3.37 rows=3 width=24)
Index Cond: ((equipmentid = 333) AND (ztimestamp >= a.ztimestamp) AND (ztimestamp < a.validuntil))
(5 rowSET
That is with my current setting of random_page_cost=1.1;
After setting it to 4.0, I get the same plan as the OP:
SET
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=35.13..54561.69 rows=1416136 width=40) (actual time=1.391..1862.275 rows=225540 loops=1)
-> Bitmap Heap Scan on aa2 (cost=34.71..223.52 rows=1345 width=24) (actual time=1.173..5.223 rows=1345 loops=1)
Recheck Cond: (equipmentid = 5)
Heap Blocks: exact=9
-> Bitmap Index Scan on aa2_equipmentid_validuntil_key (cost=0.00..34.38 rows=1345 width=0) (actual time=1.047..1.048 rows=1345 loops=1)
Index Cond: (equipmentid = 5)
-> Index Scan using bb2_pkey on bb2 (cost=0.42..29.87 rows=1053 width=24) (actual time=0.109..0.757 rows=168 loops=1345)
Index Cond: ((equipmentid = 5) AND (ztimestamp >= aa2.ztimestamp) AND (ztimestamp < aa2.validuntil))
Planning Time: 3.167 ms
Execution Time: 2168.967 ms
(10 rows)

How to optimize a MAX SQL query with GROUP BY DATE

I'm trying to optimize a query from a table with 3M rows.
The columns are value, datetime and point_id.
SELECT DATE(datetime), MAX(value) FROM historical_points WHERE point_id=1 GROUP BY DATE(datetime);
This query takes 2 seconds.
I tried indexing the point_id=1 but the results were not much better.
Is it possible to index the MAX query or is there a better way to do it? Maybe with an INNER JOIN?
EDIT:
This is the explain analyze of similar one, that is tackling the case better. This one also ha performance problem.
EXPLAIN ANALYZE SELECT DATE(datetime), MAX(value), MIN(value) FROM buildings_hispoint WHERE point_id=64 AND datetime BETWEEN '2017-09-01 00:00:00' AND '2017-10-01 00:00:00' GROUP BY DATE(datetime);
>GroupAggregate (cost=84766.65..92710.99 rows=336803 width=68) (actual time=1461.060..2701.145 rows=21 loops=1)
> Group Key: (date(datetime))
> -> Sort (cost=84766.65..85700.23 rows=373430 width=14) (actual time=1408.445..1547.929 rows=523621 loops=1)
> Sort Key: (date(datetime))
> Sort Method: external sort Disk: 11944kB
> -> Bitmap Heap Scan on buildings_hispoint (cost=10476.02..43820.81 rows=373430 width=14) (actual time=148.970..731.154 rows=523621 loops=1)
> Recheck Cond: (point_id = 64)
> Filter: ((datetime >= '2017-09-01 00:00:00+02'::timestamp with time zone) AND (datetime Rows Removed by Filter: 35712
> Heap Blocks: exact=14422
> -> Bitmap Index Scan on buildings_measurementdatapoint_ffb10c68 (cost=0.00..10382.67 rows=561898 width=0) (actual time=125.150..125.150 rows=559333 loops=1)
> Index Cond: (point_id = 64)
>Planning time: 0.284 ms
>Execution time: 2704.566 ms

Without seeing EXPLAIN output is difficult to say something. My guess is that you must include DATE() call on index definition:
CREATE INDEX historical_points_idx ON historical_points (DATE(datetime), point_id);
Also, if point_id has more distinct values than DATE(datetime) then you must reverse column order:
CREATE INDEX historical_points_idx ON historical_points (point_id, DATE(datetime));
Keep in mind that cardinality of columns is very important to the planner, columns with high selectivity is preferred to go first.

SELECT DISTINCT ON (DATE(datetime)) DATE(datetime), value
FROM historical_points WHERE point_id=1
ORDER BY DATE(datetime) DESC, value DESC;
Put an computed index on DATE(datetime), value. [I hope those aren't your real column names. Using reserved words like VALUE as a column name is a recipe for confusion.]
The SELECT DISTINCT will work like a GROUP ON. The ORDER BY replaces the MAX, and will be fast if indexed.
I owe this technique to #ErwinBrandstetter.

Performance of max() vs ORDER BY DESC + LIMIT 1

I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:
When trying to extract the max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)
Time: 1314.544 ms
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)
Time: 10.890 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms
There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for both cases:
QUERY PLAN (ORDER BY)
--------------------------------------------------------------------------------------------------------
Limit (cost=0.43..9.47 rows=1 width=8)
-> Nested Loop (cost=0.43..396254.63 rows=43823 width=8)
Join Filter: (data.sensor_id = sensors.id)
-> Index Scan using timestamp_ind on data (cost=0.43..254918.66 rows=4710976 width=12)
-> Materialize (cost=0.00..6.70 rows=2 width=4)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
(7 rows)
QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3680.59..3680.60 rows=1 width=8)
-> Nested Loop (cost=0.43..3571.03 rows=43823 width=8)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12)
Index Cond: (sensor_id = sensors.id)
(6 rows)
So my two questions are:
Where does this performance difference come from? I've seen the accepted answer here MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an EXISTS check?
EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:
Table definitions:
Table "public.sensors"
Column | Type | Modifiers
----------------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('sensors_id_seq'::regclass)
station_id | integer | not null
....
Indexes:
"sensor_primary" PRIMARY KEY, btree (id)
"ind_station_id" btree (station_id, id)
"ind_station" btree (station_id)
Table "public.data"
Column | Type | Modifiers
-----------+--------------------------+------------------------------------------------------------------
id | integer | not null default nextval('data_id_seq'::regclass)
timestamp | timestamp with time zone | not null
sensor_id | integer | not null
avg | integer |
Indexes:
"timestamp_ind" btree ("timestamp" DESC)
"sensor_ind" btree (sensor_id)
"sensor_ind_timestamp" btree (sensor_id, "timestamp")
"sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)
Note that I added ind_station_id on sensors just now after #Erwin's suggestion below. Timings haven't really changed drastically, still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.
Query Plans:
QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
Buffers: shared hit=3418066 read=47326
-> Nested Loop (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
Join Filter: (data.sensor_id = sensors.id)
Buffers: shared hit=3418066 read=47326
-> Index Scan using timestamp_ind on data (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
Buffers: shared hit=3418065 read=47326
-> Materialize (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)
QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
Buffers: shared hit=1
-> Nested Loop (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) (never executed)
Index Cond: (sensor_id = sensors.id)
Heap Fetches: 0
Planning time: 0.435 ms
Execution time: 0.048 ms
(13 rows)
So just like in the earlier explains, ORDER BY does a Scan using timestamp_in on data, which is not done in the MAX case.
Postgres version:
Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit
Note that there are NOT NULL constraints in place, so ORDER BY won't have to sort over empty rows.
Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms) and then SELECT (~11ms).

There does not seem to be an index on sensor.station_id, which is important here.
There is an actual difference between max() and ORDER BY DESC + LIMIT 1. Many people seem to miss that. NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a NULL value if one exists, while the aggregate function max() ignores NULL values and returns the latest not-null timestamp. ORDER BY timestamp DESC NULLS LAST LIMIT 1 would be equivalent
For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve you best. I suggest these indexes (my query below builds on the 2nd one):
sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)
You can drop the other indexes sensor_ind_timestamp and sensor_ind_timestamp_desc unless they are in use otherwise (unlikely, but possible).
Much more importantly, there is another difficulty: The filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) in your added EXPLAIN output.
The perfect technique would be an index-skip-scan (a.k.a. loose index scan) for the second table data - which is not currently implemented (up to at least Postgres 15). There are various workarounds. See:
Optimize GROUP BY query to retrieve latest row per user
The best should be:
SELECT d.timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4
ORDER BY d.timestamp DESC NULLS LAST
LIMIT 1;
The choice between max() and ORDER BY / LIMIT hardly matters in comparison. You might as well:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4;
Or:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT max(timestamp) AS timestamp
FROM data
WHERE sensor_id = s.id
) d
WHERE s.station_id = 4;
Or even with a correlated subquery, shortest of all:
SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM sensors s
WHERE station_id = 4;
Note the double parentheses!
The additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).
Related:
Why do NULL values come first when ordering DESC in a PostgreSQL query?
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Select first row in each GROUP BY group?
Optimize groupwise maximum query

The query plan shows index names timestamp_ind and timestamp_sensor_ind. But indexes like that do not help with a search for a particular sensor.
To resolve an equals query (like sensor.id = data.sensor_id) the column has to be the first in the index. Try to add an index that allows searching on sensor_id and, within a sensor, is sorted by timestamp:
create index sensor_timestamp_ind on data(sensor_id, timestamp);
Does adding that index speed up the query?

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas