How to optimize a MAX SQL query with GROUP BY DATE - sql

I'm trying to optimize a query from a table with 3M rows.
The columns are value, datetime and point_id.
SELECT DATE(datetime), MAX(value) FROM historical_points WHERE point_id=1 GROUP BY DATE(datetime);
This query takes 2 seconds.
I tried indexing the point_id=1 but the results were not much better.
Is it possible to index the MAX query or is there a better way to do it? Maybe with an INNER JOIN?
EDIT:
This is the explain analyze of similar one, that is tackling the case better. This one also ha performance problem.
EXPLAIN ANALYZE SELECT DATE(datetime), MAX(value), MIN(value) FROM buildings_hispoint WHERE point_id=64 AND datetime BETWEEN '2017-09-01 00:00:00' AND '2017-10-01 00:00:00' GROUP BY DATE(datetime);
>GroupAggregate (cost=84766.65..92710.99 rows=336803 width=68) (actual time=1461.060..2701.145 rows=21 loops=1)
> Group Key: (date(datetime))
> -> Sort (cost=84766.65..85700.23 rows=373430 width=14) (actual time=1408.445..1547.929 rows=523621 loops=1)
> Sort Key: (date(datetime))
> Sort Method: external sort Disk: 11944kB
> -> Bitmap Heap Scan on buildings_hispoint (cost=10476.02..43820.81 rows=373430 width=14) (actual time=148.970..731.154 rows=523621 loops=1)
> Recheck Cond: (point_id = 64)
> Filter: ((datetime >= '2017-09-01 00:00:00+02'::timestamp with time zone) AND (datetime Rows Removed by Filter: 35712
> Heap Blocks: exact=14422
> -> Bitmap Index Scan on buildings_measurementdatapoint_ffb10c68 (cost=0.00..10382.67 rows=561898 width=0) (actual time=125.150..125.150 rows=559333 loops=1)
> Index Cond: (point_id = 64)
>Planning time: 0.284 ms
>Execution time: 2704.566 ms

Without seeing EXPLAIN output is difficult to say something. My guess is that you must include DATE() call on index definition:
CREATE INDEX historical_points_idx ON historical_points (DATE(datetime), point_id);
Also, if point_id has more distinct values than DATE(datetime) then you must reverse column order:
CREATE INDEX historical_points_idx ON historical_points (point_id, DATE(datetime));
Keep in mind that cardinality of columns is very important to the planner, columns with high selectivity is preferred to go first.

SELECT DISTINCT ON (DATE(datetime)) DATE(datetime), value
FROM historical_points WHERE point_id=1
ORDER BY DATE(datetime) DESC, value DESC;
Put an computed index on DATE(datetime), value. [I hope those aren't your real column names. Using reserved words like VALUE as a column name is a recipe for confusion.]
The SELECT DISTINCT will work like a GROUP ON. The ORDER BY replaces the MAX, and will be fast if indexed.
I owe this technique to #ErwinBrandstetter.

Related

Why is my query uses filtering instead of index cond when I use an `OR` condition?

I have a transactions table in PostgreSQL with block_height and index as BIGINT values. Those two values are used for determining the order of the transactions in this table.
So if I want to query transactions from this table that comes after the given block_height and index, I'd have to put this on the condition
If two transactions are in the same block_height, then check the ordering of their index
Otherwise compare their block_height
For example if I want to get 10 transactions that came after block_height 100000 and index 5:
SELECT * FROM transactions
WHERE (
(block_height = 10000 AND index > 5)
OR (block_height > 10000)
)
ORDER BY block_height, index ASC
LIMIT 10
However I find this query to be extremely slow, it took up to 60 seconds for a table with 50 million rows.
However if I split up the condition and run them individually like so:
SELECT * FROM transactions
WHERE block_height = 10000 AND index > 5
ORDER BY block_height, index ASC
LIMIT 10
and
SELECT * FROM transactions
WHERE block_height > 10000
ORDER BY block_height, index ASC
LIMIT 10
Both queries took at most 200ms on the same table! It is much faster to do both queries and then UNION the final result instead of putting an OR in the condition.
This is the part of the query plan for the slow query (OR-ed condition):
-> Nested Loop (cost=0.98..11689726.68 rows=68631 width=73) (actual time=10230.480..10234.289 rows=10 loops=1)
-> Index Scan using src_transactions_block_height_index on src_transactions (cost=0.56..3592792.96 rows=16855334 width=73) (actual time=10215.698..10219.004 rows=1364 loops=1)
Filter: (((block_height = $1) AND (index > $2)) OR (block_height > $3))
Rows Removed by Filter: 2728151
And this is the query plan for the fast query:
-> Nested Loop (cost=0.85..52.62 rows=1 width=73) (actual time=0.014..0.014 rows=0 loops=1)
-> Index Scan using src_transactions_block_height_index on src_transactions (cost=0.43..22.22 rows=5 width=73) (actual time=0.014..0.014 rows=0 loops=1)
Index Cond: ((block_height = $1) AND (index > $2))
I see the main difference to be the use of Filter instead of Index Cond between the query plans.
Is there any way to do this query in a performant way without resorting to the UNION workaround?
The fact that block_height is compared to two different parameters which you know just happen to be equal might be a problem. What if you use $1 twice, rather than $1 and $3?
But better yet, try a tuple comparison
WHERE (block_height, index) > (10000, 5)
This can become fast with a two-column index on (block_height, index).

Postgres: which index to add

I have a table mainly used by this query (only 3 columns are in use here, meter, timeStampUtc and createdOnUtc, but there are other in the table), which starts to take too long:
select
rank() over (order by mr.meter, mr."timeStampUtc") as row_name
, max(mr."createdOnUtc") over (partition by mr.meter, mr."timeStampUtc") as "createdOnUtc"
from
"MeterReading" mr
where
"createdOnUtc" >= '2021-01-01'
order by row_name
;
(this is the minimal query to show my issue. It might not make too much sense on its own, or could be rewritten)
I am wondering which index (or other technique) to use to optimise this particular query.
A basic index on createdOnUtc helps already.
I am mostly wondering about those 2 windows functions. They are very similar, so I factorised them (named window with thus identical partition by and order by), it had no effect. Adding an index on meter, "timeStampUtc" had no effect either (query plan unchanged).
Is there no way to use an index on those 2 columns inside a window function?
Edit - explain analyze output: using the createdOnUtc index
Sort (cost=8.51..8.51 rows=1 width=40) (actual time=61.045..62.222 rows=26954 loops=1)
Sort Key: (rank() OVER (?))
Sort Method: quicksort Memory: 2874kB
-> WindowAgg (cost=8.46..8.50 rows=1 width=40) (actual time=18.373..57.892 rows=26954 loops=1)
-> WindowAgg (cost=8.46..8.48 rows=1 width=40) (actual time=18.363..32.444 rows=26954 loops=1)
-> Sort (cost=8.46..8.46 rows=1 width=32) (actual time=18.353..19.663 rows=26954 loops=1)
Sort Key: meter, "timeStampUtc"
Sort Method: quicksort Memory: 2874kB
-> Index Scan using "MeterReading_createdOnUtc_idx" on "MeterReading" mr (cost=0.43..8.45 rows=1 width=32) (actual time=0.068..8.059 rows=26954 loops=1)
Index Cond: ("createdOnUtc" >= '2021-01-01 00:00:00'::timestamp without time zone)
Planning Time: 0.082 ms
Execution Time: 63.698 ms
Is there no way to use an index on those 2 columns inside a window function?
That is correct; a window function cannot use an index, as the work only on what otherwise would be the final result, all data selection has already finished. From the documentation.
The rows considered by a window function are those of the “virtual
table” produced by the query's FROM clause as filtered by its WHERE,
GROUP BY, and HAVING clauses if any. For example, a row removed
because it does not meet the WHERE condition is not seen by any window
function. A query can contain multiple window functions that slice up
the data in different ways using different OVER clauses, but they all
act on the same collection of rows defined by this virtual table.
The purpose of an index is to speed up the creation of that "virtual table". Applying an index would just slow things down: the data is already in memory. Scanning it is orders of magnitude faster any any index.

Slow LEFT JOIN on CTE with time intervals

I am trying to debug a query in PostgreSQL that I've built to bucket market data in time buckets in arbitrary time intervals. Here is my table definition:
CREATE TABLE historical_ohlcv (
exchange_symbol TEXT NOT NULL,
symbol_id TEXT NOT NULL,
kafka_key TEXT NOT NULL,
open NUMERIC,
high NUMERIC,
low NUMERIC,
close NUMERIC,
volume NUMERIC,
time_open TIMESTAMP WITH TIME ZONE NOT NULL,
time_close TIMESTAMP WITH TIME ZONE,
CONSTRAINT historical_ohlcv_pkey
PRIMARY KEY (exchange_symbol, symbol_id, time_open)
);
CREATE INDEX symbol_id_idx
ON historical_ohlcv (symbol_id);
CREATE INDEX open_close_symbol_id
ON historical_ohlcv (time_open, time_close, exchange_symbol, symbol_id);
CREATE INDEX time_open_idx
ON historical_ohlcv (time_open);
CREATE INDEX time_close_idx
ON historical_ohlcv (time_close);
The table has ~25m rows currently. My query as an example for 1 hour, but could be 5 mins, 10 mins, 2 days, etc.
EXPLAIN ANALYZE WITH vals AS (
SELECT
NOW() - '5 months' :: INTERVAL AS frame_start,
NOW() AS frame_end,
INTERVAL '1 hour' AS t_interval
)
, grid AS (
SELECT
start_time,
lead(start_time, 1)
OVER (
ORDER BY start_time ) AS end_time
FROM (
SELECT
generate_series(frame_start, frame_end,
t_interval) AS start_time,
frame_end
FROM vals
) AS x
)
SELECT max(high)
FROM grid g
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
GROUP BY start_time;
The WHERE clause could be any valid value in the table.
This technique was inspired by:
Best way to count records by arbitrary time intervals in Rails+Postgres.
The idea is to make a common table and left join your data with that to indicate which bucket stuff is in. This query is really slow! It's currently taking 15s. Based on the query planner, we have a really expensive nested loop:
QUERY PLAN
HashAggregate (cost=2758432.05..2758434.05 rows=200 width=40) (actual time=16023.713..16023.817 rows=542 loops=1)
Group Key: g.start_time
CTE vals
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
CTE grid
-> WindowAgg (cost=64.86..82.36 rows=1000 width=16) (actual time=2.986..9.594 rows=3625 loops=1)
-> Sort (cost=64.86..67.36 rows=1000 width=8) (actual time=2.981..4.014 rows=3625 loops=1)
Sort Key: x.start_time
Sort Method: quicksort Memory: 266kB
-> Subquery Scan on x (cost=0.00..15.03 rows=1000 width=8) (actual time=0.014..1.991 rows=3625 loops=1)
-> ProjectSet (cost=0.00..5.03 rows=1000 width=16) (actual time=0.013..1.048 rows=3625 loops=1)
-> CTE Scan on vals (cost=0.00..0.02 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1)
-> Nested Loop (cost=0.56..2694021.34 rows=12865667 width=14) (actual time=7051.730..16015.873 rows=31978 loops=1)
-> CTE Scan on grid g (cost=0.00..20.00 rows=1000 width=16) (actual time=2.988..11.635 rows=3625 loops=1)
-> Index Scan using historical_ohlcv_pkey on historical_ohlcv ohlcv (cost=0.56..2565.34 rows=12866 width=22) (actual time=3.712..4.413 rows=9 loops=3625)
Index Cond: ((exchange_symbol = 'BINANCE'::text) AND (symbol_id = 'ETHBTC'::text) AND (time_open >= g.start_time))
Filter: (time_close < g.end_time)
Rows Removed by Filter: 15502
Planning time: 0.568 ms
Execution time: 16023.979 ms
My guess is this line is doing a lot:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
But I'm not sure how to accomplish this in another way.
P.S. apologies if this belongs to dba.SE. I read the FAQ and this seemed too basic for that site, so I posted here.
Edits as requested:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1); returns 107.632
For exchange_symbol, there are 3 unique values, for symbol_id there are ~400
PostgreSQL version: PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit.
The table will be growing about ~1m records a day, so not exactly read-only. All this stuff is done locally and I will try to move to RDS or to help manage hardware issues.
Related: if I wanted to add other aggregates, specifically 'first in the bucket', 'last in the bucket', min, sum, would my indexing strategy change?
Correctness first: I suspect a bug in your query:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
Unlike my referenced answer, you join on a time interval: (time_open, time_close]. The way you do it excludes rows in the table where the interval crosses bucket borders. Only intervals fully contained in a single bucket count. I don't think that's intended?
A simple fix would be to decide bucket membership based on time_open (or time_close) alone. If you want to keep working with both, you have to define exactly how to deal with intervals overlapping with multiple buckets.
Also, you are looking for max(high) per bucket, which is different in nature from count(*) in my referenced answer.
And your buckets are simple intervals per hour?
Then we can radically simplify. Working with just time_open:
SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
FROM historical_ohlcv
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
AND time_open >= now() - interval '5 months' -- frame_start
AND time_open < now() -- frame_end
GROUP BY 1
ORDER BY 1;
Related:
Resample on time series data
It's hard to talk about further performance optimization while basics are unclear. And we'd need more information.
Are WHERE conditions variable?
How many distinct values in exchange_symbol and symbol_id?
Avg. row size? What do you get for:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);
Is the table read-only?
Assuming you always filter on exchange_symbol and symbol_id and values are variable, your table is read-only or autovacuum can keep up with the write load so we can hope for index-only scans, you would best have a multicolumn index on (exchange_symbol, symbol_id, time_open, high DESC) to support this query. Index columns in this order. Related:
Multicolumn index and performance
Depending on data distribution and other details a LEFT JOIN LATERAL solution might be another option. Related:
How to find an average of values for time intervals in postgres
Optimize GROUP BY query to retrieve latest record per user
Aside from all that, you EXPLAIN plan exhibits some very bad estimates:
https://explain.depesz.com/s/E5yI
Are you using a current version of Postgres? You may have to work on your server configuration - or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table. Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Aggressive Autovacuum on PostgreSQL

Query Issue - Postgres /pgadmin4

My query is taking hours to run, its over 9M rows in Postgres 9.3 in PgAdmin 4. The data is all string due to the format it came in, i had to use a subquery to break it into columns. So in order to craete the aggregated views i cast each of the elements i need. Because there are millions of rows, there are data elements that have nulls at some point in the table, so i needed the where statement. I have indecies for each of the items in the where clause. All basic btree indexes. Example: Create index loan_age_idx on "gnma2" using btree ("Loan_Age"). The where statement though which is required due to the cast functions are resulting in zero rows returned. What can i do to remove the wheres but keep the cast statement?
create table "t1" as
Select
gnma2."Issuer_ID",
gnma2."As_of_Date",
gnma2."Agency",
gnma2."Loan_Purpose",
gnma2."State",
gnma2."Months_Delinquent",
count(gnma2."Disclosure_Sequence_Number") as "Total_Loan_Count",
avg(cast(gnma2."Loan_Interest_Rate" as double precision))/1000 as "avg_int_rate",
avg(cast(gnma2."Original_Principal_Balance" as real))/100 as "avg_OUPB",
avg(cast(gnma2."Unpaid_Principal_Balance" as real))/100 as "avg_UPB",
avg(cast(gnma2."Loan_Age" as real)) as "avg_loan_age",
avg(cast(gnma2."Loan_To_Value" as real))/100 as "avg_LTV",
avg(cast(gnma2."Total_Debt_Expense_Ratio_Percent" as real))/100 as "avg_DTI",
avg(cast(gnma2."Credit_Score" as real)) as "avg_credit_score",
left(gnma2."First_Payment_Date",4) as "Origination_Yr"
From public."gnma2"
where
"Loan_Age" >= '1' and
"Loan_To_Value" >= '0' and
"Total_Debt_Expense_Ratio_Percent" >= '0' and
"Credit_Score" >= '0' and
"Loan_Interest_Rate" >= '0' and
"Original_Principal_Balance" >= '1' and
"Unpaid_Principal_Balance" >= '0'
Group By
gnma2."Issuer_ID",
gnma2."Agency",
gnma2."Loan_Purpose",
gnma2."State",
gnma2."As_of_Date",
gnma2."Months_Delinquent",
gnma2."First_Payment_Date";
QUERY PLAN
HashAggregate (cost=3496556.07..3496556.65 rows=13 width=91) (actual
time=124214.207..124214.207 rows=0 loops=1)
-> Bitmap Heap Scan on gnma2
(cost=166750.75..3496546.65 rows=130 width=91) (actual time=124214.202..124214.20
2 rows=0 loops=1)
Recheck Cond: ("Loan_Age" >= '1'::bpchar)
Rows Removed by Index Recheck: 4785233
Filter: (("Loan_To_Value" >= '0'::bpchar) AND ("Total_Debt_Expense_Ratio_Percent" >= '0'::bpchar) AND ("Cr
edit_Score" >= '0'::bpchar) AND ("Loan_Interest_Rate" >= '0'::bpchar) AND ("Original_Principal_Balance" >= '1'::bpc
har) AND ("Unpaid_Principal_Balance" >= '0'::bpchar))
Rows Removed by Filter: 9311198
-> Bitmap Index Scan on loan_age_indx (cost=0.00..166750.72 rows=9029087 width=0) (actual time=5015.865.
.5015.865 rows=9311198 loops=1)
Index Cond: ("Loan_Age" >= '1'::bpchar)
Total runtime: 124214.524 ms

Performance of max() vs ORDER BY DESC + LIMIT 1

I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:
When trying to extract the max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)
Time: 1314.544 ms
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)
Time: 10.890 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms
There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for both cases:
QUERY PLAN (ORDER BY)
--------------------------------------------------------------------------------------------------------
Limit (cost=0.43..9.47 rows=1 width=8)
-> Nested Loop (cost=0.43..396254.63 rows=43823 width=8)
Join Filter: (data.sensor_id = sensors.id)
-> Index Scan using timestamp_ind on data (cost=0.43..254918.66 rows=4710976 width=12)
-> Materialize (cost=0.00..6.70 rows=2 width=4)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
(7 rows)
QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3680.59..3680.60 rows=1 width=8)
-> Nested Loop (cost=0.43..3571.03 rows=43823 width=8)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12)
Index Cond: (sensor_id = sensors.id)
(6 rows)
So my two questions are:
Where does this performance difference come from? I've seen the accepted answer here MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an EXISTS check?
EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:
Table definitions:
Table "public.sensors"
Column | Type | Modifiers
----------------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('sensors_id_seq'::regclass)
station_id | integer | not null
....
Indexes:
"sensor_primary" PRIMARY KEY, btree (id)
"ind_station_id" btree (station_id, id)
"ind_station" btree (station_id)
Table "public.data"
Column | Type | Modifiers
-----------+--------------------------+------------------------------------------------------------------
id | integer | not null default nextval('data_id_seq'::regclass)
timestamp | timestamp with time zone | not null
sensor_id | integer | not null
avg | integer |
Indexes:
"timestamp_ind" btree ("timestamp" DESC)
"sensor_ind" btree (sensor_id)
"sensor_ind_timestamp" btree (sensor_id, "timestamp")
"sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)
Note that I added ind_station_id on sensors just now after #Erwin's suggestion below. Timings haven't really changed drastically, still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.
Query Plans:
QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
Buffers: shared hit=3418066 read=47326
-> Nested Loop (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
Join Filter: (data.sensor_id = sensors.id)
Buffers: shared hit=3418066 read=47326
-> Index Scan using timestamp_ind on data (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
Buffers: shared hit=3418065 read=47326
-> Materialize (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)
QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
Buffers: shared hit=1
-> Nested Loop (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) (never executed)
Index Cond: (sensor_id = sensors.id)
Heap Fetches: 0
Planning time: 0.435 ms
Execution time: 0.048 ms
(13 rows)
So just like in the earlier explains, ORDER BY does a Scan using timestamp_in on data, which is not done in the MAX case.
Postgres version:
Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit
Note that there are NOT NULL constraints in place, so ORDER BY won't have to sort over empty rows.
Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms) and then SELECT (~11ms).
There does not seem to be an index on sensor.station_id, which is important here.
There is an actual difference between max() and ORDER BY DESC + LIMIT 1. Many people seem to miss that. NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a NULL value if one exists, while the aggregate function max() ignores NULL values and returns the latest not-null timestamp. ORDER BY timestamp DESC NULLS LAST LIMIT 1 would be equivalent
For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve you best. I suggest these indexes (my query below builds on the 2nd one):
sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)
You can drop the other indexes sensor_ind_timestamp and sensor_ind_timestamp_desc unless they are in use otherwise (unlikely, but possible).
Much more importantly, there is another difficulty: The filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) in your added EXPLAIN output.
The perfect technique would be an index-skip-scan (a.k.a. loose index scan) for the second table data - which is not currently implemented (up to at least Postgres 15). There are various workarounds. See:
Optimize GROUP BY query to retrieve latest row per user
The best should be:
SELECT d.timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4
ORDER BY d.timestamp DESC NULLS LAST
LIMIT 1;
The choice between max() and ORDER BY / LIMIT hardly matters in comparison. You might as well:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4;
Or:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT max(timestamp) AS timestamp
FROM data
WHERE sensor_id = s.id
) d
WHERE s.station_id = 4;
Or even with a correlated subquery, shortest of all:
SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM sensors s
WHERE station_id = 4;
Note the double parentheses!
The additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).
Related:
Why do NULL values come first when ordering DESC in a PostgreSQL query?
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Select first row in each GROUP BY group?
Optimize groupwise maximum query
The query plan shows index names timestamp_ind and timestamp_sensor_ind. But indexes like that do not help with a search for a particular sensor.
To resolve an equals query (like sensor.id = data.sensor_id) the column has to be the first in the index. Try to add an index that allows searching on sensor_id and, within a sensor, is sorted by timestamp:
create index sensor_timestamp_ind on data(sensor_id, timestamp);
Does adding that index speed up the query?