Query optimization: how to achieve that in this query?

How can I optimize this query? I have created indexes and partitions and increased worker memory, but the execution time is still 35 s. How can I bring it down to 10-15 seconds?
Update:
Removed the conversion of every timestamp from UTC to local time, i.e. time_stamp AT TIME ZONE 'utc' AT TIME ZONE, which improved performance by approximately 5 seconds. Current execution time: 36.5 seconds.
explain analyse select
DATE_TRUNC('day', time_stamp) as "time_stamp",
COUNT(DISTINCT id) AS alarm_count,
COUNT(DISTINCT patient_id) AS patient_count
FROM
alarm_management.alarm
WHERE
tenant_name = 'abc'
and
unit = ANY('{a,b,c,d,e,f,g,h,i,j,k}'::text[])
AND
time_stamp BETWEEN '2021-09-15 02:25:00' AND '2021-12-14 04:36:45'
AND
severity_label = ANY('{a,b,c,d}'::text[])
AND derived_label IS NOT NULL
GROUP by 1
Explain(analyze, verbose, buffers) output-
GroupAggregate (cost=3064683.77..3215681.44 rows=308821 width=24) (actual time=24242.730..35145.380 rows=91 loops=1)
Group Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
-> Sort (cost=3064683.77..3101468.12 rows=14713740 width=40) (actual time=24167.513..25036.293 rows=16369464 loops=1)
Sort Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
Sort Method: quicksort Memory: 1672081kB
-> Append (cost=0.00..1312964.42 rows=14713740 width=40) (actual time=0.308..20958.290 rows=16369464 loops=1)
-> Seq Scan on alarm_hospitalc_burn_2021_9 (cost=0.00..7175.10 rows=69691 width=40) (actual time=0.307..127.521 rows=94286 loops=1)
Filter: ((derived_label IS NOT NULL) AND (time_stamp >= '2021-09-15 02:25:00'::timestamp without time zone) AND (time_stamp <= '2021-12-14 04:36:45'::timestamp without time zone) AND (tenant_name = 'HospitalC'::text) AND (severity_label = ANY ('{"Short Yellow",Cyan,Red,Yellow}'::text[])) AND (unit = ANY ('{Burn,Delivery,EDI,EDT,EDW,ICU1,ICU2,ICU3P,ICU4P,PP,Tele}'::text[])))

The function can be written in SQL, which might be slightly faster:
CREATE OR REPLACE FUNCTION dbo.get_time_group ( _date_type TEXT )
RETURNS TEXT
LANGUAGE sql -- SQL is good enough
IMMUTABLE -- better for performance, next call is faster because of caching
AS
$$
SELECT CASE $1
WHEN 'hour' THEN 'hour'
ELSE 'day'
END;
$$;
But the most important thing is the query plan.

Related

PostgreSQL: get latest row for each time interval

I have the following table. It is stored as a TimescaleDB hypertable. Data rate is 1 row per second.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ("time", meter_id)
)
I would like to get the latest row in a given time interval, over a period of time.
For instance the latest record each month for the previous year.
The following query works but is slow:
EXPLAIN ANALYZE
SELECT
DISTINCT ON (bucket)
time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-01-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY bucket DESC
Unique (cost=0.42..542380.99 rows=200 width=40) (actual time=3654.263..59130.398 rows=12 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..514045.41 rows=11334231 width=40) (actual time=3654.260..58255.396 rows=11161474 loops=1)
Order: time_bucket('1 mon'::interval, electricity_data."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval) DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=3654.253..3986.885 rows=255582 loops=1)
Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Rows Removed by Filter: 24330
-> Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (actual time=1.468..1810.493 rows=603808 loops=1)
Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
Planning Time: 57.424 ms
JIT:
Functions: 217
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 43.496 ms, Inlining 18.805 ms, Optimization 2348.206 ms, Emission 1288.087 ms, Total 3698.594 ms
Execution Time: 59176.016 ms
Getting the latest row for a single month is instantaneous:
EXPLAIN ANALYZE
SELECT
"time",
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-12-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY "time" DESC
LIMIT 1
Limit (cost=0.42..0.47 rows=1 width=40) (actual time=0.048..0.050 rows=1 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.047..0.048 rows=1 loops=1)
Order: electricity_data."time" DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
-> Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (never executed)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
-> Index Scan using _hyper_12_1512_chunk_electricity_data_time_idx on _hyper_12_1512_chunk (cost=0.42..8.94 rows=174 width=40) (never executed)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Planning Time: 2.162 ms
Execution Time: 0.152 ms
Is there a way to execute the query above for each month or custom time interval? Or is there a different way to speed up the first query?
Edit
The following query takes 10 seconds, which is much better, but still slower than the manual approach. An index does not seem to make a difference.
EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');
(... plan removed)
Planning Time: 50.463 ms
JIT:
Functions: 451
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
Execution Time: 9910.058 ms
I'd recommend using the last aggregate and a continuous aggregate to solve this problem.
Like the previous poster, I'd also recommend an index on (meter_id, "time") rather than the other way around. You can do this in your table definition by changing the order of the keys in your primary key definition.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ( meter_id, "time")
);
But that's a bit off topic. The basic query you'll want to do is something like:
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
meter_id,
last(electricity_data, "time")
FROM electricity_data
GROUP BY 1, 2;
This is a bit confusing until you realize that the table itself is also a type in PostgreSQL - so you can ask for and return a composite type from this call to the last aggregate, which will get the latest value in the month or day or whatever you want.
Then you have to be able to treat that as a row again, so you can expand that by using parentheses and a .* which is how composite types can be expanded in PG.
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
meter_id,
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1,2;
Now, in order to speed things up, you can turn that into a continuous aggregate which will make things much faster.
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
You'll note that I took the meter_id out of the initial select list because that's gonna come from our composite type and I don't need the redundant column, nor can I have two columns with the same name in a view, but I did keep meter_id in my group by.
So that'll speed things up nicely, but, if I were you, I might actually think about doing this on a daily basis and creating a hierarchical continuous aggregate for this type of thing.
CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month',time_bucket, 'Europe/Amsterdam') as month_bucket,
(last(last_meter_day, time_bucket)).*
FROM last_meter_day
GROUP BY 1, meter_id;
The reason is that we can't really refresh a monthly continuous aggregate all that often; it's much easier to refresh a daily aggregate frequently and then roll that up into a monthly aggregate. You could also keep only the daily aggregate and roll up to months on the fly in your query, since that would be at most 31 days per meter per month, but of course that won't be as performant.
You'll then have to create continuous aggregate policies for these based on what you want to have happen on refresh.
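As a rough sketch of what such policies could look like, assuming the hierarchical views above are named last_meter_day and last_meter_month as in this answer. The offsets and schedule intervals are illustrative assumptions, not recommendations, and should be tuned to how late your data can arrive:

```sql
-- Refresh the daily aggregate every hour, covering the window from
-- 3 days ago up to 1 hour ago (offset values are assumptions).
SELECT add_continuous_aggregate_policy('last_meter_day',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- Refresh the monthly rollup once a day, covering the last 3 months
-- so the refresh window spans more than two monthly buckets.
SELECT add_continuous_aggregate_policy('last_meter_month',
    start_offset      => INTERVAL '3 months',
    end_offset        => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 day');
```

The refresh window of each policy has to span at least two buckets of the aggregate it refreshes, which is why the monthly policy looks back several months.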
I'd also suggest, depending on what you're trying to do with this, that you might want to take a look at counter_agg as it might be useful for you. I also recently wrote a post in our forum about how to use it with electricity meters that might be helpful for you depending on how you're processing this data.
You can try an approach that uses a subquery to get the timestamp of the latest time in each bucket. Then, join that to your detail table.
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
That gets you a virtual table with the latest time for each meter for each time bucket (month in this case). It can be accelerated with this index, the same as your primary key but with the columns in the opposite order. With the columns in that order the query can be satisfied with a relatively quick index scan.
CREATE INDEX meter_time ON electricity_data (meter_id, "time")
Then join that to your detail table. Like this.
SELECT d.meter_id,
time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
d."time",
d.import_low,
d.import_normal,
d.export_low,
d.export_normal
FROM electricity_data d
JOIN (
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
) last ON d."time" = last."time"
AND d.meter_id = last.meter_id
ORDER BY d.meter_id, bucket DESC
(I'm not completely sure of the syntax in TimescaleDB for columns that have the same name as reserved words like time, so this isn't tested.)
If you want just one meter, put a WHERE clause right before the last ORDER BY clause.
The other answers are likely more useful in most cases.
I wanted a solution that works for any interval,
without the need for continuous aggregates.
I ended up with the following query, using a lateral join. I use the lag function to compute energy consumption/generation in a time bucket (omitted below). Variables $__interval, $__timeFrom() and $__timeTo() specify the chosen bucket interval and time range.
SELECT bucket, import_low, import_normal, export_low, export_normal
FROM (
SELECT
tstzrange(
-- Could also use date_trunc or date_bin
time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
'(]' -- We use an inclusive upper bound, because a meter reading on the upper boundary applies to the previous period
) bucket
FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
) buckets
LEFT JOIN LATERAL (
SELECT *
FROM electricity_data
WHERE meter_id = $meterId AND "time" <@ bucket
ORDER BY "time" DESC
LIMIT 1
) elec ON true
ORDER BY bucket;

How can I utilize a partial index for the calculated filter condition in a where clause?

Let's say I have this simple query:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= (current_date - INTERVAL '5 days');
This is a 1GB+ table with a partial index on the created_at column. When I run this query it does a sequential scan and does not use my index, which obviously takes a long time:
Aggregate (cost=345853.43..345853.44 rows=1 width=8) (actual time=27126.426..27126.426 rows=1 loops=1)
-> Seq Scan on audiences a (cost=0.00..345840.46 rows=5188 width=0) (actual time=97.564..27124.317 rows=8029 loops=1)
Filter: (created_at >= (('now'::cstring)::date - '5 days'::interval))
Rows Removed by Filter: 2215612
Planning time: 0.131 ms
Execution time: 27126.458 ms
On the other hand if I'd have a "hardcoded" (or pre-calculated) value like this:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00';
It would utilise an index on created_at:
Aggregate (cost=253.18..253.19 rows=1 width=8) (actual time=1014.655..1014.655 rows=1 loops=1)
-> Index Only Scan using index_audiences_on_created_at on audiences a (cost=0.29..240.21 rows=5188 width=0) (actual time=1.308..1011.071 rows=8029 loops=1)
Index Cond: (created_at >= '2020-10-16 00:00:00'::timestamp without time zone)
Heap Fetches: 6185
Planning time: 1.878 ms
Execution time: 1014.716 ms
If I could I'd just use an ORM and generate a query with the right value but I can't. Is there a way I can maybe pre-calculate this timestamp and use it in a WHERE clause via plain SQL?
Adding a little bit of tech info of my setup.
PostgreSQL version: 9.6.11
created_at column type is: timestamp
index: "index_audiences_on_created_at" btree (created_at) WHERE created_at > '2020-10-01 00:00:00'::timestamp without time zone
This is not an exact answer, but it can work in a specific situation.
Your index has the predicate (created_at > '2020-10-01 00:00:00'::timestamp without time zone). If the filter value is always later than the predicate's cutoff, you can add a redundant constant condition to the WHERE clause so that the planner can match the index predicate:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00'
and
a.created_at >= (current_date - INTERVAL '5 days');
Note: instead of TIMESTAMP you may have to write TIMESTAMP WITHOUT TIME ZONE or TIMESTAMP WITH TIME ZONE, depending on the column type.
You have a partial index, and the optimizer is not smart enough to evaluate the expression in the where clause and then choose the partial index based on the expression's result.
So there is not much you can do, except creating an index without a WHERE clause.
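A minimal sketch of that option, assuming the audiences table from the question. The index name is hypothetical, and whether the extra disk space and write overhead are acceptable depends on your workload:

```sql
-- Hypothetical index name. CONCURRENTLY avoids blocking writes while the
-- index is built, but it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY index_audiences_on_created_at_full
    ON audiences (created_at);

-- The original query can then be planned against the full index,
-- because no index predicate has to be proven at planning time.
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM audiences a
WHERE a.created_at >= (current_date - INTERVAL '5 days');
```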
From my experience, the best approach is to create a PL/pgSQL function. I think the problem is that it is evaluating (current_date - INTERVAL '5 days') for every row.
So you would have to create a function that evaluates it once and then uses the result in the query. Something like:
CREATE OR REPLACE FUNCTION count_audiences(
vinterval text -- to send interval dynamically
)
RETURNS INTEGER
AS $$
DECLARE
vdate timestamp;
vcount integer := 0;
BEGIN
EXECUTE 'SELECT current_date - INTERVAL '''||vinterval||'''' INTO vdate;-- obtain date
SELECT
COUNT(*)
INTO
vcount
FROM
audiences a
WHERE
a.created_at >= vdate;
RETURN vcount;
END;
$$ LANGUAGE plpgsql;
After creating the function, you just have to call it like:
SELECT * FROM count_audiences('5 days');
This way you can also use an ORM to call the function.
Works here (given a usable index on created_at):
select version();
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
tweets a
WHERE
a.created_at >= (current_date - INTERVAL '5 days')
;
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
tweets a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00'
;
\d tweets
Output:
version
-------------------------------------------------------------------------------------------------------
PostgreSQL 11.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
(1 row)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2.90..2.91 rows=1 width=8) (actual time=0.100..0.101 rows=1 loops=1)
-> Index Only Scan using tweets_du_idx on tweets a (cost=0.56..2.90 rows=1 width=0) (actual time=0.088..0.088 rows=0 loops=1)
Index Cond: (created_at >= (CURRENT_DATE - '5 days'::interval))
Heap Fetches: 0
Planning Time: 2.357 ms
Execution Time: 0.217 ms
(6 rows)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2.89..2.90 rows=1 width=8) (actual time=0.016..0.017 rows=1 loops=1)
-> Index Only Scan using tweets_du_idx on tweets a (cost=0.56..2.89 rows=1 width=0) (actual time=0.014..0.014 rows=0 loops=1)
Index Cond: (created_at >= '2020-10-16 00:00:00'::timestamp without time zone)
Heap Fetches: 0
Planning Time: 0.163 ms
Execution Time: 0.045 ms
(6 rows)
Table "public.tweets"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+---------
seq | bigint | | not null |
id | bigint | | not null |
user_id | bigint | | not null |
in_reply_to_id | bigint | | not null | 0
parent_seq | bigint | | not null | 0
sucker_id | integer | | not null | 0
created_at | timestamp with time zone | | |
[snip]
body | text | | |
zoek | tsvector | | |
Indexes:
"tweets_pkey" PRIMARY KEY, btree (seq)
"tweets_id_key" UNIQUE CONSTRAINT, btree (id)
"tweets_stamp_idx" UNIQUE, btree (fetch_stamp, seq)
"tweets_userid_id" UNIQUE, btree (user_id, id)
"tweets_du_idx" btree (created_at, user_id)
"tweets_id_idx" btree (id) WHERE need_refetch = true
"tweets_in_reply_to_id_created_at_idx" btree (in_reply_to_id, created_at) WHERE is_retweet = false AND did_resolve = false AND in_reply_to_id > 0
"tweets_in_reply_to_id_fp" btree (in_reply_to_id)
"tweets_parent_seq_fk" btree (parent_seq)
"tweets_ud_idx" btree (user_id, created_at)
"tweets_zoek" gin (zoek)
Foreign-key constraints:
"tweets_parent_seq_fkey" FOREIGN KEY (parent_seq) REFERENCES tweets(seq)
"tweets_user_id_fkey" FOREIGN KEY (user_id) REFERENCES tweeps(id)
Referenced by:
TABLE "tweets" CONSTRAINT "tweets_parent_seq_fkey" FOREIGN KEY (parent_seq) REFERENCES tweets(seq)
Triggers:
tr_upd_zzoek_i BEFORE INSERT ON tweets FOR EACH ROW EXECUTE PROCEDURE tf_tweets_upd_zzoek()
tr_upd_zzoek_u BEFORE UPDATE ON tweets FOR EACH ROW WHEN (new.body <> old.body) EXECUTE PROCEDURE tf_tweets_upd_zzoek()

SQL relative versus absolute date impact query time

I run the following query and it takes 50 seconds.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= '2020-08-28'
order by created_at desc
limit 1;
explain:
Limit (cost=100.12..1439.97 rows=1 width=72)
-> Foreign Scan on yyy (cost=100.12..21537.65 rows=16 width=72)
Filter: (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120)
Then I run the following query and it takes "infinite" time: too long to wait for it to finish.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= NOW() - INTERVAL '1 DAY'
order by created_at desc
limit 1;
explain:
Limit (cost=53293831.90..53293831.91 rows=1 width=72)
-> Result (cost=53293831.90..53293987.46 rows=17284 width=72)
-> Sort (cost=53293831.90..53293840.54 rows=17284 width=556)
Sort Key: yyy.created_at DESC
-> Foreign Scan on yyy (cost=100.00..53293814.62 rows=17284 width=556)
Filter: ((created_at >= (now() - '1 day'::interval)) AND (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120))
What could cause this huge difference between the two queries? I know that indexes are used to improve performance. What can we infer from these plans?
Any contribution would be appreciated.
With a literal, the optimizer has an easy time planning efficient data access using the right index.
With an expression like NOW() - INTERVAL '4 DAY', you run into at least two challenges:
It is a stable, not an immutable, expression, let alone a literal.
The expression is a TIMESTAMP WITH TIME ZONE, not a DATE, and you need an implicit type cast.
You just make the optimizer's life difficult ...
I just created a single-column table yyy with 12 years' worth of distinct dates in my PostgreSQL database. No indexes. Already here, you see a difference in the cost of the explain plan.
$ psql -c "explain select * from yyy where created_at >= '2020-08-28'"
QUERY PLAN
------------------------------------------------------
Seq Scan on yyy (cost=0.00..74.79 rows=126 width=4)
Filter: (created_at >= '2020-08-28'::date)
And:
$ psql -c "explain select * from yyy where created_at >= now() - interval '4 day'"
QUERY PLAN
--------------------------------------------------------
Seq Scan on yyy (cost=0.00..96.70 rows=126 width=4)
Filter: (created_at >= (now() - '4 days'::interval))
(2 rows)
The difference will be much bigger when an index exists ...
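One way to sidestep the second challenge, sketched under the assumption that created_at is a DATE column as in the test table above: subtracting an integer from CURRENT_DATE yields a DATE, so both sides of the comparison have the same type. CURRENT_DATE is still a stable expression rather than a literal, so this does not address the first challenge.

```sql
-- created_at is assumed to be a DATE column, as in the test table yyy.
-- CURRENT_DATE - 4 is "date minus integer", which yields a DATE,
-- so the comparison stays date-to-date with no implicit cast.
EXPLAIN
SELECT *
FROM yyy
WHERE created_at >= CURRENT_DATE - 4;
```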

PostgreSQL aggregate query is very slow

I have a table, which contains a timestamp column and a source column varchar(20). I insert a couple thousand entries into this table every hour and I would like to show an aggregate on this data. My query looks like this:
EXPLAIN (analyze, buffers) SELECT
count(*) AS count
FROM frontend_car c
WHERE date_created at time zone 'cet' > now() at time zone 'cet' - interval '1 week'
GROUP BY source, date_trunc('hour', c.date_created at time zone 'CET')
ORDER BY source ASC, date_trunc('hour', c.date_created at time zone 'CET') DESC
I have already created an index like so:
create index source_date_created on
table_name(
(date_created AT TIME ZONE 'CET') DESC,
source ASC,
date_trunc('hour', date_created at time zone 'CET') DESC
);
And the output of my ANALYZE is:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=142888.08..142889.32 rows=495 width=16) (actual time=10242.141..10242.188 rows=494 loops=1)
Sort Key: source, (date_trunc('hour'::text, timezone('CET'::text, date_created)))
Sort Method: quicksort Memory: 63kB
Buffers: shared hit=27575 read=28482
-> HashAggregate (cost=142858.50..142865.93 rows=495 width=16) (actual time=10236.393..10236.516 rows=494 loops=1)
Group Key: source, date_trunc('hour'::text, timezone('CET'::text, date_created))
Buffers: shared hit=27575 read=28482
-> Bitmap Heap Scan on frontend_car c (cost=7654.61..141002.20 rows=247507 width=16) (actual time=427.894..10122.438 rows=249056 loops=1)
Recheck Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Rows Removed by Index Recheck: 141143
Heap Blocks: exact=27878 lossy=26713
Buffers: shared hit=27575 read=28482
-> Bitmap Index Scan on frontend_car_source_date_created (cost=0.00..7592.74 rows=247507 width=0) (actual time=420.415..420.415 rows=249056 loops=1)
Index Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Buffers: shared hit=3 read=1463
Planning time: 2.430 ms
Execution time: 10242.379 ms
(17 rows)
Clearly this is too slow, and in my mind it should be computable using only indexes. If I aggregate by only time or only source it is reasonably fast, but with both together it is somehow slow.
This is on a rather small VPS with only 512MB of RAM, and the database presently contains about 700k rows.
From what I read, it seems that the majority of the time is spent on the recheck, which means that the index did not fit in memory?
It sounds like what you really need is a separate aggregate table that gets records inserted or updated via a trigger in your detailed table. The summary table would have your source column, a date/time field to hold only the date and hour portion (truncating any minutes), and finally the running count.
As records are inserted, this summary table gets updated, then your query could be directly on this table. Since it will already be pre-aggregated by source, date and hour, your query would just need to apply the where clause and order that by the source.
I'm not at all fluent in PostgreSQL, but I am sure it has its own support for insert triggers. So, if you have thousands of entries each hour and, say, 50 sources, the entire result set of this aggregate summary table would only be 24 (hours) * 50 (sources) = 1,200 records per day, versus 50k, 60k, 70k+ per day in the detail table. If you then need the exact details for a given date and hour, you can drill into the details on an as-needed basis. How many sources you are actually dealing with is unclear, though.
I would strongly consider this as a solution to your needs.
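A minimal sketch of that idea in PostgreSQL, under several assumptions: the detail table is frontend_car with a non-null source column and a timestamptz date_created column, the server is version 9.5 or later (for ON CONFLICT), and the names frontend_car_hourly, frontend_car_hourly_bump and frontend_car_hourly_trg are hypothetical. It only handles inserts; existing rows would need a one-off backfill, and deletes or updates are not tracked.

```sql
-- Hypothetical summary table: one row per source and hour.
CREATE TABLE frontend_car_hourly (
    source      varchar(20) NOT NULL,
    hour_bucket timestamp   NOT NULL,  -- hour in CET wall-clock time
    car_count   bigint      NOT NULL DEFAULT 0,
    PRIMARY KEY (source, hour_bucket)
);

-- Trigger function: bump the counter for the new row's source and hour.
CREATE FUNCTION frontend_car_hourly_bump() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    INSERT INTO frontend_car_hourly (source, hour_bucket, car_count)
    VALUES (NEW.source,
            date_trunc('hour', NEW.date_created AT TIME ZONE 'CET'),
            1)
    ON CONFLICT (source, hour_bucket)
    DO UPDATE SET car_count = frontend_car_hourly.car_count + 1;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;

CREATE TRIGGER frontend_car_hourly_trg
AFTER INSERT ON frontend_car
FOR EACH ROW EXECUTE PROCEDURE frontend_car_hourly_bump();

-- The report then reads the small pre-aggregated table.
SELECT source, hour_bucket, car_count
FROM frontend_car_hourly
WHERE hour_bucket > now() AT TIME ZONE 'CET' - INTERVAL '1 week'
ORDER BY source ASC, hour_bucket DESC;
```

The trade-off is a small extra write per inserted row in exchange for a report query that touches only a few hundred summary rows per week and source.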

Poor performance on simple query

I have a query in a function to select the top row and another for the last row. Each query takes around 300 ms to execute, and the query is executed many times, making the function useless.
This is the query (this is a test; in the function the parameters change):
SELECT the_geom
FROM "Entries"
WHERE taxiid= 366 and timestamp between '2008-02-06 16:00:00' and timestamp '2008-02-06 16:00:00' + interval '5 minutes'
ORDER BY entryid DESC
LIMIT 1;
and this is the EXPLAIN ANALYZE output of the query:
QUERY PLAN
--------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------
Seq Scan on "Entries" (cost=0.00..63538.80 rows=70 width=51) (actual time=184.409..342.049 rows=56 loops=1)
Filter: (("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 16:05:00'::timestamp without time zone
) AND (taxiid = 366))
Rows Removed by Filter: 2128847
Planning time: 0.191 ms
Execution time: 342.088 ms
(5 rows)
Is there a better way of getting top and last row?
EDIT:
Thanks Drunix, that did help, but something I can't understand is happening. With the index you suggested I was able to go from ~300 ms to 0.2 ms,
but if I change the interval that is added to the timestamp to 120 minutes, the index is not used and the query keeps taking about 300 ms.
Here is the proof (5-minute interval):
snowflake=# explain analyze Select the_geom from "Entries"
where taxiid= 366 and "timestamp" between '2008-02-06 16:00:00' and "timestamp" '2008-02-06 16:00:00' + interval '5 minutes'
ORDER BY entryid ASC
LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
Limit (cost=149.52..149.52 rows=1 width=55) (actual time=0.129..0.129 rows=1 loops=1)
-> Sort (cost=149.52..149.70 rows=73 width=55) (actual time=0.127..0.127 rows=1 loops=1)
Sort Key: entryid
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using entriesindex on "Entries" (cost=0.43..149.15 rows=73 width=55) (actual time=0.045..0.090 rows=56 loops=1)
Index Cond: ((taxiid = 366) AND ("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 16:
05:00'::timestamp without time zone))
Planning time: 0.266 ms
Execution time: 0.180 ms
(8 rows)
The other one (120-minute interval):
snowflake=# explain analyze Select the_geom from "Entries"
where taxiid= 366 and "timestamp" between '2008-02-06 16:00:00' and "timestamp" '2008-02-06 16:00:00' + interval '120 minutes'
ORDER BY entryid ASC
LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
Limit (cost=0.43..60.02 rows=1 width=55) (actual time=245.570..245.570 rows=1 loops=1)
-> Index Scan using "Entries_pkey" on "Entries" (cost=0.43..97542.75 rows=1637 width=55) (actual time=245.568..245.568 rows=1 loops=1)
Filter: (("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 18:00:00'::timestamp without tim
e zone) AND (taxiid = 366))
Rows Removed by Filter: 853963
Planning time: 0.277 ms
Execution time: 245.616 ms
OK, rephrasing my comment as an answer:
Unless you already have one, you should create a composite index:
create index somename on "Entries"(taxiid, "timestamp");
According to your execution plan, the combination of these fields should be rather selective, so an index scan should be more efficient. Note that an index on ("timestamp", taxiid) is probably much less useful, because it will only be used to limit the rows by timestamp. In similar cases, put the columns that are checked for equality first.