I have a function with one query that selects the top row and another that selects the last row. Each query takes around 300 ms to execute, and it is executed many times, which makes the function unusable.
This is the query (this is a test; in the function the parameters change):
SELECT the_geom
FROM "Entries"
WHERE taxiid = 366
  AND "timestamp" BETWEEN '2008-02-06 16:00:00' AND timestamp '2008-02-06 16:00:00' + interval '5 minutes'
ORDER BY entryid DESC
LIMIT 1;
and this is the EXPLAIN ANALYZE output of the query:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on "Entries"  (cost=0.00..63538.80 rows=70 width=51) (actual time=184.409..342.049 rows=56 loops=1)
   Filter: (("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 16:05:00'::timestamp without time zone) AND (taxiid = 366))
   Rows Removed by Filter: 2128847
 Planning time: 0.191 ms
 Execution time: 342.088 ms
(5 rows)
Is there a better way of getting the top and the last row?
EDIT:
Thanks Drunix, that did help, but something I can't understand is happening: with the index you suggested I was able to go from ~300 ms to 0.2 ms,
but if I change the time interval that is added to the timestamp to 120 minutes, the index is not used and the query keeps taking ~300 ms.
Here is the proof (5-minute interval):
snowflake=# explain analyze Select the_geom from "Entries"
where taxiid= 366 and "timestamp" between '2008-02-06 16:00:00' and "timestamp" '2008-02-06 16:00:00' + interval '5 minutes'
ORDER BY entryid ASC
LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
 Limit  (cost=149.52..149.52 rows=1 width=55) (actual time=0.129..0.129 rows=1 loops=1)
   ->  Sort  (cost=149.52..149.70 rows=73 width=55) (actual time=0.127..0.127 rows=1 loops=1)
         Sort Key: entryid
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using entriesindex on "Entries"  (cost=0.43..149.15 rows=73 width=55) (actual time=0.045..0.090 rows=56 loops=1)
               Index Cond: ((taxiid = 366) AND ("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 16:05:00'::timestamp without time zone))
 Planning time: 0.266 ms
 Execution time: 0.180 ms
(8 rows)
The other one (120-minute interval):
snowflake=# explain analyze Select the_geom from "Entries"
where taxiid= 366 and "timestamp" between '2008-02-06 16:00:00' and "timestamp" '2008-02-06 16:00:00' + interval '120 minutes'
ORDER BY entryid ASC
LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
 Limit  (cost=0.43..60.02 rows=1 width=55) (actual time=245.570..245.570 rows=1 loops=1)
   ->  Index Scan using "Entries_pkey" on "Entries"  (cost=0.43..97542.75 rows=1637 width=55) (actual time=245.568..245.568 rows=1 loops=1)
         Filter: (("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 18:00:00'::timestamp without time zone) AND (taxiid = 366))
         Rows Removed by Filter: 853963
 Planning time: 0.277 ms
 Execution time: 245.616 ms
Ok, rephrasing my comment as an answer:
Unless you already have it, you should create a composite index:
create index somename on "Entries" (taxiid, "timestamp");
According to your execution plan the combination of these fields should be rather selective, so an index scan should be more efficient. Note that an index on (timestamp, taxiid) is probably much less useful, because it will only be used to limit the rows by timestamp. In cases like this, put the columns that are checked for equality first.
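With that index in place, the lookups from the question stay the same and only the sort direction flips between the first and the last entry. A sketch, reusing the test values from the question:
SELECT the_geom
FROM "Entries"
WHERE taxiid = 366
  AND "timestamp" BETWEEN timestamp '2008-02-06 16:00:00'
                      AND timestamp '2008-02-06 16:00:00' + interval '5 minutes'
ORDER BY entryid ASC   -- first row in the window; use DESC for the last one
LIMIT 1;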
Related
How can I optimize this query? I have created indexes and partitions and increased worker memory, but the execution time is still 35 s. How can I bring it down to 10-15 seconds?
Update:
Removed the conversion of every timestamp from UTC to local time, i.e. time_stamp AT TIME ZONE 'utc' AT TIME ZONE, which improved performance by approximately 5 seconds. Current execution time: 36.5 seconds.
explain analyse select
DATE_TRUNC('day', time_stamp) as "time_stamp",
COUNT(DISTINCT id) AS alarm_count,
COUNT(DISTINCT patient_id) AS patient_count
FROM
alarm_management.alarm
WHERE
tenant_name = 'abc'
and
unit = ANY('{a,b,c,d,e,f,g,h,i,j,k}'::text[])
AND
time_stamp BETWEEN '2021-09-15 02:25:00' AND '2021-12-14 04:36:45'
AND
severity_label = ANY('{a,b,c,d}'::text[])
AND derived_label IS NOT NULL
GROUP by 1
EXPLAIN (ANALYZE, VERBOSE, BUFFERS) output:
GroupAggregate (cost=3064683.77..3215681.44 rows=308821 width=24) (actual time=24242.730..35145.380 rows=91 loops=1)
Group Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
-> Sort (cost=3064683.77..3101468.12 rows=14713740 width=40) (actual time=24167.513..25036.293 rows=16369464 loops=1)
Sort Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
Sort Method: quicksort Memory: 1672081kB
-> Append (cost=0.00..1312964.42 rows=14713740 width=40) (actual time=0.308..20958.290 rows=16369464 loops=1)
-> Seq Scan on alarm_hospitalc_burn_2021_9 (cost=0.00..7175.10 rows=69691 width=40) (actual time=0.307..127.521 rows=94286 loops=1)
Filter: ((derived_label IS NOT NULL) AND (time_stamp >= '2021-09-15 02:25:00'::timestamp without time zone) AND (time_stamp <= '2021-12-14 04:36:45'::timestamp without time zone) AND (tenant_name = 'HospitalC'::text) AND (severity_label = ANY ('{"Short Yellow",Cyan,Red,Yellow}'::text[])) AND (unit = ANY ('{Burn,Delivery,EDI,EDT,EDW,ICU1,ICU2,ICU3P,ICU4P,PP,Tele}'::text[])))
The function can be written in SQL; that might be slightly faster:
CREATE OR REPLACE FUNCTION dbo.get_time_group ( _date_type TEXT )
RETURNS TEXT
LANGUAGE sql -- SQL is good enough
IMMUTABLE -- better for performance, next call is faster because of caching
AS
$$
SELECT CASE $1
WHEN 'hour' THEN 'hour'
ELSE 'day'
END;
$$;
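A quick check of the function (hypothetical calls against the dbo schema used in the definition above):
SELECT dbo.get_time_group('hour');   -- returns 'hour'
SELECT dbo.get_time_group('month');  -- anything else falls back to 'day'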
But the most important thing is the query plan.
Let's say I have this simple query:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= (current_date - INTERVAL '5 days');
This is a 1 GB+ table with a partial index on the created_at column. When I run this query it does a sequential scan and does not use my index, which obviously takes a long time:
Aggregate (cost=345853.43..345853.44 rows=1 width=8) (actual time=27126.426..27126.426 rows=1 loops=1)
-> Seq Scan on audiences a (cost=0.00..345840.46 rows=5188 width=0) (actual time=97.564..27124.317 rows=8029 loops=1)
Filter: (created_at >= (('now'::cstring)::date - '5 days'::interval))
Rows Removed by Filter: 2215612
Planning time: 0.131 ms
Execution time: 27126.458 ms
On the other hand, if I have a "hardcoded" (or pre-calculated) value like this:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00';
It would utilise an index on created_at:
Aggregate (cost=253.18..253.19 rows=1 width=8) (actual time=1014.655..1014.655 rows=1 loops=1)
-> Index Only Scan using index_audiences_on_created_at on audiences a (cost=0.29..240.21 rows=5188 width=0) (actual time=1.308..1011.071 rows=8029 loops=1)
Index Cond: (created_at >= '2020-10-16 00:00:00'::timestamp without time zone)
Heap Fetches: 6185
Planning time: 1.878 ms
Execution time: 1014.716 ms
If I could I'd just use an ORM and generate a query with the right value but I can't. Is there a way I can maybe pre-calculate this timestamp and use it in a WHERE clause via plain SQL?
Adding a little bit of technical info about my setup:
PostgreSQL version: 9.6.11
created_at column type is: timestamp
index: "index_audiences_on_created_at" btree (created_at) WHERE created_at > '2020-10-01 00:00:00'::timestamp without time zone
This is not an exact answer, but it can work in your specific situation.
Since you have the predicate (created_at > '2020-10-01 00:00:00'::timestamp without time zone), if your filtering condition is greater than the predicate condition, you can prepend that condition in the WHERE clause:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00'
and
a.created_at >= (current_date - INTERVAL '5 days');
Note: instead of TIMESTAMP you may have to write TIMESTAMP WITHOUT TIME ZONE or TIMESTAMP WITH TIME ZONE, depending on the column type.
You have a partial index, and the optimizer is not smart enough to evaluate the expression in the where clause and then choose the partial index based on the expression's result.
So there is not much you can do, except creating an index without a WHERE clause.
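A minimal sketch of that, reusing the column from the question (the index name is just illustrative):
-- A plain (non-partial) b-tree index that the planner can choose
-- no matter how the comparison value in the WHERE clause is computed.
CREATE INDEX index_audiences_on_created_at_full ON audiences (created_at);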
From my experience, the best approach is to create a PL/pgSQL function. I think the problem is that (current_date - INTERVAL '5 days') is evaluated for every row.
So you would have to create a function that evaluates it once and then uses the result in the query. Something like:
CREATE OR REPLACE FUNCTION count_audiences(
    vinterval text -- to send the interval dynamically, e.g. '5 days'
)
RETURNS INTEGER
AS $$
DECLARE
    vdate timestamp;
    vcount integer := 0;
BEGIN
    EXECUTE 'SELECT current_date - INTERVAL ''' || vinterval || '''' INTO vdate; -- obtain the cutoff date once
    SELECT COUNT(*)
      INTO vcount
      FROM audiences a
     WHERE a.created_at >= vdate;
    RETURN vcount;
END;
$$ LANGUAGE plpgsql;
After creating the function, you just have to call it like this:
SELECT * FROM count_audiences('5 days');
In this way you can also use an ORM to call for the function.
Works here (given a usable index on created_at):
select version();
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
tweets a
WHERE
a.created_at >= (current_date - INTERVAL '5 days')
;
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
tweets a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00'
;
\d tweets
Output:
version
-------------------------------------------------------------------------------------------------------
PostgreSQL 11.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
(1 row)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2.90..2.91 rows=1 width=8) (actual time=0.100..0.101 rows=1 loops=1)
-> Index Only Scan using tweets_du_idx on tweets a (cost=0.56..2.90 rows=1 width=0) (actual time=0.088..0.088 rows=0 loops=1)
Index Cond: (created_at >= (CURRENT_DATE - '5 days'::interval))
Heap Fetches: 0
Planning Time: 2.357 ms
Execution Time: 0.217 ms
(6 rows)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2.89..2.90 rows=1 width=8) (actual time=0.016..0.017 rows=1 loops=1)
-> Index Only Scan using tweets_du_idx on tweets a (cost=0.56..2.89 rows=1 width=0) (actual time=0.014..0.014 rows=0 loops=1)
Index Cond: (created_at >= '2020-10-16 00:00:00'::timestamp without time zone)
Heap Fetches: 0
Planning Time: 0.163 ms
Execution Time: 0.045 ms
(6 rows)
Table "public.tweets"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+---------
seq | bigint | | not null |
id | bigint | | not null |
user_id | bigint | | not null |
in_reply_to_id | bigint | | not null | 0
parent_seq | bigint | | not null | 0
sucker_id | integer | | not null | 0
created_at | timestamp with time zone | | |
[snip]
body | text | | |
zoek | tsvector | | |
Indexes:
"tweets_pkey" PRIMARY KEY, btree (seq)
"tweets_id_key" UNIQUE CONSTRAINT, btree (id)
"tweets_stamp_idx" UNIQUE, btree (fetch_stamp, seq)
"tweets_userid_id" UNIQUE, btree (user_id, id)
"tweets_du_idx" btree (created_at, user_id)
"tweets_id_idx" btree (id) WHERE need_refetch = true
"tweets_in_reply_to_id_created_at_idx" btree (in_reply_to_id, created_at) WHERE is_retweet = false AND did_resolve = false AND in_reply_to_id > 0
"tweets_in_reply_to_id_fp" btree (in_reply_to_id)
"tweets_parent_seq_fk" btree (parent_seq)
"tweets_ud_idx" btree (user_id, created_at)
"tweets_zoek" gin (zoek)
Foreign-key constraints:
"tweets_parent_seq_fkey" FOREIGN KEY (parent_seq) REFERENCES tweets(seq)
"tweets_user_id_fkey" FOREIGN KEY (user_id) REFERENCES tweeps(id)
Referenced by:
TABLE "tweets" CONSTRAINT "tweets_parent_seq_fkey" FOREIGN KEY (parent_seq) REFERENCES tweets(seq)
Triggers:
tr_upd_zzoek_i BEFORE INSERT ON tweets FOR EACH ROW EXECUTE PROCEDURE tf_tweets_upd_zzoek()
tr_upd_zzoek_u BEFORE UPDATE ON tweets FOR EACH ROW WHEN (new.body <> old.body) EXECUTE PROCEDURE tf_tweets_upd_zzoek()
I run the following query and it takes 50 seconds.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= '2020-08-28'
order by created_at desc
limit 1;
explain:
Limit (cost=100.12..1439.97 rows=1 width=72)
-> Foreign Scan on yyy (cost=100.12..21537.65 rows=16 width=72)
Filter: (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120)
Then I run the following query and it takes an "infinite" time; too long to wait for it to finish.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= NOW() - INTERVAL '1 DAY'
order by created_at desc
limit 1;
explain:
Limit (cost=53293831.90..53293831.91 rows=1 width=72)
-> Result (cost=53293831.90..53293987.46 rows=17284 width=72)
-> Sort (cost=53293831.90..53293840.54 rows=17284 width=556)
Sort Key: yyy.created_at DESC
-> Foreign Scan on yyy (cost=100.00..53293814.62 rows=17284 width=556)
Filter: ((created_at >= (now() - '1 day'::interval)) AND (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120))
What could make such a huge difference between these queries? I know that indexes are used to improve performance. What can we infer from here?
Any contribution would be appreciated.
With a literal, the optimizer has an easy game to plan an efficient data access using the right index.
With an expression like NOW() - INTERVAL '4 DAY', you run into at least two challenges:
It is a stable, not an immutable, expression. Let alone a literal.
The expression is a TIMESTAMP WITH TIME ZONE, not a DATE, and you need an implicit type cast.
You just make the life of the optimizer difficult ...
I just created a single-column table yyy with 12 years' worth of distinct dates in my PostgreSQL database. No indexes. Already here, you see a difference in the cost of the explain plan.
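For reference, a sketch of how such a test table might be built; the exact date range is an assumption, since the answer only says twelve years of distinct dates:
CREATE TABLE yyy (created_at date);

INSERT INTO yyy
SELECT d::date
FROM generate_series(date '2009-01-01', date '2020-12-31', interval '1 day') AS g(d);

ANALYZE yyy;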
$ psql -c "explain select * from yyy where created_at >= '2020-08-28'"
QUERY PLAN
------------------------------------------------------
Seq Scan on yyy (cost=0.00..74.79 rows=126 width=4)
Filter: (created_at >= '2020-08-28'::date)
And:
$ psql -c "explain select * from yyy where created_at >= now() - interval '4 day'"
QUERY PLAN
--------------------------------------------------------
Seq Scan on yyy (cost=0.00..96.70 rows=126 width=4)
Filter: (created_at >= (now() - '4 days'::interval))
(2 rows)
The difference will be much worse once an index exists ...
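If you want to reproduce that, an index as simple as the following would do (the name is arbitrary):
CREATE INDEX yyy_created_at_idx ON yyy (created_at);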
I was trying to figure out why the same query in a function took longer when executed via
RETURN QUERY EXECUTE text_query; (8s)
compared to
RETURN QUERY select ... (5s)
To my understanding this is because EXECUTE does not optimize.
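Roughly, the two variants look like this; a minimal sketch with hypothetical table and function names, since the real function is not shown:
CREATE TABLE IF NOT EXISTS some_table (stav int, zacatek timestamp);

CREATE OR REPLACE FUNCTION get_rows_static() RETURNS SETOF some_table
LANGUAGE plpgsql AS
$$
BEGIN
    -- static SQL: PL/pgSQL can cache the plan across calls
    RETURN QUERY SELECT * FROM some_table WHERE stav >= 91;
END;
$$;

CREATE OR REPLACE FUNCTION get_rows_dynamic(text_query text) RETURNS SETOF some_table
LANGUAGE plpgsql AS
$$
BEGIN
    -- dynamic SQL: the string is parsed and planned anew on every execution
    RETURN QUERY EXECUTE text_query;
END;
$$;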
So I compared the EXPLAIN ANALYZE output and found the problem. It was in one sequential scan where Postgres had optimized the WHERE condition. But the difference in time was far too big, so I pulled it out into a separate query, and it really was that different.
One SELECT took 0.5 s and the other 4.2 s, both returning exactly the same data. With such a difference I would have expected one of them to use an index scan, but neither does: both are sequential scans. I know that conditions in the WHERE clause also have some cost, but I never expected an 8x impact, especially since the estimated cost is 60434 versus 87472, not even close to 8x. (I ran it 10 times with the same results.)
Optimized where condition
select *
from table s_1
where
((stav >= 91) AND
(date_trunc('week'::text, zacatek) >= '2018-07-30 00:00:00'::timestamp without time zone)
AND (date_trunc('week'::text, zacatek) <= '2018-12-31 00:00:00'::timestamp without time zone))
Seq Scan on table s_1 (cost=0.00..60434.66 rows=5260 width=115) (actual time=0.112..555.299 rows=96725 loops=1)
Filter: ((s_1.stav >= 91) AND (date_trunc('week'::text, s_1.zacatek) >= '2018-07-30 00:00:00'::timestamp without time zone) AND (date_trunc('week'::text, s_1.zacatek) <= '2018-12-31 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 1255171
Buffers: shared hit=30017
Planning Time: 0.073 ms
Execution Time: 559.906 ms
Explain Analyze: https://explain.depesz.com/s/o7Gm
Non optimized where condition
select *
from table s_1
where
((stav >= 91) AND
(date_trunc('week'::text, zacatek) <= date_trunc('week'::text, date_trunc('week'::text, to_timestamp('2019-01-06 23:59:59'::text, 'YYYY-MM-DD HH24:mi:ss'::text))))
AND (date_trunc('week'::text, zacatek) >= date_trunc('week'::text, ((date_trunc('week'::text, to_timestamp('2018-12-31 00:00:00'::text, 'YYYY-MM-DD HH24:mi:ss'::text)) - '5 mons'::interval) + '1 day'::interval))))
Seq Scan on table s_1 (cost=0.00..87472.58 rows=5260 width=115) (actual time=0.958..4235.196 rows=96725 loops=1)
Filter: ((s_1.stav >= 91) AND (date_trunc('week'::text, s_1.zacatek) <= date_trunc('week'::text, date_trunc('week'::text, to_timestamp('2019-01-06 23:59:59'::text, 'YYYY-MM-DD HH24:mi:ss'::text)))) AND (date_trunc('week'::text, s_1.zacatek) >= date_trunc('week'::text, ((date_trunc('week'::text, to_timestamp('2018-12-31 00:00:00'::text, 'YYYY-MM-DD HH24:mi:ss'::text)) - '5 mons'::interval) + '1 day'::interval))))
Rows Removed by Filter: 1255171
Buffers: shared hit=30017
Planning Time: 0.166 ms
Execution Time: 4241.479 ms
Explain Analyze: https://explain.depesz.com/s/5Rkv
My question is: why is there such a big difference without any obvious reason? The method is the same, the number of returned rows is the same, and the estimated cost is also rather similar.
PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-36), 64-bit
I have a table which contains a timestamp column and a source column of type varchar(20). I insert a couple of thousand entries into this table every hour, and I would like to show an aggregate of this data. My query looks like this:
EXPLAIN (analyze, buffers) SELECT
count(*) AS count
FROM frontend_car c
WHERE date_created at time zone 'cet' > now() at time zone 'cet' - interval '1 week'
GROUP BY source, date_trunc('hour', c.date_created at time zone 'CET')
ORDER BY source ASC, date_trunc('hour', c.date_created at time zone 'CET') DESC
I have already created an index like so:
create index source_date_created on
table_name(
(date_created AT TIME ZONE 'CET') DESC,
source ASC,
date_trunc('hour', date_created at time zone 'CET') DESC
);
And the output of my ANALYZE is:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=142888.08..142889.32 rows=495 width=16) (actual time=10242.141..10242.188 rows=494 loops=1)
Sort Key: source, (date_trunc('hour'::text, timezone('CET'::text, date_created)))
Sort Method: quicksort Memory: 63kB
Buffers: shared hit=27575 read=28482
-> HashAggregate (cost=142858.50..142865.93 rows=495 width=16) (actual time=10236.393..10236.516 rows=494 loops=1)
Group Key: source, date_trunc('hour'::text, timezone('CET'::text, date_created))
Buffers: shared hit=27575 read=28482
-> Bitmap Heap Scan on frontend_car c (cost=7654.61..141002.20 rows=247507 width=16) (actual time=427.894..10122.438 rows=249056 loops=1)
Recheck Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Rows Removed by Index Recheck: 141143
Heap Blocks: exact=27878 lossy=26713
Buffers: shared hit=27575 read=28482
-> Bitmap Index Scan on frontend_car_source_date_created (cost=0.00..7592.74 rows=247507 width=0) (actual time=420.415..420.415 rows=249056 loops=1)
Index Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Buffers: shared hit=3 read=1463
Planning time: 2.430 ms
Execution time: 10242.379 ms
(17 rows)
Clearly this is too slow, and in my mind it should be possible to compute this using only indexes. If I aggregate by either time or source alone it is reasonably fast, but together it is somehow slow.
This is on a rather small VPS with only 512 MB of RAM, and the database presently contains about 700k rows.
From what I read, it seems that the majority of the time is spent on the recheck. Does that mean the index did not fit in memory?
It sounds like what you really need is a separate aggregate table that gets records inserted or updated via a trigger on your detail table. The summary table would have your source column, a date/time field holding only the date and hour portion (truncating any minutes), and finally the running count.
As records are inserted, this summary table gets updated, and your query can then run directly against it. Since the data will already be pre-aggregated by source, date and hour, your query just needs to apply the WHERE clause and order the result by source.
I'm not fluent at all with PostgreSQL, but I am sure it has its own mechanism for insert triggers. Say you have thousands of entries each hour and, for example, 50 sources: the entire result set of this aggregate summary table would only be 24 (hours) x 50 (sources) = 1,200 records per day, versus 50k, 60k, 70k+ per day. If you then need the exact details for a given date/hour, you can drill into the detail rows on an as-needed basis. But really, how many "sources" you are dealing with is unclear.
I would strongly consider this as a solution to your needs.
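A rough sketch of what that could look like, assuming the detail table is frontend_car as in the question; every other name is made up, and ON CONFLICT needs PostgreSQL 9.5 or later, so treat this as an illustration rather than a finished implementation:
-- Hypothetical summary table: one row per source per hour.
CREATE TABLE frontend_car_hourly (
    source      varchar(20) NOT NULL,
    hour_bucket timestamp   NOT NULL,   -- date_created truncated to the hour (CET)
    car_count   bigint      NOT NULL DEFAULT 0,
    PRIMARY KEY (source, hour_bucket)
);

-- Trigger function: bump the counter for the matching source/hour on every insert.
CREATE OR REPLACE FUNCTION frontend_car_hourly_upd() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    INSERT INTO frontend_car_hourly (source, hour_bucket, car_count)
    VALUES (NEW.source, date_trunc('hour', NEW.date_created AT TIME ZONE 'CET'), 1)
    ON CONFLICT (source, hour_bucket)
    DO UPDATE SET car_count = frontend_car_hourly.car_count + 1;
    RETURN NEW;
END;
$$;

CREATE TRIGGER tr_frontend_car_hourly
AFTER INSERT ON frontend_car
FOR EACH ROW EXECUTE PROCEDURE frontend_car_hourly_upd();

-- The report then reads the pre-aggregated rows directly:
SELECT source, hour_bucket, car_count
FROM frontend_car_hourly
WHERE hour_bucket > now() AT TIME ZONE 'CET' - interval '1 week'
ORDER BY source ASC, hour_bucket DESC;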