SQL: relative versus absolute date impact on query time

I run the following query and it takes 50 seconds.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= '2020-08-28'
order by created_at desc
limit 1;
explain:
Limit (cost=100.12..1439.97 rows=1 width=72)
-> Foreign Scan on yyy (cost=100.12..21537.65 rows=16 width=72)
Filter: (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120)
Then I run the following query and it takes an "infinite" amount of time. Too long to wait for it to finish.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= NOW() - INTERVAL '1 DAY'
order by created_at desc
limit 1;
explain:
Limit (cost=53293831.90..53293831.91 rows=1 width=72)
-> Result (cost=53293831.90..53293987.46 rows=17284 width=72)
-> Sort (cost=53293831.90..53293840.54 rows=17284 width=556)
Sort Key: yyy.created_at DESC
-> Foreign Scan on yyy (cost=100.00..53293814.62 rows=17284 width=556)
Filter: ((created_at >= (now() - '1 day'::interval)) AND (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120))
What could make such a huge difference between these queries? I know that indexes are used to improve performance. What can we infer from here?
Any contribution would be appreciated.

With a literal, the optimizer has an easy time planning efficient data access using the right index.
With an expression like NOW() - INTERVAL '4 DAY', you run into at least two challenges:
It is a STABLE expression, not an IMMUTABLE one, let alone a literal.
The expression is a TIMESTAMP WITH TIME ZONE, not a DATE, so an implicit type cast is needed.
You just make the optimizer's life difficult ...
I just created a single-column table yyy with 12 years' worth of distinct dates in my PostgreSQL database. No indexes. Already here, you see a difference in the cost of the explain plan.
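Roughly along these lines (a sketch of how such a test table could be built; the exact DDL and date range are assumptions, not part of the original answer):
-- single date column, ~12 years of distinct dates, no indexes
CREATE TABLE yyy AS
SELECT d::date AS created_at
FROM generate_series(timestamp '2010-01-01', timestamp '2021-12-31', interval '1 day') AS d;
ANALYZE yyy;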
$ psql -c "explain select * from yyy where created_at >= '2020-08-28'"
QUERY PLAN
------------------------------------------------------
Seq Scan on yyy (cost=0.00..74.79 rows=126 width=4)
Filter: (created_at >= '2020-08-28'::date)
And:
$ psql -c "explain select * from yyy where created_at >= now() - interval '4 day'"
QUERY PLAN
--------------------------------------------------------
Seq Scan on yyy (cost=0.00..96.70 rows=126 width=4)
Filter: (created_at >= (now() - '4 days'::interval))
(2 rows)
The difference will be much worse once an index exists ....
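One common way around this (a sketch, not part of the original answer) is to compute the cutoff on the client side and pass it to the server as a literal, so the planner sees a constant. For example, with a psql variable:
-- compute "yesterday" in the application/shell, then interpolate it as a quoted literal
\set cutoff '2020-08-28'
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= :'cutoff'
order by created_at desc
limit 1;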


PostgreSQL: get latest row for each time interval

I have the following table. It is stored as a TimescaleDB hypertable. Data rate is 1 row per second.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ("time", meter_id)
)
I would like to get the latest row in a given time interval, over a period of time.
For instance the latest record each month for the previous year.
The following query works but is slow:
EXPLAIN ANALYZE
SELECT
DISTINCT ON (bucket)
time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-01-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY bucket DESC
Unique (cost=0.42..542380.99 rows=200 width=40) (actual time=3654.263..59130.398 rows=12 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..514045.41 rows=11334231 width=40) (actual time=3654.260..58255.396 rows=11161474 loops=1)
Order: time_bucket('1 mon'::interval, electricity_data."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval) DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=3654.253..3986.885 rows=255582 loops=1)
Index Cond: ((""time"" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Rows Removed by Filter: 24330
-> Index Scan Backward using ""1529_1849_electricity_data_pkey"" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (actual time=1.468..1810.493 rows=603808 loops=1)
Index Cond: ((""time"" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
Planning Time: 57.424 ms
JIT:
Functions: 217
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 43.496 ms, Inlining 18.805 ms, Optimization 2348.206 ms, Emission 1288.087 ms, Total 3698.594 ms
Execution Time: 59176.016 ms
Getting the latest row for a single month is instantaneous:
EXPLAIN ANALYZE
SELECT
"time",
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-12-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY "time" DESC
LIMIT 1
Limit (cost=0.42..0.47 rows=1 width=40) (actual time=0.048..0.050 rows=1 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.047..0.048 rows=1 loops=1)
Order: electricity_data."time" DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: ((""time"" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
-> Index Scan Backward using ""1529_1849_electricity_data_pkey"" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (never executed)
Index Cond: ((""time"" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
-> Index Scan using _hyper_12_1512_chunk_electricity_data_time_idx on _hyper_12_1512_chunk (cost=0.42..8.94 rows=174 width=40) (never executed)
Index Cond: ((""time"" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Planning Time: 2.162 ms
Execution Time: 0.152 ms
Is there a way to execute the query above for each month or custom time interval? Or is there a different way to speed up the first query?
Edit
The following query takes 10 seconds, which is much better, but still slower than the manual approach. An index does not seem to make a difference.
EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');
(... plan removed)
Planning Time: 50.463 ms
JIT:
Functions: 451
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
Execution Time: 9910.058 ms
I'd recommend using the last aggregate and a continuous aggregate to solve this problem.
Like the previous poster, I'd also recommend an index on (meter_id, time) rather than the other way around; you can do this in your table definition by simply changing the order of the keys in your primary key definition.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ( meter_id, "time")
);
But that's a bit off topic. The basic query you'll want to do is something like:
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
meter_id,
last(electricity_data, "time")
FROM electricity_data
GROUP BY 1, 2;
This is a bit confusing until you realize that the table itself is also a type in PostgreSQL - so you can ask for and return a composite type from this call to the last aggregate, which will get the latest value in the month or day or whatever you want.
Then you have to be able to treat that as a row again, so you can expand that by using parentheses and a .* which is how composite types can be expanded in PG.
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
meter_id,
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1,2;
Now, in order to speed things up, you can turn that into a continuous aggregate which will make things much faster.
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
You'll note that I took the meter_id out of the initial select list because that's gonna come from our composite type and I don't need the redundant column, nor can I have two columns with the same name in a view, but I did keep meter_id in my group by.
So that'll speed things up nicely, but, if I were you, I might actually think about doing this on a daily basis and creating a hierarchical continuous aggregate for this type of thing.
CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month',time_bucket, 'Europe/Amsterdam') as month_bucket,
(last(last_meter_day, time_bucket)).*
FROM last_meter_day
GROUP BY 1, meter_id;
The reason for that is that we can't really refresh a monthly continuous aggregate all that often, it's much easier to refresh a daily aggregate and then roll that up into a monthly aggregate more frequently. You could also just have the daily aggregate and roll up to month on the fly in your query as that would be at most 30 days per meter, but of course that won't be as performant.
You'll then have to create continuous aggregate policies for these based on what you want to have happen on refresh.
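For example, refresh policies along these lines could keep both aggregates up to date (the offsets and schedule here are illustrative assumptions, not part of the original answer):
SELECT add_continuous_aggregate_policy('last_meter_day',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

SELECT add_continuous_aggregate_policy('last_meter_month',
    start_offset      => INTERVAL '3 months',
    end_offset        => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 day');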
I'd also suggest, depending on what you're trying to do with this, that you might want to take a look at counter_agg as it might be useful for you. I also recently wrote a post in our forum about how to use it with electricity meters that might be helpful for you depending on how you're processing this data.
You can try an approach that uses a subquery to get the timestamp of the latest time in each bucket. Then, join that to your detail table.
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
That gets you a virtual table with the latest time for each meter for each time bucket (month in this case). It can be accelerated with this index, the same as your primary key but with the columns in the opposite order. With the columns in that order the query can be satisfied with a relatively quick index scan.
CREATE INDEX meter_time ON electricity_data (meter_id, "time")
Then join that to your detail table. Like this.
SELECT d.meter_id,
time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
d."time",
d.import_low,
d.import_normal,
d.export_low,
d.export_normal
FROM electricity_data d
JOIN (
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
) last ON d."time" = last."time"
AND d.meter_id = last.meter_id
ORDER BY d.meter_id, bucket DESC
(I'm not completely sure of the syntax in TimeScaleDB for columns that have the same name as reserved words like time, so this isn't tested.)
If you want just one meter, put a WHERE clause right before the last ORDER BY clause.
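For example, the full query with that filter added would look like this (using meter 1, as elsewhere in the question; still untested for the same reason):
SELECT d.meter_id,
time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
d."time",
d.import_low,
d.import_normal,
d.export_low,
d.export_normal
FROM electricity_data d
JOIN (
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
) last ON d."time" = last."time"
AND d.meter_id = last.meter_id
WHERE d.meter_id = 1
ORDER BY d.meter_id, bucket DESC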
The other answers are likely more useful in most cases.
I wanted a solution that works for any interval, without the need for continuous aggregates.
I ended up with the following query, using a lateral join. I use the lag function to compute energy consumption/generation in a time bucket (omitted below). Variables $__interval, $__timeFrom() and $__timeTo() specify the chosen bucket interval and time range.
SELECT bucket, import_low, import_normal, export_low, export_normal
FROM (
SELECT
tstzrange(
-- Could also use date_trunc or date_bin
time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
'(]' -- We use an inclusive upper bound, because a meter reading on the upper boundary applies to the previous period
) bucket
FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
) buckets
LEFT JOIN LATERAL (
SELECT *
FROM electricity_data
WHERE meter_id = $meterId AND "time" <@ bucket
ORDER BY "time" DESC
LIMIT 1
) elec ON true
ORDER BY bucket;

How do I convince postgres to choose the MUCH more efficient of two nearly identical indexes (6 orders of magnitude more efficient)

I have some huge postgres tables that seem to be using the wrong index. In a big way. Like, in a 'if I remove one index, the query performance goes up by six orders of magnitude' way. (For those of you counting, that's ~1ms to 32 minutes.) We vacuum and analyze this table daily.
Simplified table for easier parsing:
action
-----
id bigint
org bigint
created timestamp without time zone
action_time timestamp without time zone
Query:
SELECT min(created) FROM action
WHERE org = 10
AND created > NOW() - INTERVAL '25 hour'
AND action_time < NOW() - INTERVAL '1 hour'
Two indexes:
action (org, action_time, created)
action (org, created, action_time)
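Spelled out as DDL (the first index name appears in the plans below; the second name is assumed here for illustration):
CREATE INDEX ix_action_org_action_time_created ON action (org, action_time, created);
CREATE INDEX ix_action_org_created_action_time ON action (org, created, action_time);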
Let's say an org creates 200k events a day, and has been running for a year. That means that 99.99% of the items in the action table were created more than an hour ago, and action_time is almost always roughly around when they are created, with much less than 0.01% of them more than a few minutes earlier. This means that around 99.99% of rows satisfy the action_time < NOW() - INTERVAL '1 hour' clause.
On the other hand, around 0.3% of rows were created in the last 25 hours, thereby satisfying the created > NOW() - INTERVAL '25 hour' clause.
So guess which index it uses?
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=55.45..55.46 rows=1 width=8)
InitPlan 1 (returns $0)
-> Limit (cost=0.71..55.45 rows=1 width=8)
-> Index Only Scan using ix_action_org_action_time_created on action (cost=0.71..11498144.88 rows=210051 width=8)
Index Cond: ((org = 50) AND (action_time IS NOT NULL) AND (action_time < (now() - '01:00:00'::interval)) AND (created > (now() - '25:00:00'::interval)))
(5 rows)
Yup! It loads the entire index and scans through literally 99.99% of it searching for the 0.3%, rather than loading 0.3% of the other index and then examining it for the matching 99.99% of those records. Of course, if I drop the second index, it immediately starts using the correct one and the performance goes up accordingly.
Postgres doesn't support index hinting, and as far as I can tell none of the workarounds that the postgres dev team says are much better than index hinting would help here in any way. Possibly there is some way to tell it that created has a roughly uniform distribution over years (and so does action_time)? Would that even help, given that I can't even imagine how it wouldn't know that already? Is there anything else that could help?
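(For reference, the closest standard knob for telling the planner more about a column's distribution is the per-column statistics target; raising it is shown here only as an illustration, not as a confirmed fix for this plan:)
ALTER TABLE action ALTER COLUMN created SET STATISTICS 1000;
ALTER TABLE action ALTER COLUMN action_time SET STATISTICS 1000;
ANALYZE action;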
edit: explain verbose:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=57.48..57.49 rows=1 width=8)
Output: $0
InitPlan 1 (returns $0)
-> Limit (cost=0.72..57.48 rows=1 width=8)
Output: action.action_time
-> Index Only Scan using ix_action_org_action_time_created on public.action (cost=0.72..11851726.67 rows=208788 width=8)
Output: action.action_time
Index Cond: ((action.org = 10) AND (action.action_time IS NOT NULL) AND (action.action_time < (now() - '01:00:00'::interval)) AND (action.created > (now() - '25:00:00'::interval)) AND (action.created < now()))
(8 rows)
I'll add explain (analyze, buffers, verbose) should this ever finish running. Sigh.
edit2: business logic: action_time is ALMOST always before created. 99.999+% of the time. No other requirements, and even that one isn't perfect.

Query optimization - how to achieve that in this query?

How can I optimize this query? I have created indexes and partitions and increased worker memory, but the execution time is still 35 seconds. How can I reduce it to 10-15 seconds?
Update:
Removed the conversion of every timestamp from UTC to local time (i.e. time_stamp AT TIME ZONE 'utc' AT TIME ZONE), which improved performance by approximately 5 seconds. Current execution time: 36.5 seconds.
explain analyse select
DATE_TRUNC('day', time_stamp) as "time_stamp",
COUNT(DISTINCT id) AS alarm_count,
COUNT(DISTINCT patient_id) AS patient_count
FROM
alarm_management.alarm
WHERE
tenant_name = 'abc'
and
unit = ANY('{a,b,c,d,e,f,g,h,i,j,k}'::text[])
AND
time_stamp BETWEEN '2021-09-15 02:25:00' AND '2021-12-14 04:36:45'
AND
severity_label = ANY('{a,b,c,d}'::text[])
AND derived_label IS NOT NULL
GROUP by 1
EXPLAIN (ANALYZE, VERBOSE, BUFFERS) output:
GroupAggregate (cost=3064683.77..3215681.44 rows=308821 width=24) (actual time=24242.730..35145.380 rows=91 loops=1)
Group Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
-> Sort (cost=3064683.77..3101468.12 rows=14713740 width=40) (actual time=24167.513..25036.293 rows=16369464 loops=1)
Sort Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
Sort Method: quicksort Memory: 1672081kB
-> Append (cost=0.00..1312964.42 rows=14713740 width=40) (actual time=0.308..20958.290 rows=16369464 loops=1)
-> Seq Scan on alarm_hospitalc_burn_2021_9 (cost=0.00..7175.10 rows=69691 width=40) (actual time=0.307..127.521 rows=94286 loops=1)
Filter: ((derived_label IS NOT NULL) AND (time_stamp >= '2021-09-15 02:25:00'::timestamp without time zone) AND (time_stamp <= '2021-12-14 04:36:45'::timestamp without time zone) AND (tenant_name = 'HospitalC'::text) AND (severity_label = ANY ('{"Short Yellow",Cyan,Red,Yellow}'::text[])) AND (unit = ANY ('{Burn,Delivery,EDI,EDT,EDW,ICU1,ICU2,ICU3P,ICU4P,PP,Tele}'::text[])))
The function can be written in SQL; that might be slightly faster:
CREATE OR REPLACE FUNCTION dbo.get_time_group ( _date_type TEXT )
RETURNS TEXT
LANGUAGE sql -- SQL is good enough
IMMUTABLE -- better for performance, next call is faster because of caching
AS
$$
SELECT CASE $1
WHEN 'hour' THEN 'hour'
ELSE 'day'
END;
$$;
But the most important thing is the query plan.

How can I utilize a partial index for the calculated filter condition in a where clause?

Let's say I have this simple query:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= (current_date - INTERVAL '5 days');
This is a 1 GB+ table with a partial index on the created_at column. When I run this query it does a sequential scan and does not use my index, which obviously takes much more time:
Aggregate (cost=345853.43..345853.44 rows=1 width=8) (actual time=27126.426..27126.426 rows=1 loops=1)
-> Seq Scan on audiences a (cost=0.00..345840.46 rows=5188 width=0) (actual time=97.564..27124.317 rows=8029 loops=1)
Filter: (created_at >= (('now'::cstring)::date - '5 days'::interval))
Rows Removed by Filter: 2215612
Planning time: 0.131 ms
Execution time: 27126.458 ms
On the other hand, if I had a "hardcoded" (or pre-calculated) value like this:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00';
It would utilise an index on created_at:
Aggregate (cost=253.18..253.19 rows=1 width=8) (actual time=1014.655..1014.655 rows=1 loops=1)
-> Index Only Scan using index_audiences_on_created_at on audiences a (cost=0.29..240.21 rows=5188 width=0) (actual time=1.308..1011.071 rows=8029 loops=1)
Index Cond: (created_at >= '2020-10-16 00:00:00'::timestamp without time zone)
Heap Fetches: 6185
Planning time: 1.878 ms
Execution time: 1014.716 ms
If I could I'd just use an ORM and generate a query with the right value but I can't. Is there a way I can maybe pre-calculate this timestamp and use it in a WHERE clause via plain SQL?
Adding a little bit of tech info of my setup.
PostgreSQL version: 9.6.11
created_at column type is: timestamp
index: "index_audiences_on_created_at" btree (created_at) WHERE created_at > '2020-10-01 00:00:00'::timestamp without time zone
This is not an exact answer, but it can work in this specific situation.
Since the partial index has the predicate (created_at > '2020-10-01 00:00:00'::timestamp without time zone), if your filter condition is stricter than the predicate condition, you can prepend a matching literal condition to the WHERE clause:
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
audiences a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00'
and
a.created_at >= (current_date - INTERVAL '5 days');
Note: instead of TIMESTAMP, you may have to use TIMESTAMP WITHOUT TIME ZONE or TIMESTAMP WITH TIME ZONE, depending on the column type.
You have a partial index, and the optimizer is not smart enough to evaluate the expression in the where clause and then choose the partial index based on the expression's result.
So there is not much you can do, except creating an index without a WHERE clause.
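For example (a sketch; the index name is made up, and the definition simply drops the WHERE predicate from the existing partial index):
CREATE INDEX index_audiences_on_created_at_full ON audiences (created_at);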
In my experience, the best approach is to create a PL/pgSQL function. I think the problem is that (current_date - INTERVAL '5 days') is being evaluated for every row.
So you would create a function that evaluates it once and then uses the result in the query. Something like:
CREATE OR REPLACE FUNCTION count_audiences(
vinterval text -- to send interval dynamically
)
RETURNS INTEGER
AS $$
DECLARE
vdate timestamp;
vcount integer := 0;
BEGIN
EXECUTE 'SELECT current_date - INTERVAL '''||vinterval||'''' INTO vdate;-- obtain date
SELECT
COUNT(*)
INTO
vcount
FROM
audiences a
WHERE
a.created_at >= vdate;
RETURN vcount;
END;
$$ LANGUAGE plpgsql;
After creating the PL, you just have to call it like:
SELECT * FROM count_audiences('5 days');
This way you can also use an ORM to call the function.
Works here (given a usable index on created_at):
select version();
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
tweets a
WHERE
a.created_at >= (current_date - INTERVAL '5 days')
;
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
tweets a
WHERE
a.created_at >= TIMESTAMP '2020-10-16 00:00:00'
;
\d tweets
Output:
version
-------------------------------------------------------------------------------------------------------
PostgreSQL 11.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
(1 row)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2.90..2.91 rows=1 width=8) (actual time=0.100..0.101 rows=1 loops=1)
-> Index Only Scan using tweets_du_idx on tweets a (cost=0.56..2.90 rows=1 width=0) (actual time=0.088..0.088 rows=0 loops=1)
Index Cond: (created_at >= (CURRENT_DATE - '5 days'::interval))
Heap Fetches: 0
Planning Time: 2.357 ms
Execution Time: 0.217 ms
(6 rows)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2.89..2.90 rows=1 width=8) (actual time=0.016..0.017 rows=1 loops=1)
-> Index Only Scan using tweets_du_idx on tweets a (cost=0.56..2.89 rows=1 width=0) (actual time=0.014..0.014 rows=0 loops=1)
Index Cond: (created_at >= '2020-10-16 00:00:00'::timestamp without time zone)
Heap Fetches: 0
Planning Time: 0.163 ms
Execution Time: 0.045 ms
(6 rows)
Table "public.tweets"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+---------
seq | bigint | | not null |
id | bigint | | not null |
user_id | bigint | | not null |
in_reply_to_id | bigint | | not null | 0
parent_seq | bigint | | not null | 0
sucker_id | integer | | not null | 0
created_at | timestamp with time zone | | |
[snip
body | text | | |
zoek | tsvector | | |
Indexes:
"tweets_pkey" PRIMARY KEY, btree (seq)
"tweets_id_key" UNIQUE CONSTRAINT, btree (id)
"tweets_stamp_idx" UNIQUE, btree (fetch_stamp, seq)
"tweets_userid_id" UNIQUE, btree (user_id, id)
"tweets_du_idx" btree (created_at, user_id)
"tweets_id_idx" btree (id) WHERE need_refetch = true
"tweets_in_reply_to_id_created_at_idx" btree (in_reply_to_id, created_at) WHERE is_retweet = false AND did_resolve = false AND in_reply_to_id > 0
"tweets_in_reply_to_id_fp" btree (in_reply_to_id)
"tweets_parent_seq_fk" btree (parent_seq)
"tweets_ud_idx" btree (user_id, created_at)
"tweets_zoek" gin (zoek)
Foreign-key constraints:
"tweets_parent_seq_fkey" FOREIGN KEY (parent_seq) REFERENCES tweets(seq)
"tweets_user_id_fkey" FOREIGN KEY (user_id) REFERENCES tweeps(id)
Referenced by:
TABLE "tweets" CONSTRAINT "tweets_parent_seq_fkey" FOREIGN KEY (parent_seq) REFERENCES tweets(seq)
Triggers:
tr_upd_zzoek_i BEFORE INSERT ON tweets FOR EACH ROW EXECUTE PROCEDURE tf_tweets_upd_zzoek()
tr_upd_zzoek_u BEFORE UPDATE ON tweets FOR EACH ROW WHEN (new.body <> old.body) EXECUTE PROCEDURE tf_tweets_upd_zzoek()

Does "group by" automatically guarantee "order by"?

Does "group by" clause automatically guarantee that the results will be ordered by that key? In other words, is it enough to write:
select *
from table
group by a, b, c
or does one have to write
select *
from table
group by a, b, c
order by a, b, c
I know that e.g. in MySQL I don't have to, but I would like to know if I can rely on it across SQL implementations. Is it guaranteed?
group by does not necessarily order the data. A DB is designed to grab the data as fast as possible and only sort if necessary.
So add the order by if you need a guaranteed order.
An efficient implementation of group by may perform the grouping by sorting the data internally. That's why some RDBMSs return sorted output when grouping. Yet the SQL specs don't mandate that behavior, so unless it is explicitly documented by the RDBMS vendor I wouldn't bet on it continuing to work tomorrow. OTOH, if the RDBMS implicitly does a sort, it might also be smart enough to optimize away the redundant order by.
An example using PostgreSQL demonstrating the concept:
Create a table with 1M records, with random dates within the last 90 days, and index it by date:
CREATE TABLE WITHDRAW AS
SELECT (random()*1000000)::integer AS IDT_WITHDRAW,
md5(random()::text) AS NAM_PERSON,
(NOW() - ( random() * (NOW() + '90 days' - NOW()) ))::timestamp AS DAT_CREATION, -- from today back to 90 days ago
(random() * 1000)::decimal(12, 2) AS NUM_VALUE
FROM generate_series(1,1000000);
CREATE INDEX WITHDRAW_DAT_CREATION ON WITHDRAW(DAT_CREATION);
Grouping by the date truncated to day, restricting the select to a two-day date range:
EXPLAIN
SELECT
DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '2 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1
HashAggregate (cost=11428.33..11594.13 rows=11053 width=48)
Group Key: date_trunc('DAY'::text, dat_creation)
-> Bitmap Heap Scan on withdraw w (cost=237.73..11345.44 rows=11053 width=14)
Recheck Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
-> Bitmap Index Scan on withdraw_dat_creation (cost=0.00..234.97 rows=11053 width=0)
Index Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
Using a larger restriction date range, it chooses to apply a SORT
EXPLAIN
SELECT
DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '60 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1
GroupAggregate (cost=116522.65..132918.32 rows=655827 width=48)
Group Key: (date_trunc('DAY'::text, dat_creation))
-> Sort (cost=116522.65..118162.22 rows=655827 width=14)
Sort Key: (date_trunc('DAY'::text, dat_creation))
-> Seq Scan on withdraw w (cost=0.00..41949.57 rows=655827 width=14)
Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
Just by adding ORDER BY 1 at the end (there is no significant difference)
GroupAggregate (cost=116522.44..132918.06 rows=655825 width=48)
Group Key: (date_trunc('DAY'::text, dat_creation))
-> Sort (cost=116522.44..118162.00 rows=655825 width=14)
Sort Key: (date_trunc('DAY'::text, dat_creation))
-> Seq Scan on withdraw w (cost=0.00..41949.56 rows=655825 width=14)
Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
PostgreSQL 10.3
It depends on the database vendor.
For example PostgreSQL does not automatically sort the grouped result.
Here you have to use order by to get the data sorted.
But Sybase and Microsoft SQL Server do. Here you can use order by to change the default sorting.
It definitely doesn't. I have experienced this myself: one of my queries suddenly started to return unordered results as the data in the table grew.
I tried it with the AdventureWorks database from MSDN.
select HireDate, min(JobTitle)
from AdventureWorks2016CTP3.HumanResources.Employee
group by HireDate
Results:
2009-01-10  Production Technician - WC40
2009-01-11  Application Specialist
2009-01-12  Assistant to the Chief Financial Officer
2009-01-13  Production Technician - WC50
It returns the data sorted by HireDate, but you should not rely on GROUP BY to sort under any circumstances.
For example, indexes can change this ordering.
I added the following index on (JobTitle, HireDate):
CREATE NONCLUSTERED INDEX NonClusturedIndex_Jobtitle_hireddate ON [HumanResources].[Employee]
(
[JobTitle] ASC,
[HireDate] ASC
)
The result changes with the same select query:
2006-06-30 Production Technician - WC60
2007-01-26 Marketing Assistant
2007-11-11 Engineering Manager
2007-12-05 Senior Tool Designer
2007-12-11 Tool Designer
2007-12-20 Marketing Manager
2007-12-26 Production Supervisor - WC60
You can download Adventureworks2016 at the following address
https://www.microsoft.com/en-us/download/details.aspx?id=49502
It depends on the number of records. With few records, GROUP BY happened to return sorted output automatically. With more records (more than 15 in my case), adding an ORDER BY clause was required.