PostgreSQL aggregate query is very slow


I have a table, which contains a timestamp column and a source column varchar(20). I insert a couple thousand entries into this table every hour and I would like to show an aggregate on this data. My query looks like this:
EXPLAIN (analyze, buffers) SELECT
count(*) AS count
FROM frontend_car c
WHERE date_created at time zone 'cet' > now() at time zone 'cet' - interval '1 week'
GROUP BY source, date_trunc('hour', c.date_created at time zone 'CET')
ORDER BY source ASC, date_trunc('hour', c.date_created at time zone 'CET') DESC
I have already created an index like so:
create index source_date_created on
table_name(
(date_created AT TIME ZONE 'CET') DESC,
source ASC,
date_trunc('hour', date_created at time zone 'CET') DESC
);
And the output of my EXPLAIN ANALYZE is:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=142888.08..142889.32 rows=495 width=16) (actual time=10242.141..10242.188 rows=494 loops=1)
Sort Key: source, (date_trunc('hour'::text, timezone('CET'::text, date_created)))
Sort Method: quicksort Memory: 63kB
Buffers: shared hit=27575 read=28482
-> HashAggregate (cost=142858.50..142865.93 rows=495 width=16) (actual time=10236.393..10236.516 rows=494 loops=1)
Group Key: source, date_trunc('hour'::text, timezone('CET'::text, date_created))
Buffers: shared hit=27575 read=28482
-> Bitmap Heap Scan on frontend_car c (cost=7654.61..141002.20 rows=247507 width=16) (actual time=427.894..10122.438 rows=249056 loops=1)
Recheck Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Rows Removed by Index Recheck: 141143
Heap Blocks: exact=27878 lossy=26713
Buffers: shared hit=27575 read=28482
-> Bitmap Index Scan on frontend_car_source_date_created (cost=0.00..7592.74 rows=247507 width=0) (actual time=420.415..420.415 rows=249056 loops=1)
Index Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Buffers: shared hit=3 read=1463
Planning time: 2.430 ms
Execution time: 10242.379 ms
(17 rows)
Clearly this is too slow, and in my mind it should be computable using only the index. If I aggregate by only time or only source, it is reasonably fast, but with both together it is somehow slow.
This is on a rather small VPS with only 512MB of ram and the database presently contains about 700k rows.
From what I read it seems that the majority of time is spent on recheck, which means that the index did not fit in memory?

It sounds like what you really need is a separate aggregate table that gets records inserted or updated via a trigger in your detailed table. The summary table would have your source column, a date/time field to hold only the date and hour portion (truncating any minutes), and finally the running count.
As records are inserted, this summary table gets updated, then your query could be directly on this table. Since it will already be pre-aggregated by source, date and hour, your query would just need to apply the where clause and order that by the source.
I'm not fluent with PostgreSQL, but I'm sure it has its own insert triggers. Say you have thousands of entries each hour and, for example, 50 sources. The entire summary table would then hold only 24 (hours) * 50 (sources) = 1,200 records per day, versus 50k, 60k, or 70k+ per day in the detail table. If you need the exact details for a given date and hour, you can drill into the detail table as needed. How many sources you are actually dealing with is unclear, though.
I would strongly consider this as a solution to your needs.
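A minimal sketch of such a trigger-maintained summary table, assuming PostgreSQL 9.5 or later (for ON CONFLICT), a non-null source column, and a date_created column of type timestamptz. The names frontend_car_hourly, hour_bucket, row_count and frontend_car_count_insert are hypothetical and do not come from the question:

```sql
-- Hypothetical summary table: one row per (source, CET hour).
CREATE TABLE frontend_car_hourly (
    source      varchar(20) NOT NULL,
    hour_bucket timestamp   NOT NULL,          -- CET wall-clock hour
    row_count   bigint      NOT NULL DEFAULT 0,
    PRIMARY KEY (source, hour_bucket)
);

-- Trigger function: add 1 to the matching bucket, creating it if needed.
CREATE FUNCTION frontend_car_count_insert() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    INSERT INTO frontend_car_hourly (source, hour_bucket, row_count)
    VALUES (NEW.source,
            date_trunc('hour', NEW.date_created AT TIME ZONE 'CET'),
            1)
    ON CONFLICT (source, hour_bucket)
    DO UPDATE SET row_count = frontend_car_hourly.row_count + 1;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;

CREATE TRIGGER frontend_car_count_insert
AFTER INSERT ON frontend_car
FOR EACH ROW EXECUTE PROCEDURE frontend_car_count_insert();

-- The report then reads a few hundred summary rows instead of ~250k detail rows.
SELECT source, hour_bucket, row_count
FROM frontend_car_hourly
WHERE hour_bucket > (now() AT TIME ZONE 'CET') - interval '1 week'
ORDER BY source ASC, hour_bucket DESC;
```

This sketch only handles inserts. Deletes or updates of date_created or source would need their own triggers, and existing rows would have to be backfilled once. Note also that the oldest bucket is counted in full here, whereas the original query counts only the rows newer than exactly one week.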

Related

PostgreSQL: get latest row for each time interval

I have the following table. It is stored as a TimescaleDB hypertable. Data rate is 1 row per second.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ("time", meter_id)
)
I would like to get the latest row in a given time interval, over a period of time.
For instance the latest record each month for the previous year.
The following query works but is slow:
EXPLAIN ANALYZE
SELECT
DISTINCT ON (bucket)
time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-01-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY bucket DESC
Unique (cost=0.42..542380.99 rows=200 width=40) (actual time=3654.263..59130.398 rows=12 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..514045.41 rows=11334231 width=40) (actual time=3654.260..58255.396 rows=11161474 loops=1)
Order: time_bucket('1 mon'::interval, electricity_data."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval) DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=3654.253..3986.885 rows=255582 loops=1)
Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Rows Removed by Filter: 24330
-> Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (actual time=1.468..1810.493 rows=603808 loops=1)
Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
Planning Time: 57.424 ms
JIT:
Functions: 217
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 43.496 ms, Inlining 18.805 ms, Optimization 2348.206 ms, Emission 1288.087 ms, Total 3698.594 ms
Execution Time: 59176.016 ms
Getting the latest row for a single month is instantaneous:
EXPLAIN ANALYZE
SELECT
"time",
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-12-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY "time" DESC
LIMIT 1
Limit (cost=0.42..0.47 rows=1 width=40) (actual time=0.048..0.050 rows=1 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.047..0.048 rows=1 loops=1)
Order: electricity_data."time" DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
-> Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (never executed)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
-> Index Scan using _hyper_12_1512_chunk_electricity_data_time_idx on _hyper_12_1512_chunk (cost=0.42..8.94 rows=174 width=40) (never executed)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Planning Time: 2.162 ms
Execution Time: 0.152 ms
Is there a way to execute the query above for each month or custom time interval? Or is there a different way to speed up the first query?
Edit
The following query takes 10 seconds, which is much better, but still slower than the manual approach. An index does not seem to make a difference.
EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');
(... plan removed)
Planning Time: 50.463 ms
JIT:
Functions: 451
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
Execution Time: 9910.058 ms

I'd recommend using the last aggregate and a continuous aggregate to solve this problem.
Like the previous poster, I'd also recommend an index on (meter_id, "time") rather than the other way around. You can do this in your table definition by changing the order of the keys in your primary key definition.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ( meter_id, "time")
);
But that's a bit off topic. The basic query you'll want to do is something like:
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
meter_id,
last(electricity_data, "time")
FROM electricity_data
GROUP BY 1, 2;
This is a bit confusing until you realize that the table itself is also a type in PostgreSQL - so you can ask for and return a composite type from this call to the last aggregate, which will get the latest value in the month or day or whatever you want.
Then you have to be able to treat that as a row again, so you can expand that by using parentheses and a .* which is how composite types can be expanded in PG.
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
meter_id,
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1,2;
Now, in order to speed things up, you can turn that into a continuous aggregate which will make things much faster.
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
You'll note that I took the meter_id out of the initial select list because that's gonna come from our composite type and I don't need the redundant column, nor can I have two columns with the same name in a view, but I did keep meter_id in my group by.
So that'll speed things up nicely, but, if I were you, I might actually think about doing this on a daily basis and creating a hierarchical continuous aggregate for this type of thing.
CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month',time_bucket, 'Europe/Amsterdam') as month_bucket,
(last(last_meter_day, time_bucket)).*
FROM last_meter_day
GROUP BY 1, meter_id;
The reason for that is that we can't really refresh a monthly continuous aggregate all that often, it's much easier to refresh a daily aggregate and then roll that up into a monthly aggregate more frequently. You could also just have the daily aggregate and roll up to month on the fly in your query as that would be at most 30 days per meter, but of course that won't be as performant.
You'll then have to create continuous aggregate policies for these based on what you want to have happen on refresh.
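As a rough sketch of such a policy, assuming TimescaleDB 2.x and the last_meter_day view defined above. The offsets and schedule below are illustrative guesses, not values from the question, and should be tuned to how late data can arrive:

```sql
-- Every hour, re-materialize the daily aggregate for the window between
-- 3 days ago and 1 hour ago (illustrative values).
SELECT add_continuous_aggregate_policy('last_meter_day',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

A similar call with larger offsets would be needed for the monthly view, since the refresh window has to span at least two buckets of the aggregate it refreshes.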
I'd also suggest, depending on what you're trying to do with this, that you might want to take a look at counter_agg as it might be useful for you. I also recently wrote a post in our forum about how to use it with electricity meters that might be helpful for you depending on how you're processing this data.

You can try an approach that uses a subquery to get the timestamp of the latest time in each bucket. Then, join that to your detail table.
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
That gets you a virtual table with the latest time for each meter for each time bucket (month in this case). It can be accelerated with this index, the same as your primary key but with the columns in the opposite order. With the columns in that order the query can be satisfied with a relatively quick index scan.
CREATE INDEX meter_time ON electricity_data (meter_id, "time")
Then join that to your detail table. Like this.
SELECT d.meter_id,
time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
d."time",
d.import_low,
d.import_normal,
d.export_low,
d.export_normal
FROM electricity_data d
JOIN (
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
) last ON d."time" = last."time"
AND d.meter_id = last.meter_id
ORDER BY d.meter_id, bucket DESC
(I'm not completely sure of the syntax in TimescaleDB for columns that have the same name as keywords like time, so this isn't tested.)
If you want just one meter, put a WHERE clause right before the last ORDER BY clause.

The other answers are likely more useful in most cases.
I wanted a solution that works for any interval,
without the need for continuous aggregates.
I ended up with the following query, using a lateral join. I use the lag function to compute energy consumption/generation in a time bucket (omitted below). Variables $__interval, $__timeFrom() and $__timeTo() specify the chosen bucket interval and time range.
SELECT bucket, import_low, import_normal, export_low, export_normal
FROM (
SELECT
tstzrange(
-- Could also use date_trunc or date_bin
time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
'(]' -- We use an inclusive upper bound, because a meter reading on the upper boundary applies to the previous period
) bucket
FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
) buckets
LEFT JOIN LATERAL (
SELECT *
FROM electricity_data
WHERE meter_id = $meterId AND "time" <@ bucket
ORDER BY "time" DESC
LIMIT 1
) elec ON true
ORDER BY bucket;
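For readers without the Grafana variables, the same lateral-join idea can be sketched with literal values. This version assumes monthly buckets for 2022, meter 1, and a session time zone of Europe/Amsterdam so that generate_series steps on local month boundaries. It uses a half-open range (lower bound inclusive, upper bound exclusive), unlike the '(]' bounds above, and the alias names b and e are arbitrary:

```sql
-- Month arithmetic on timestamptz follows the session time zone.
SET timezone = 'Europe/Amsterdam';

SELECT b.bucket_start,
       e."time",
       e.import_low,
       e.import_normal,
       e.export_low,
       e.export_normal
FROM generate_series(timestamptz '2022-01-01',
                     timestamptz '2022-12-01',
                     interval '1 month') AS b(bucket_start)
LEFT JOIN LATERAL (
    -- One short backward index scan per bucket, like the fast LIMIT 1 query.
    SELECT "time", import_low, import_normal, export_low, export_normal
    FROM electricity_data
    WHERE meter_id = 1
      AND "time" >= b.bucket_start
      AND "time" <  b.bucket_start + interval '1 month'
    ORDER BY "time" DESC
    LIMIT 1
) e ON true
ORDER BY b.bucket_start;
```

Because of the LEFT JOIN, a month without readings still produces a row, with NULLs in the columns taken from e.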

How do I convince postgres to choose the MUCH more efficient of two nearly identical indexes (6 orders of magnitude more efficient)

I have some huge postgres tables that seem to be using the wrong index. In a big way. Like, in a 'if I remove one index, the query performance goes up by six orders of magnitude' way. (For those of you counting, that's ~1ms to 32 minutes.) We vacuum and analyze this table daily.
Simplified table for easier parsing:
action
-----
id bigint
org bigint
created timestamp without time zone
action_time timestamp without time zone
Query:
SELECT min(created) FROM action
WHERE org = 10
AND created > NOW() - INTERVAL '25 hour'
AND action_time < NOW() - INTERVAL '1 hour'
Two indexes:
action (org, action_time, created)
action (org, created, action_time)
Let's say an org creates 200k events a day, and has been running for a year. That means that 99.99% of the items in the action table were created more than an hour ago, and action_time is almost always roughly around when they are created, with much less than 0.01% of them more than a few minutes earlier. This means that around 99.99% of rows satisfy the action_time < NOW() - INTERVAL '1 hour' clause.
On the other hand, around 0.3% of rows were created in the last 25 hours, thereby satisfying the created > NOW() - INTERVAL '25 hour' clause.
So guess which index it uses?
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=55.45..55.46 rows=1 width=8)
InitPlan 1 (returns $0)
-> Limit (cost=0.71..55.45 rows=1 width=8)
-> Index Only Scan using ix_action_org_action_time_created on action (cost=0.71..11498144.88 rows=210051 width=8)
Index Cond: ((org = 50) AND (action_time IS NOT NULL) AND (action_time < (now() - '01:00:00'::interval)) AND (created > (now() - '25:00:00'::interval)))
(5 rows)
Yup! It loads the entire index and scans through literally 99.99% of it searching for the 0.3%, rather than loading 0.3% of the other index and then examining it for the matching 99.99% of those records. Of course, if I drop the second index, it immediately starts using the correct one and the performance goes up accordingly.
Postgres doesn't support index hinting, and as far as I can tell none of the workarounds that the postgres dev team says are much better than index hinting would help here in any way. Possibly there is some way to tell it that created has a roughly uniform distribution over years (and so does action_time)? Would that even help, given that I can't even imagine how it wouldn't know that already? Is there anything else that could help?
edit: explain verbose:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=57.48..57.49 rows=1 width=8)
Output: $0
InitPlan 1 (returns $0)
-> Limit (cost=0.72..57.48 rows=1 width=8)
Output: action.action_time
-> Index Only Scan using ix_action_org_action_time_created on public.action (cost=0.72..11851726.67 rows=208788 width=8)
Output: action.action_time
Index Cond: ((action.org = 10) AND (action.action_time IS NOT NULL) AND (action.action_time < (now() - '01:00:00'::interval)) AND (action.created > (now() - '25:00:00'::interval)) AND (action.created < now()))
(8 rows)
I'll add explain (analyze, buffers, verbose) should this ever finish running. Sigh.
edit2: business logic: action_time is ALMOST always before created. 99.999+% of the time. No other requirements, and even that one isn't perfect.

Query optimization- How to achieve that in this query?

How can I optimize this query? I have created indexes and partitions and increased worker memory, but the execution time is still 35 s. How can I bring it down to 10-15 seconds?
Update :
I removed the conversion of every timestamp from UTC to local time (i.e. time_stamp AT TIME ZONE 'utc' AT TIME ZONE), which improved performance by approximately 5 seconds. Current execution time: 36.5 seconds.
explain analyse select
DATE_TRUNC('day', time_stamp) as "time_stamp",
COUNT(DISTINCT id) AS alarm_count,
COUNT(DISTINCT patient_id) AS patient_count
FROM
alarm_management.alarm
WHERE
tenant_name = 'abc'
and
unit = ANY('{a,b,c,d,e,f,g,h,i,j,k}'::text[])
AND
time_stamp BETWEEN '2021-09-15 02:25:00' AND '2021-12-14 04:36:45'
AND
severity_label = ANY('{a,b,c,d}'::text[])
AND derived_label IS NOT NULL
GROUP by 1
Explain(analyze, verbose, buffers) output-
GroupAggregate (cost=3064683.77..3215681.44 rows=308821 width=24) (actual time=24242.730..35145.380 rows=91 loops=1)
Group Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
-> Sort (cost=3064683.77..3101468.12 rows=14713740 width=40) (actual time=24167.513..25036.293 rows=16369464 loops=1)
Sort Key: (date_trunc('day'::text, alarm_hospitalc_burn_2021_9.time_stamp))
Sort Method: quicksort Memory: 1672081kB
-> Append (cost=0.00..1312964.42 rows=14713740 width=40) (actual time=0.308..20958.290 rows=16369464 loops=1)
-> Seq Scan on alarm_hospitalc_burn_2021_9 (cost=0.00..7175.10 rows=69691 width=40) (actual time=0.307..127.521 rows=94286 loops=1)
Filter: ((derived_label IS NOT NULL) AND (time_stamp >= '2021-09-15 02:25:00'::timestamp without time zone) AND (time_stamp <= '2021-12-14 04:36:45'::timestamp without time zone) AND (tenant_name = 'HospitalC'::text) AND (severity_label = ANY ('{"Short Yellow",Cyan,Red,Yellow}'::text[])) AND (unit = ANY ('{Burn,Delivery,EDI,EDT,EDW,ICU1,ICU2,ICU3P,ICU4P,PP,Tele}'::text[])))

The function can be written in SQL, that might be slightly faster:
CREATE OR REPLACE FUNCTION dbo.get_time_group ( _date_type TEXT )
RETURNS TEXT
LANGUAGE sql -- SQL is good enough
IMMUTABLE -- better for performance, next call is faster because of caching
AS
$$
SELECT CASE $1
WHEN 'hour' THEN 'hour'
ELSE 'day'
END;
$$;
But the most important thing is the query plan.

Poor performance on simple query

I have a query in a function that selects the top row and another that selects the last row. Each query takes around 300 ms, and because they are executed many times, the function is unusably slow.
This is the query (this is a test; in the function the parameters change):
SELECT the_geom
FROM "Entries"
WHERE taxiid= 366 and timestamp between '2008-02-06 16:00:00' and timestamp '2008-02-06 16:00:00' + interval '5 minutes'
ORDER BY entryid DESC
LIMIT 1;
and this is the EXPLAIN ANALYZE output of the query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on "Entries" (cost=0.00..63538.80 rows=70 width=51) (actual time=184.409..342.049 rows=56 loops=1)
Filter: (("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 16:05:00'::timestamp without time zone) AND (taxiid = 366))
Rows Removed by Filter: 2128847
Planning time: 0.191 ms
Execution time: 342.088 ms
(5 rows)
Is there a better way of getting top and last row?
EDIT:
Thanks Drunix, that did help, but something I can't understand is happening. With the index you suggested I was able to go from ~300 ms to 0.2 ms,
but if I change the interval added to the timestamp to 120 minutes, the index is not used and the query still takes roughly 300 ms.
Here is the proof (5-minute interval):
snowflake=# explain analyze Select the_geom from "Entries"
where taxiid= 366 and "timestamp" between '2008-02-06 16:00:00' and "timestamp" '2008-02-06 16:00:00' + interval '5 minutes'
ORDER BY entryid ASC
LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
Limit (cost=149.52..149.52 rows=1 width=55) (actual time=0.129..0.129 rows=1 loops=1)
-> Sort (cost=149.52..149.70 rows=73 width=55) (actual time=0.127..0.127 rows=1 loops=1)
Sort Key: entryid
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using entriesindex on "Entries" (cost=0.43..149.15 rows=73 width=55) (actual time=0.045..0.090 rows=56 loops=1)
Index Cond: ((taxiid = 366) AND ("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 16:05:00'::timestamp without time zone))
Planning time: 0.266 ms
Execution time: 0.180 ms
(8 rows)
The other one (120-minute interval):
snowflake=# explain analyze Select the_geom from "Entries"
where taxiid= 366 and "timestamp" between '2008-02-06 16:00:00' and "timestamp" '2008-02-06 16:00:00' + interval '120 minutes'
ORDER BY entryid ASC
LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
Limit (cost=0.43..60.02 rows=1 width=55) (actual time=245.570..245.570 rows=1 loops=1)
-> Index Scan using "Entries_pkey" on "Entries" (cost=0.43..97542.75 rows=1637 width=55) (actual time=245.568..245.568 rows=1 loops=1)
Filter: (("timestamp" >= '2008-02-06 16:00:00'::timestamp without time zone) AND ("timestamp" <= '2008-02-06 18:00:00'::timestamp without time zone) AND (taxiid = 366))
Rows Removed by Filter: 853963
Planning time: 0.277 ms
Execution time: 245.616 ms

Ok, rephrasing my comment as an answer:
Unless you already have one, you should create a composite index:
create index somename on "Entries"(taxiid, "timestamp");
According to your execution plan the combination of these fields should be rather selective, so an index scan should be more efficient. Note that an index on ("timestamp", taxiid) is probably much less useful, because it will only be used to limit the rows by timestamp. In similar cases, put the columns that are checked for equality first.
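One possible way to keep the planner on the composite index for wider windows, sketched under an assumption the question does not confirm: that entryid grows with "timestamp" for a given taxi, so the newest row by time is also the row with the highest entryid. Ordering by "timestamp" lets the (taxiid, "timestamp") index supply the sort order, so the scan can stop after the first matching row instead of walking "Entries_pkey":

```sql
-- Last row in the window; use ORDER BY "timestamp" ASC for the first row.
-- Assumes entryid and "timestamp" increase together for a given taxiid.
SELECT the_geom
FROM "Entries"
WHERE taxiid = 366
  AND "timestamp" BETWEEN timestamp '2008-02-06 16:00:00'
                      AND timestamp '2008-02-06 16:00:00' + interval '120 minutes'
ORDER BY "timestamp" DESC
LIMIT 1;
```

If several rows can share the same timestamp, this may pick a different one than ORDER BY entryid would, so it is worth comparing results and checking the plan with EXPLAIN ANALYZE before relying on it.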

Does "group by" automatically guarantee "order by"?

Does "group by" clause automatically guarantee that the results will be ordered by that key? In other words, is it enough to write:
select *
from table
group by a, b, c
or does one have to write
select *
from table
group by a, b, c
order by a, b, c
I know that e.g. in MySQL I don't have to, but I would like to know if I can rely on it across SQL implementations. Is it guaranteed?

group by does not necessarily order the data. A DB is designed to grab the data as fast as possible and only sorts if necessary.
So add the order by if you need a guaranteed order.

An efficient implementation of group by would perform the group-ing by sorting the data internally. That's why some RDBMS return sorted output when group-ing. Yet, the SQL specs don't mandate that behavior, so unless explicitly documented by the RDBMS vendor I wouldn't bet on it to work (tomorrow). OTOH, if the RDBMS implicitly does a sort it might also be smart enough to then optimize (away) the redundant order by. #jimmyb
An example using PostgreSQL proving that concept
Creating a table with 1M records with random dates within the last 90 days, and indexing it by date:
CREATE TABLE WITHDRAW AS
SELECT (random()*1000000)::integer AS IDT_WITHDRAW,
md5(random()::text) AS NAM_PERSON,
(NOW() - ( random() * (NOW() + '90 days' - NOW()) ))::timestamp AS DAT_CREATION, -- from today back to 90 days ago
(random() * 1000)::decimal(12, 2) AS NUM_VALUE
FROM generate_series(1,1000000);
CREATE INDEX WITHDRAW_DAT_CREATION ON WITHDRAW(DAT_CREATION);
Grouping by date truncated to the day, restricting the select to a one-day range (from two days ago to one day ago):
EXPLAIN
SELECT
DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '2 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1
HashAggregate (cost=11428.33..11594.13 rows=11053 width=48)
Group Key: date_trunc('DAY'::text, dat_creation)
-> Bitmap Heap Scan on withdraw w (cost=237.73..11345.44 rows=11053 width=14)
Recheck Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
-> Bitmap Index Scan on withdraw_dat_creation (cost=0.00..234.97 rows=11053 width=0)
Index Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
Using a larger restriction date range, it chooses to apply a SORT
EXPLAIN
SELECT
DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '60 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1
GroupAggregate (cost=116522.65..132918.32 rows=655827 width=48)
Group Key: (date_trunc('DAY'::text, dat_creation))
-> Sort (cost=116522.65..118162.22 rows=655827 width=14)
Sort Key: (date_trunc('DAY'::text, dat_creation))
-> Seq Scan on withdraw w (cost=0.00..41949.57 rows=655827 width=14)
Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
Just by adding ORDER BY 1 at the end (there is no significant difference)
GroupAggregate (cost=116522.44..132918.06 rows=655825 width=48)
Group Key: (date_trunc('DAY'::text, dat_creation))
-> Sort (cost=116522.44..118162.00 rows=655825 width=14)
Sort Key: (date_trunc('DAY'::text, dat_creation))
-> Seq Scan on withdraw w (cost=0.00..41949.56 rows=655825 width=14)
Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
PostgreSQL 10.3

It depends on the database vendor.
For example PostgreSQL does not automatically sort the grouped result.
Here you have to use order by to get the data sorted.
But Sybase and Microsoft SQL Server do. Here you can use order by to change the default sorting.

It definitely doesn't. I have experienced this: one of my queries suddenly started returning unordered results as the data in the table grew.

I tried it with the AdventureWorks database from MSDN.
select HireDate, min(JobTitle)
from AdventureWorks2016CTP3.HumanResources.Employee
group by HireDate
Results:
2009-01-10 Production Technician - WC40
2009-01-11 Application Specialist
2009-01-12 Assistant to the Chief Financial Officer
2009-01-13 Production Technician - WC50
It returns data sorted by HireDate, but you should not rely on GROUP BY to sort under any circumstances.
For example, indexes can change this order.
I added the following index on (JobTitle, HireDate):
CREATE NONCLUSTERED INDEX NonClusturedIndex_Jobtitle_hireddate ON [HumanResources].[Employee]
(
[JobTitle] ASC,
[HireDate] ASC
)
The result changes with the same SELECT query:
2006-06-30 Production Technician - WC60
2007-01-26 Marketing Assistant
2007-11-11 Engineering Manager
2007-12-05 Senior Tool Designer
2007-12-11 Tool Designer
2007-12-20 Marketing Manager
2007-12-26 Production Supervisor - WC60
You can download Adventureworks2016 at the following address
https://www.microsoft.com/en-us/download/details.aspx?id=49502

It depends on the number of records. When there were few records, GROUP BY returned sorted results automatically. When there were more records (more than 15), adding an ORDER BY clause was required.
