PostgreSQL: get latest row for each time interval - sql

I have the following table. It is stored as a TimescaleDB hypertable. Data rate is 1 row per second.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ("time", meter_id)
)
I would like to get the latest row in a given time interval, over a period of time.
For instance the latest record each month for the previous year.
The following query works but is slow:
EXPLAIN ANALYZE
SELECT
DISTINCT ON (bucket)
time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-01-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY bucket DESC
Unique (cost=0.42..542380.99 rows=200 width=40) (actual time=3654.263..59130.398 rows=12 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..514045.41 rows=11334231 width=40) (actual time=3654.260..58255.396 rows=11161474 loops=1)
Order: time_bucket('1 mon'::interval, electricity_data."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval) DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=3654.253..3986.885 rows=255582 loops=1)
Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Rows Removed by Filter: 24330
-> Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (actual time=1.468..1810.493 rows=603808 loops=1)
Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
Planning Time: 57.424 ms
JIT:
Functions: 217
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 43.496 ms, Inlining 18.805 ms, Optimization 2348.206 ms, Emission 1288.087 ms, Total 3698.594 ms
Execution Time: 59176.016 ms
Getting the latest row for a single month is instantaneous:
EXPLAIN ANALYZE
SELECT
"time",
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-12-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY "time" DESC
LIMIT 1
Limit (cost=0.42..0.47 rows=1 width=40) (actual time=0.048..0.050 rows=1 loops=1)
-> Custom Scan (ChunkAppend) on electricity_data (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.047..0.048 rows=1 loops=1)
Order: electricity_data."time" DESC
-> Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
-> Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk (cost=0.42..25777.81 rows=604553 width=40) (never executed)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
-> Index Scan using _hyper_12_1512_chunk_electricity_data_time_idx on _hyper_12_1512_chunk (cost=0.42..8.94 rows=174 width=40) (never executed)
Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
Filter: (meter_id = 1)
Planning Time: 2.162 ms
Execution Time: 0.152 ms
Is there a way to execute the query above for each month or custom time interval? Or is there a different way to speed up the first query?
Edit
The following query takes 10 seconds, which is much better, but still slower than the manual approach. An index does not seem to make a difference.
EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');
(... plan removed)
Planning Time: 50.463 ms
JIT:
Functions: 451
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
Execution Time: 9910.058 ms

I'd recommend using the last aggregate and a continuous aggregate to solve this problem.
Like the previous poster, I'd also recommend an index on (meter_id, "time") rather than the other way around. You can do this in your table definition by changing the order of the keys in your primary key definition.
CREATE TABLE electricity_data
(
"time" timestamptz NOT NULL,
meter_id integer REFERENCES meters NOT NULL,
import_low double precision,
import_normal double precision,
export_low double precision,
export_normal double precision,
PRIMARY KEY ( meter_id, "time")
);
But that's a bit off topic. The basic query you'll want to do is something like:
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
meter_id,
last(electricity_data, "time")
FROM electricity_data
GROUP BY 1, 2;
This is a bit confusing until you realize that the table itself is also a type in PostgreSQL - so you can ask for and return a composite type from this call to the last aggregate, which will get the latest value in the month or day or whatever you want.
Then you have to be able to treat that as a row again, so you can expand that by using parentheses and a .* which is how composite types can be expanded in PG.
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
meter_id,
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1,2;
Now, in order to speed things up, you can turn that into a continuous aggregate which will make things much faster.
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
You'll note that I took the meter_id out of the initial select list because that's gonna come from our composite type and I don't need the redundant column, nor can I have two columns with the same name in a view, but I did keep meter_id in my group by.
So that'll speed things up nicely, but, if I were you, I might actually think about doing this on a daily basis and creating a hierarchical continuous aggregate for this type of thing.
CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
(last(electricity_data, "time")).*
FROM electricity_data
GROUP BY 1, meter_id;
CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month',time_bucket, 'Europe/Amsterdam') as month_bucket,
(last(last_meter_day, time_bucket)).*
FROM last_meter_day
GROUP BY 1, meter_id;
The reason for that is that we can't really refresh a monthly continuous aggregate all that often, it's much easier to refresh a daily aggregate and then roll that up into a monthly aggregate more frequently. You could also just have the daily aggregate and roll up to month on the fly in your query as that would be at most 30 days per meter, but of course that won't be as performant.
You'll then have to create continuous aggregate policies for these based on what you want to have happen on refresh.
I'd also suggest, depending on what you're trying to do with this, that you might want to take a look at counter_agg as it might be useful for you. I also recently wrote a post in our forum about how to use it with electricity meters that might be helpful for you depending on how you're processing this data.
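The daily-then-monthly rollup described above works because "last" composes: the latest row of a month is always the latest of that month's daily latest rows. Here is a minimal Python sketch of that property, not TimescaleDB code. The sample readings, the helper `last_per_bucket`, and the use of naive timestamps (no time zone handling) are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical sample readings for one meter: (timestamp, import_low).
readings = [
    (datetime(2022, 1, 30, 23, 59, 58), 100.0),
    (datetime(2022, 1, 31, 12, 0, 0), 101.5),
    (datetime(2022, 1, 31, 23, 59, 59), 102.0),
    (datetime(2022, 2, 1, 0, 0, 1), 102.1),
    (datetime(2022, 2, 27, 8, 30, 0), 150.0),
    (datetime(2022, 2, 28, 23, 59, 59), 151.2),
]


def last_per_bucket(rows, bucket_of):
    """Keep, for every bucket, the row with the greatest timestamp."""
    latest = {}
    for row in rows:
        key = bucket_of(row[0])
        if key not in latest or row[0] > latest[key][0]:
            latest[key] = row
    return latest


def day_of(ts):
    return ts.date()


def month_of(ts):
    return (ts.year, ts.month)


# Level 1: one row per day (plays the role of last_meter_day).
daily = last_per_bucket(readings, day_of)
# Level 2: roll the daily rows up to months (plays the role of last_meter_month).
monthly_from_daily = last_per_bucket(daily.values(), month_of)
# Reference: the monthly result computed directly from the raw rows.
monthly_direct = last_per_bucket(readings, month_of)
```

Both routes give the same monthly rows, which is why the cheaper daily aggregate can feed the monthly one.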

You can try an approach that uses a subquery to get the timestamp of the latest time in each bucket. Then, join that to your detail table.
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
That gets you a virtual table with the latest time for each meter for each time bucket (month in this case). It can be accelerated with this index, the same as your primary key but with the columns in the opposite order. With the columns in that order the query can be satisfied with a relatively quick index scan.
CREATE INDEX meter_time ON electricity_data (meter_id, "time")
Then join that to your detail table. Like this.
SELECT d.meter_id,
time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
d."time",
d.import_low,
d.import_normal,
d.export_low,
d.export_normal
FROM electricity_data d
JOIN (
SELECT meter_id, MAX("time") "time"
FROM electricity_data
WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY meter_id,
time_bucket('1 month', "time", 'Europe/Amsterdam')
) last ON d."time" = last."time"
AND d.meter_id = last.meter_id
ORDER BY d.meter_id, bucket DESC
(I'm not completely sure of the syntax in TimescaleDB for columns that have the same name as reserved words like time, so this isn't tested.)
If you want just one meter, put a WHERE clause right before the last ORDER BY clause.
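To see the shape of this join on a small dataset, here is a sketch that runs the same pattern in an in-memory SQLite database from Python. It is an illustration under assumptions, not TimescaleDB: `strftime('%Y-%m', ...)` stands in for `time_bucket`, timestamps are stored as ISO text, only one value column is kept, and the subquery alias `latest` and column `last_time` are names chosen here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE electricity_data (
    "time"     TEXT    NOT NULL,   -- ISO-8601 text; SQLite has no timestamptz
    meter_id   INTEGER NOT NULL,
    import_low REAL,
    PRIMARY KEY (meter_id, "time")
);
INSERT INTO electricity_data VALUES
    ('2022-01-15 10:00:00', 1, 10.0),
    ('2022-01-31 23:59:59', 1, 11.0),
    ('2022-02-10 08:00:00', 1, 12.0),
    ('2022-02-28 23:59:58', 1, 13.0),
    ('2022-01-31 23:00:00', 2, 20.0);
""")

# The subquery finds the latest timestamp per meter and month;
# the join fetches the full row for each of those timestamps.
rows = conn.execute("""
SELECT d.meter_id,
       strftime('%Y-%m', d."time") AS bucket,
       d."time",
       d.import_low
FROM electricity_data d
JOIN (
    SELECT meter_id, MAX("time") AS last_time
    FROM electricity_data
    WHERE "time" >= '2022-01-01 00:00:00'
      AND "time" <  '2023-01-01 00:00:00'
    GROUP BY meter_id, strftime('%Y-%m', "time")
) latest ON d."time" = latest.last_time
        AND d.meter_id = latest.meter_id
ORDER BY d.meter_id, bucket DESC
""").fetchall()

for row in rows:
    print(row)
```

Each meter contributes one row per month, namely the reading with the greatest timestamp in that month.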

The other answers are likely more useful in most cases.
I wanted a solution that works for any interval,
without the need for continuous aggregates.
I ended up with the following query, using a lateral join. I use the lag function to compute energy consumption/generation in a time bucket (omitted below). Variables $__interval, $__timeFrom() and $__timeTo() specify the chosen bucket interval and time range.
SELECT bucket, import_low, import_normal, export_low, export_normal
FROM (
SELECT
tstzrange(
-- Could also use date_trunc or date_bin
time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
'(]' -- We use an inclusive upper bound, because a meter reading on the upper boundary applies to the previous period
) bucket
FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
) buckets
LEFT JOIN LATERAL (
SELECT *
FROM electricity_data
WHERE meter_id = $meterId AND "time" <@ bucket
ORDER BY "time" DESC
LIMIT 1
) elec ON true
ORDER BY bucket;
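The idea behind the lateral join is one cheap "latest row at or before the upper bound" lookup per bucket, instead of scanning every row in the range. A rough Python analogy, assuming a sorted list plays the role of the index and a binary search plays the role of the backward index scan; the sample timestamps and the helper names `make_buckets` and `latest_in` are invented for this sketch:

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical sorted reading timestamps for one meter (stand-in for the index).
times = [
    datetime(2022, 1, 1, 0, 0, 0),
    datetime(2022, 1, 1, 12, 0, 0),
    datetime(2022, 1, 2, 0, 0, 0),   # exactly on a bucket boundary
    datetime(2022, 1, 2, 18, 0, 0),
    datetime(2022, 1, 4, 6, 0, 0),
]


def make_buckets(start, stop, step):
    """Yield (lower, upper) pairs, one per series value from start to stop inclusive."""
    lower = start
    while lower <= stop:
        yield lower, lower + step
        lower += step


def latest_in(lower, upper):
    """Latest timestamp t with lower < t <= upper, or None (the '(]' range)."""
    i = bisect_right(times, upper)        # index of the first element > upper
    if i == 0 or times[i - 1] <= lower:   # nothing falls inside the bucket
        return None
    return times[i - 1]


step = timedelta(days=1)
result = {
    lower: latest_in(lower, upper)
    for lower, upper in make_buckets(datetime(2021, 12, 31), datetime(2022, 1, 3), step)
}
```

A reading exactly on a boundary (2022-01-02 00:00) is assigned to the bucket that ends there, matching the inclusive upper bound, and a bucket with no readings maps to `None`, which mirrors the LEFT JOIN.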

Related

Postgres ignoring index using the COALESCE function

I have the following table, with ~4 million rows:
CREATE TABLE members (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
updated_at TIMESTAMP WITH TIME ZONE,
-- other columns...
);
I use this following query to extract the latest updated rows:
SELECT *
FROM members
WHERE COALESCE(updated_at, created_at) > current_timestamp - interval '24 hours'
This query is obviously slow, so I created an index, but it is not used by Postgres:
CREATE INDEX members_updated_or_created_at ON members(COALESCE(updated_at, created_at));
Here's the execution plan:
Seq Scan on members (cost=0.00..171792.01 rows=1326991 width=1826) (actual time=62.663..22064.805 rows=1 loops=1)
Filter: (COALESCE(updated_at, created_at) > (CURRENT_TIMESTAMP - '48:00:00'::interval))
Rows Removed by Filter: 3980971
Planning Time: 0.123 ms
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 7.481 ms, Inlining 0.000 ms, Optimization 8.067 ms, Emission 35.308 ms, Total 50.857 ms
Execution Time: 22072.906 ms
I don't understand why it's doing a table scan instead of using an index scan. I also tried to select fewer fields, and adding a limit, but it didn't change anything.
EDIT:
So it seems like the index is not used because I'm fetching many columns that are not present in the index (select *).
I tried to do the same with the updated_at column, and this time, the index is used if the only column I select is the "updated_at" column (Index Only Scan), it is not used if I include another column though.
What I don't understand, is why don't I get the same behavior with the coalesce function?
This query results in a full table scan
SELECT coalesce(updated_at, created_at)
FROM members
WHERE coalesce(updated_at, created_at) > current_timestamp - interval '7 days';
This query results in an Index Only Scan (index on updated_at)
SELECT updated_at
FROM members
WHERE updated_at > current_timestamp - interval '7 days';
I found the solution: to force the DB to use my index, I added an ORDER BY clause, and it seems to work:
SELECT *
FROM members
WHERE coalesce(updated_at, created_at) > current_timestamp - interval '7 days'
ORDER BY coalesce(updated_at, created_at) DESC;
Index Scan Backward using members_updated_or_created_at on members (cost=0.43..446282.78 rows=1326991 width=1834) (actual time=8.367..8.369 rows=1 loops=1)
Index Cond: (COALESCE(updated_at, created_at) > (CURRENT_TIMESTAMP - '7 days'::interval))
Planning Time: 0.261 ms
JIT:
Functions: 5
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.065 ms, Inlining 0.000 ms, Optimization 0.825 ms, Emission 7.484 ms, Total 10.375 ms
Execution Time: 10.524 ms

How do I convince postgres to choose the MUCH more efficient of two nearly identical indexes (6 orders of magnitude more efficient)

I have some huge postgres tables that seem to be using the wrong index. In a big way. Like, in a 'if I remove one index, the query performance goes up by six orders of magnitude' way. (For those of you counting, that's ~1ms to 32 minutes.) We vacuum and analyze this table daily.
Simplified table for easier parsing:
action
-----
id bigint
org bigint
created timestamp without time zone
action_time timestamp without time zone
Query:
SELECT min(created) FROM action
WHERE org = 10
AND created > NOW() - INTERVAL '25 hour'
AND action_time < NOW() - INTERVAL '1 hour'
Two indexes:
action (org, action_time, created)
action (org, created, action_time)
Let's say an org creates 200k events a day, and has been running for a year. That means that 99.99% of the items in the action table were created more than an hour ago, and action_time is almost always roughly around when they are created, with much less than 0.01% of them more than a few minutes earlier. This means that around 99.99% of rows satisfy the action_time < NOW() - INTERVAL '1 hour' clause.
On the other hand, around 0.3% of rows were created in the last 25 hours, thereby satisfying the created > NOW() - INTERVAL '25 hour' clause.
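The two selectivity figures can be checked with back-of-the-envelope arithmetic. This sketch assumes events arrive uniformly over the year and that action_time is effectively equal to created, both simplifications of the description above:

```python
rows_per_day = 200_000
days = 365
total = rows_per_day * days              # rows for one org after a year

# Rows created in the last 25 hours.
created_last_25h = rows_per_day * 25 / 24
# Rows whose action_time is more than one hour old (all but the last hour).
older_than_1h = total - rows_per_day / 24

frac_created = created_last_25h / total
frac_action = older_than_1h / total

print(f"created > now() - 25h   : {frac_created:.4%}")
print(f"action_time < now() - 1h: {frac_action:.4%}")
```

Under these assumptions about 208,000 rows match the created condition, which is close to the planner's own estimate of roughly 210,000 rows in the plan below, so the row estimate is not the problem; the choice of index is.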
So guess which index it uses?
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=55.45..55.46 rows=1 width=8)
InitPlan 1 (returns $0)
-> Limit (cost=0.71..55.45 rows=1 width=8)
-> Index Only Scan using ix_action_org_action_time_created on action (cost=0.71..11498144.88 rows=210051 width=8)
Index Cond: ((org = 50) AND (action_time IS NOT NULL) AND (action_time < (now() - '01:00:00'::interval)) AND (created > (now() - '25:00:00'::interval)))
(5 rows)
Yup! It loads the entire index and scans through literally 99.99% of it searching for the 0.3%, rather than loading 0.3% of the other index and then examining it for the matching 99.99% of those records. Of course, if I drop the second index, it immediately starts using the correct one and the performance goes up accordingly.
Postgres doesn't support index hinting, and as far as I can tell none of the workarounds that the postgres dev team says are much better than index hinting would help here in any way. Possibly there is some way to tell it that created has a roughly uniform distribution over years (and so does action_time)? Would that even help, given that I can't even imagine how it wouldn't know that already? Is there anything else that could help?
edit: explain verbose:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=57.48..57.49 rows=1 width=8)
Output: $0
InitPlan 1 (returns $0)
-> Limit (cost=0.72..57.48 rows=1 width=8)
Output: action.action_time
-> Index Only Scan using ix_action_org_action_time_created on public.action (cost=0.72..11851726.67 rows=208788 width=8)
Output: action.action_time
Index Cond: ((action.org = 10) AND (action.action_time IS NOT NULL) AND (action.action_time < (now() - '01:00:00'::interval)) AND (action.created > (now() - '25:00:00'::interval)) AND (action.created < now()))
(8 rows)
I'll add explain (analyze, buffers, verbose) should this ever finish running. Sigh.
edit2: business logic: action_time is ALMOST always before created. 99.999+% of the time. No other requirements, and even that one isn't perfect.

SQL relative versus absolute date impact query time

I run the following query and it takes 50 seconds.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= '2020-08-28'
order by created_at desc
limit 1;
explain:
Limit (cost=100.12..1439.97 rows=1 width=72)
-> Foreign Scan on yyy (cost=100.12..21537.65 rows=16 width=72)
Filter: (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120)
Then I run the following query and it takes "infinite" time: too long to wait for it to finish.
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
and created_at >= NOW() - INTERVAL '1 DAY'
order by created_at desc
limit 1;
explain:
Limit (cost=53293831.90..53293831.91 rows=1 width=72)
-> Result (cost=53293831.90..53293987.46 rows=17284 width=72)
-> Sort (cost=53293831.90..53293840.54 rows=17284 width=556)
Sort Key: yyy.created_at DESC
-> Foreign Scan on yyy (cost=100.00..53293814.62 rows=17284 width=556)
Filter: ((created_at >= (now() - '1 day'::interval)) AND (("substring"((object_key)::text, '\w+:(\d+):.*'::text))::integer = 723120))
What could cause this huge difference between those two queries? I know that indexes are used to improve performance. What can we infer from the plans?
Any contribution would be appreciated.
With a literal, the optimizer can easily plan efficient data access using the right index.
With an expression like NOW() - INTERVAL '4 DAY', you run into at least two challenges:
It is a stable expression, not an immutable one, let alone a literal.
The expression is a TIMESTAMP WITH TIME ZONE, not a DATE, so an implicit type cast is needed.
You are just making the optimizer's life difficult ...
I just created a single-column table yyy with 12 years' worth of distinct dates in my PostgreSQL database. No indexes. Already here, you see a difference in the cost of the explain plan.
$ psql -c "explain select * from yyy where created_at >= '2020-08-28'"
QUERY PLAN
------------------------------------------------------
Seq Scan on yyy (cost=0.00..74.79 rows=126 width=4)
Filter: (created_at >= '2020-08-28'::date)
And:
$ psql -c "explain select * from yyy where created_at >= now() - interval '4 day'"
QUERY PLAN
--------------------------------------------------------
Seq Scan on yyy (cost=0.00..96.70 rows=126 width=4)
Filter: (created_at >= (now() - '4 days'::interval))
(2 rows)
The difference will be much larger once an index exists ...

PostgreSQL aggregate query is very slow

I have a table, which contains a timestamp column and a source column varchar(20). I insert a couple thousand entries into this table every hour and I would like to show an aggregate on this data. My query looks like this:
EXPLAIN (analyze, buffers) SELECT
count(*) AS count
FROM frontend_car c
WHERE date_created at time zone 'cet' > now() at time zone 'cet' - interval '1 week'
GROUP BY source, date_trunc('hour', c.date_created at time zone 'CET')
ORDER BY source ASC, date_trunc('hour', c.date_created at time zone 'CET') DESC
I have already created an index like so:
create index source_date_created on
table_name(
(date_created AT TIME ZONE 'CET') DESC,
source ASC,
date_trunc('hour', date_created at time zone 'CET') DESC
);
And the output of my ANALYZE is:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=142888.08..142889.32 rows=495 width=16) (actual time=10242.141..10242.188 rows=494 loops=1)
Sort Key: source, (date_trunc('hour'::text, timezone('CET'::text, date_created)))
Sort Method: quicksort Memory: 63kB
Buffers: shared hit=27575 read=28482
-> HashAggregate (cost=142858.50..142865.93 rows=495 width=16) (actual time=10236.393..10236.516 rows=494 loops=1)
Group Key: source, date_trunc('hour'::text, timezone('CET'::text, date_created))
Buffers: shared hit=27575 read=28482
-> Bitmap Heap Scan on frontend_car c (cost=7654.61..141002.20 rows=247507 width=16) (actual time=427.894..10122.438 rows=249056 loops=1)
Recheck Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Rows Removed by Index Recheck: 141143
Heap Blocks: exact=27878 lossy=26713
Buffers: shared hit=27575 read=28482
-> Bitmap Index Scan on frontend_car_source_date_created (cost=0.00..7592.74 rows=247507 width=0) (actual time=420.415..420.415 rows=249056 loops=1)
Index Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
Buffers: shared hit=3 read=1463
Planning time: 2.430 ms
Execution time: 10242.379 ms
(17 rows)
Clearly this is too slow, and in my mind it should be computable using only indexes. If I aggregate by only time or only source it is reasonably fast, but with both together it is slow.
This is on a rather small VPS with only 512MB of ram and the database presently contains about 700k rows.
From what I read it seems that the majority of time is spent on recheck, which means that the index did not fit in memory?
It sounds like what you really need is a separate aggregate table that gets records inserted or updated via a trigger in your detailed table. The summary table would have your source column, a date/time field to hold only the date and hour portion (truncating any minutes), and finally the running count.
As records are inserted, this summary table gets updated, then your query could be directly on this table. Since it will already be pre-aggregated by source, date and hour, your query would just need to apply the where clause and order that by the source.
I'm not fluent in PostgreSQL at all, but I'm sure it has its own insert triggers. So, if you have thousands of entries each hour and, say, 50 sources, the entire aggregate summary table would hold only 24 (hours) * 50 (sources) = 1200 records per day, versus 50k, 60k, 70k+ detail rows per day. If you then need the exact details for a given date and hour, you can drill into the detail table on an as-needed basis. How many "sources" you are actually dealing with is unclear, though.
I would strongly consider this as a solution to your needs.
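A minimal sketch of the trigger-maintained summary table, run against an in-memory SQLite database from Python so it is self-contained. The names `car_hourly_summary` and `frontend_car_count` are made up for this example, time zone handling (the CET conversion in the question) is left out, and PostgreSQL would need its own trigger function syntax rather than what is shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE frontend_car (
    source       TEXT NOT NULL,
    date_created TEXT NOT NULL      -- 'YYYY-MM-DD HH:MM:SS'
);

-- Hypothetical summary table: one row per (source, hour).
CREATE TABLE car_hourly_summary (
    source TEXT    NOT NULL,
    hour   TEXT    NOT NULL,
    cnt    INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (source, hour)
);

CREATE TRIGGER frontend_car_count AFTER INSERT ON frontend_car
BEGIN
    -- Make sure the (source, hour) bucket exists, then bump its counter.
    INSERT OR IGNORE INTO car_hourly_summary (source, hour, cnt)
    VALUES (NEW.source, strftime('%Y-%m-%d %H:00:00', NEW.date_created), 0);
    UPDATE car_hourly_summary
       SET cnt = cnt + 1
     WHERE source = NEW.source
       AND hour = strftime('%Y-%m-%d %H:00:00', NEW.date_created);
END;
""")

conn.executemany(
    "INSERT INTO frontend_car (source, date_created) VALUES (?, ?)",
    [
        ("a", "2024-01-01 10:05:00"),
        ("a", "2024-01-01 10:55:00"),
        ("a", "2024-01-01 11:00:00"),
        ("b", "2024-01-01 10:30:00"),
    ],
)

# The report now reads the small pre-aggregated table only.
summary = conn.execute(
    "SELECT source, hour, cnt FROM car_hourly_summary "
    "ORDER BY source ASC, hour DESC"
).fetchall()
```

The reporting query then touches a few rows per source and hour instead of re-aggregating the detail rows on every request.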

Does "group by" automatically guarantee "order by"?

Does "group by" clause automatically guarantee that the results will be ordered by that key? In other words, is it enough to write:
select *
from table
group by a, b, c
or does one have to write
select *
from table
group by a, b, c
order by a, b, c
I know e.g. in MySQL I don't have to, but I would like to know if I can rely on it across SQL implementations. Is it guaranteed?
group by does not necessarily order the data. A DB is designed to grab the data as fast as possible and only sort if necessary.
So add the order by if you need a guaranteed order.
An efficient implementation of group by may perform the grouping by sorting the data internally. That's why some RDBMSs return sorted output when grouping. Yet the SQL specs don't mandate that behavior, so unless it is explicitly documented by the RDBMS vendor I wouldn't bet on it to work (tomorrow). OTOH, if the RDBMS implicitly does a sort, it might also be smart enough to optimize away the redundant order by.
An example using PostgreSQL proving that concept
Creating a table with 1M records, with random dates in a day range from today - 90 and indexing by date
CREATE TABLE WITHDRAW AS
SELECT (random()*1000000)::integer AS IDT_WITHDRAW,
md5(random()::text) AS NAM_PERSON,
(NOW() - ( random() * (NOW() + '90 days' - NOW()) ))::timestamp AS DAT_CREATION, -- from today back to 90 days ago
(random() * 1000)::decimal(12, 2) AS NUM_VALUE
FROM generate_series(1,1000000);
CREATE INDEX WITHDRAW_DAT_CREATION ON WITHDRAW(DAT_CREATION);
Grouping by date truncated by day of month, restricting select by dates in a two days range
EXPLAIN
SELECT
DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '2 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1
HashAggregate (cost=11428.33..11594.13 rows=11053 width=48)
Group Key: date_trunc('DAY'::text, dat_creation)
-> Bitmap Heap Scan on withdraw w (cost=237.73..11345.44 rows=11053 width=14)
Recheck Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
-> Bitmap Index Scan on withdraw_dat_creation (cost=0.00..234.97 rows=11053 width=0)
Index Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
Using a larger restriction date range, it chooses to apply a SORT
EXPLAIN
SELECT
DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '60 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1
GroupAggregate (cost=116522.65..132918.32 rows=655827 width=48)
Group Key: (date_trunc('DAY'::text, dat_creation))
-> Sort (cost=116522.65..118162.22 rows=655827 width=14)
Sort Key: (date_trunc('DAY'::text, dat_creation))
-> Seq Scan on withdraw w (cost=0.00..41949.57 rows=655827 width=14)
Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
Just by adding ORDER BY 1 at the end (there is no significant difference)
GroupAggregate (cost=116522.44..132918.06 rows=655825 width=48)
Group Key: (date_trunc('DAY'::text, dat_creation))
-> Sort (cost=116522.44..118162.00 rows=655825 width=14)
Sort Key: (date_trunc('DAY'::text, dat_creation))
-> Seq Scan on withdraw w (cost=0.00..41949.56 rows=655825 width=14)
Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
PostgreSQL 10.3
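The two plan shapes above, HashAggregate and a sort feeding GroupAggregate, can be imitated in a few lines of Python to show why only the sort-based strategy yields key order as a side effect. This is an analogy, not PostgreSQL internals; the function names and sample rows are invented for the sketch.

```python
from itertools import groupby

# Hypothetical (group_key, value) rows in arrival order.
rows = [("c", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]


def hash_aggregate(rows):
    """Group through a hash table: output order follows the table, not the keys."""
    sums = {}
    for key, value in rows:
        sums[key] = sums.get(key, 0) + value
    return list(sums.items())


def sort_aggregate(rows):
    """Sort first, then fold adjacent equal keys: output is sorted as a by-product."""
    ordered = sorted(rows, key=lambda r: r[0])
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(ordered, key=lambda r: r[0])
    ]


hashed = hash_aggregate(rows)
sorted_groups = sort_aggregate(rows)
```

Both functions return the same groups and sums, but only the sort-based one returns them in key order. A Python dict happens to keep insertion order, while a database hash table gives no order guarantee at all, so the planner's choice of strategy decides whether the output looks sorted.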
It depends on the database vendor.
For example PostgreSQL does not automatically sort the grouped result.
Here you have to use order by to get the data sorted.
But Sybase and Microsoft SQL Server do. Here you can use order by to change the default sorting.
It definitely doesn't. I have experienced this: one of my queries suddenly started to return unordered results as the data in the table grew.
I tried it with the AdventureWorks sample database from MSDN.
select HireDate, min(JobTitle)
from AdventureWorks2016CTP3.HumanResources.Employee
group by HireDate
Results:
2009-01-10 Production Technician - WC40
2009-01-11 Application Specialist
2009-01-12 Assistant to the Chief Financial Officer
2009-01-13 Production Technician - WC50
It returns data sorted by HireDate, but you should not rely on GROUP BY to sort under any circumstances.
For example, indexes can change the order of the result.
I added the following index on (JobTitle, HireDate):
CREATE NONCLUSTERED INDEX NonClusturedIndex_Jobtitle_hireddate ON [HumanResources].[Employee]
(
[JobTitle] ASC,
[HireDate] ASC
)
The result changes with the same select query:
2006-06-30 Production Technician - WC60
2007-01-26 Marketing Assistant
2007-11-11 Engineering Manager
2007-12-05 Senior Tool Designer
2007-12-11 Tool Designer
2007-12-20 Marketing Manager
2007-12-26 Production Supervisor - WC60
You can download Adventureworks2016 at the following address
https://www.microsoft.com/en-us/download/details.aspx?id=49502
It depends on the number of records. In my experience, with few records the GROUP BY output came back sorted automatically; with more records (more than about 15) an ORDER BY clause was required.