SELECT COUNT(*) WHERE DATE_PART(...) is Slow in PostgreSQL/TimescaleDB

To find the number of rows in a table temperatures that exist for every hour, I'm running a series of SQL queries on my PostgreSQL 11.2 database with TimescaleDB 1.6.0 extension. temperatures is a TimescaleDB hypertable.
For example,
SELECT COUNT(*) FROM temperatures
WHERE DATE_PART('year', "timestamp") = 2020
AND DATE_PART('month', "timestamp") = 2
AND DATE_PART('day', "timestamp") = 2
AND DATE_PART('hour', "timestamp") = 0
AND DATE_PART('minute', "timestamp") = 0
Question: However, this query appears to be very slow (I think), taking about 6-8 seconds per query with no other queries running on this database. The table temperatures contains 11.5 million rows. There are about 100-2000 rows for each hour.
Looking for suggestions on improving the speed of such queries. Thanks!

Don't apply date functions to the timestamp column: each of the five calls has to be evaluated for every row, and wrapping the column this way prevents the database from taking advantage of an existing index on the timestamp column.
This should be faster:
select count(*)
from temperatures
where
timestamp >= '2020-02-02 00:00:00'::timestamp
and timestamp < '2020-02-02 00:01:00'::timestamp
This query uses the half-open interval strategy to check the timestamp column against two constant values.
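Since the goal is a row count for every hour, the same range filter can be combined with a GROUP BY so that one query returns all hourly counts for a day instead of one query per hour. This is only a sketch, assuming the temperatures table and "timestamp" column from the question; hours without any rows will simply be absent from the result.

```sql
-- Count rows per hour for one day in a single pass.
-- The range predicate is on the bare column, so an index on "timestamp"
-- (and TimescaleDB chunk exclusion) can still be used.
SELECT date_trunc('hour', "timestamp") AS hour_start,
       count(*)                        AS row_count
FROM temperatures
WHERE "timestamp" >= '2020-02-02 00:00:00'
  AND "timestamp" <  '2020-02-03 00:00:00'
GROUP BY 1
ORDER BY 1;
```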

Related

How can I speed up this SQL query for finding previous entries "on this day"?

I'm trying to speed up this PostgreSQL query to find previous entries "on this day" in past years from a table. I currently have the query below:
select * from sample
where date_part('month', "timestamp") = date_part('month', now())
and date_part('day', "timestamp") = date_part('day', now())
order by "timestamp" desc;
This seems to get the intended result, but it is running much slower than desired. Is there a better approach for comparing the current month & day?
Also, would there be any changes to do a similar search for "this hour" over the past years? Similar to the following:
select * from sample
where date_part('month', "timestamp") = date_part('month', now())
and date_part('day', "timestamp") = date_part('day', now())
and date_part('hour', "timestamp") = date_part('hour', now())
order by "timestamp" desc;
The data is time-series in nature, using TimescaleDB as the database. Here is the current definition:
CREATE TABLE public.sample (
"timestamp" timestamptz NOT NULL DEFAULT now(),
entity varchar(256) NOT NULL,
quantity numeric NULL
);
CREATE INDEX sample_entity_time_idx ON public.sample (entity, "timestamp" DESC);
CREATE INDEX sample_time_idx ON public.sample ("timestamp" DESC);
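If you need the same day in all previous years, then I would guess that the query returns about 1/365th of the rows; that's a selectivity of roughly 0.27%. Great.
With that selectivity an index can speed up the query significantly. Now, since you are selecting non-consecutive rows you'll need a functional index. Because "timestamp" is a timestamptz, date_part() on it depends on the session time zone and is not immutable, so the index expression has to pin a time zone (UTC here; use whichever zone defines "this day" for you). I would try:
create index ix1 on sample ((date_part('doy', "timestamp" at time zone 'UTC')));
Then, you can modify your query to:
select *
from sample
where date_part('doy', "timestamp" at time zone 'UTC') = date_part('doy', now() at time zone 'UTC')
order by "timestamp" desc;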
For the current hour in the past years you would have an even better selectivity of around 1/365/24; that is about 0.01%. Awesome.
-- time zone pinned so that the index expressions are immutable
create index ix2 on sample (
(date_part('doy', "timestamp" at time zone 'UTC')),
(date_part('hour', "timestamp" at time zone 'UTC'))
);
Then, the new query could look like:
select *
from sample
where date_part('doy', "timestamp" at time zone 'UTC') = date_part('doy', now() at time zone 'UTC')
and date_part('hour', "timestamp" at time zone 'UTC') = date_part('hour', now() at time zone 'UTC')
order by "timestamp" desc;
Please post the execution plans of these queries with the indexes created. I'm curious to see how well they perform.
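One caveat with day-of-year: in a leap year every date after February 29 has a doy that is one higher than in other years, so matching on doy can drift by a day. Matching on month and day separately may be closer to "on this day". The following is a sketch under the assumption that UTC is the calendar you care about; the index name is hypothetical, and the AT TIME ZONE conversion is there because expression indexes require immutable expressions, which date_part() on a timestamptz is not.

```sql
-- Hypothetical index name; month and day taken in a fixed time zone (UTC assumed).
CREATE INDEX sample_month_day_idx ON sample (
    (date_part('month', "timestamp" AT TIME ZONE 'UTC')),
    (date_part('day',   "timestamp" AT TIME ZONE 'UTC'))
);

-- The WHERE clause repeats the indexed expressions exactly so the planner can match them.
SELECT *
FROM sample
WHERE date_part('month', "timestamp" AT TIME ZONE 'UTC') = date_part('month', now() AT TIME ZONE 'UTC')
  AND date_part('day',   "timestamp" AT TIME ZONE 'UTC') = date_part('day',   now() AT TIME ZONE 'UTC')
ORDER BY "timestamp" DESC;
```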

Most efficient way to retrieve data by timestamps

I'm using PostgreSQL 9.2.8.
I have table like:
CREATE TABLE foo
(
foo_date timestamp without time zone NOT NULL,
-- other columns, constraints
)
This table contains about 4.000.000 rows. One day data is about 50.000 rows.
My goal is to retrieve one day data as fast as possible.
I have created an index like:
CREATE INDEX foo_foo_date_idx
ON foo
USING btree
(date_trunc('day'::text, foo_date));
And now I'm selecting data like this (now() is just an example, I need data from ANY day):
select *
from process
where date_trunc('day'::text, now()) = date_trunc('day'::text, foo_date)
This query takes about 20 s.
Is there any possibility to obtain the same data in a shorter time?
It takes time to retrieve 50,000 rows. 20 seconds seems like a long time, but if the rows are wide, then that might be an issue.
You can directly index foo_date and use inequalities. So, you might try this version:
create index foo_foo_date_idx2 on foo(foo_date);
select p
from process p
where p.foo_date >= date_trunc('day', now()) and
p.foo_date < date_trunc('day', now() + interval '1 day');
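Because the question asks for data from any day rather than just today, the same half-open range can be written against a date value. A minimal sketch, assuming the foo table from the CREATE TABLE above and the plain index on foo_date; the date literal is only an example value and would normally be a bind parameter.

```sql
-- Rows for one arbitrary calendar day, using the plain index on foo_date.
SELECT *
FROM foo
WHERE foo_date >= DATE '2014-06-21'
  AND foo_date <  DATE '2014-06-21' + 1;  -- date + integer adds whole days
```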

calculating average with grouping based on time intervals

In a Postgres table I store the speed of an object at 10-second intervals. Values are not available for every 10 seconds during the day, so there may be no row for, say, today 16:39:40.
What would the query look like to get a relation containing the average speed for 1-minute (or 30-second, or n-second) intervals for a given day, assuming that missing rows mean a speed of 0?
speed_table
id (int, pk)
ts (timestamp)
speed (numeric)
I've built this query but am getting stuck on some important parts:
SELECT
date_trunc('minute', ts) AS truncated,
avg(speed)
FROM speed_table AS t
WHERE ts >= '2014-06-21 00:00:00'
AND ts <= '2014-06-21 23:59:59'
AND condition2 = 'something'
GROUP BY date_trunc('minute', ts)
ORDER BY truncated
How can I change the interval to something other than what the date_trunc function offers, e.g. 5 minutes or 30 seconds?
How can I add the missing rows for the remainder of the day?
Simple and fast solution for this particular example:
SELECT date_trunc('minute', ts) AS minute
, sum(speed)/6 AS avg_speed
FROM speed_table AS t
WHERE ts >= '2014-06-21 0:0'
AND ts < '2014-06-22 0:0' -- exclude dangling corner case
AND condition2 = 'something'
GROUP BY 1
ORDER BY 1;
You need to factor in missing rows as "0 speed". Since a minute has 6 samples, just sum and divide by 6. Missing rows evaluate to 0 implicitly.
This returns no row for minutes with no rows at all. avg_speed for such missing result rows is 0.
General query for arbitrary intervals
Works for any interval listed in the manual for date_trunc():
SELECT date_trunc('minute', g.ts) AS ts_start
, avg(COALESCE(speed, 0)) AS avg_speed
FROM (SELECT generate_series('2014-06-21 0:0'::timestamp
, '2014-06-22 0:0'::timestamp
, '10 sec'::interval) AS ts) g
LEFT JOIN speed_table t USING (ts)
WHERE (t.condition2 = 'something' OR
t.condition2 IS NULL) -- depends on actual condition!
AND g.ts <> '2014-06-22 0:0'::timestamp -- exclude dangling corner case
GROUP BY 1
ORDER BY 1;
The problematic part is the additional unknown condition. You would need to define that. And decide whether missing rows supplied by generate_series should pass the test or not (which can be tricky!).
I let them pass in my example (along with all other rows with a NULL value).
Compare:
PostgreSQL: running count of rows for a query 'by minute'
Arbitrary intervals:
Truncate timestamp to arbitrary intervals
For completely arbitrary intervals consider @Clodoaldo's math based on epoch values or use the often overlooked function width_bucket(). Example:
Aggregating (x,y) coordinate point clouds in PostgreSQL
Since you did not post sample data I could not test this, so it may contain errors. Point them out, including the error message, so I can fix them.
select
to_timestamp(
(extract(epoch from ts.ts)::integer / (60 * 2)) * (60 * 2) -- qualified: both joined relations have a ts column
) as truncated,
avg(coalesce(speed, 0)) as avg_speed
from
generate_series (
'2014-06-21 00:00:00'::timestamp,
'2014-06-22'::timestamp - interval '1 second',
'10 seconds'
) ts (ts)
left join
speed_table t on ts.ts = t.ts and condition2 = 'something'
group by 1
order by 1
The example groups by 2 minutes: the number of seconds since 1970-01-01 00:00:00 (the epoch) is integer-divided by 120 (60 * 2) and then multiplied back. To group by 5 minutes, use 60 * 5 instead; to group by 30 seconds, use 30.
The generate_series in the example generates timestamps at 10-second intervals. It is left outer joined to the speed table so it fills the gaps. When the speed is null, coalesce returns 0.
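The epoch arithmetic generalizes to any bucket width given in seconds. The sketch below isolates just the bucketing step for 5-minute (300-second) buckets; the sample rows are invented for illustration, and gap filling with generate_series would be added as in the queries above. It adds the rounded seconds back to timestamp 'epoch' so the result stays a timestamp without time zone rather than going through to_timestamp().

```sql
-- Average speed per 5-minute bucket; 300 is the bucket width in seconds.
-- The rows in the CTE are made-up sample data standing in for speed_table.
WITH speed_table (ts, speed) AS (
    VALUES ('2014-06-21 00:00:10'::timestamp, 12.0),
           ('2014-06-21 00:04:50'::timestamp, 18.0),
           ('2014-06-21 00:07:30'::timestamp, 30.0)
)
SELECT timestamp 'epoch'
         + floor(extract(epoch FROM ts) / 300) * 300 * interval '1 second' AS bucket_start,
       avg(speed) AS avg_speed
FROM speed_table
GROUP BY 1
ORDER BY 1;
```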

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only sql for postgres, to select a range of time ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5-minute intervals for that hour. The result should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records, then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute check if its on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Note that an expression index on the WHERE clause (where the value of the expression makes up the index) could lead to major speed improvements. It may not be very selective in this case, but it is good to be aware of.
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. Hard to find in bookstores now, but well worth it.
Extract the minutes, convert to int4, and see, if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time based and you just want every 5th row, or if the times are regular and you always have one record per minute, the query below gives you one record out of every 5:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.* from
tbl inner join
(
select thetimeinterval, max(timecolumn) timecolumn
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table -- placeholder table name; the original snippet omitted the FROM clause
group by bucket order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
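Note that grouping by the minute bucket alone merges, for example, 07:00 and 08:00 into the same group, which is fine for a single hour but not for a longer range. A possible variation, assuming the same your_table and your_timestamp placeholder names, adds the hour to the grouping key:

```sql
-- First reading in each 5-minute bucket, across any number of hours or days.
SELECT min(your_timestamp) AS first_in_bucket
FROM your_table
GROUP BY date_trunc('hour', your_timestamp),                       -- separates the hours
         cast(extract(minute FROM your_timestamp) AS integer) / 5  -- 0..11 within the hour
ORDER BY 1;
```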

How to filter table to date when it has a timestamp with time zone format?

I have a very large dataset - records in the hundreds of millions/billions.
I would like to filter the data in this column - i am only showing 2 records of millions:
arrival_time
2019-04-22 07:36:09.870+00
2019-06-07 09:46:09.870+00
How can I filter the data in this column by the date part only? That is, I would like to filter where the arrival_time is 2019-04-22, as this would give me the first record and any other records which have the matching date of 2019-04-22.
I have tried to cast the column to timestamp::date = "2019-04-22", but this has been costly and does not work well given that I have such a vast number of records.
sample code is:
select
*
from
mytable
where
arrival_time::timestamp::date = '2019-09-30'
Again, this is very costly if I cast to date format, as the cast will be done before the filtering!
Any ideas? I am using PostgreSQL and pgAdmin 4.
This query:
where (arrival_time::timestamp)::date = '2019-09-30'
is converting arrival_time to another type. That generally precludes the use of an index and makes it harder for the optimizer to choose the best execution path.
Instead, compare to the same data type:
where arrival_time >= '2019-09-30'::timestamp and
arrival_time < ('2019-09-30'::timestamp + interval '1 day')
You can try to filter for the upper and lower bounds of that day.
...
WHERE arrival_time >= '2019-04-22'::timestamp
AND arrival_time < '2019-04-23'::timestamp
...
That way, an index on arrival_time should be usable and help to improve performance.
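The sample values carry a +00 offset, which suggests the column is a timestamp with time zone. In that case a bound without an offset is interpreted in the session's time zone, so it can help to spell the offset out. A sketch, assuming the day is meant in UTC and that no suitable index exists yet; the index name is hypothetical.

```sql
-- Hypothetical index name; a plain b-tree index on the bare column.
CREATE INDEX mytable_arrival_time_idx ON mytable (arrival_time);

-- Half-open range for one UTC day; the column itself is left untouched.
SELECT *
FROM mytable
WHERE arrival_time >= TIMESTAMPTZ '2019-04-22 00:00:00+00'
  AND arrival_time <  TIMESTAMPTZ '2019-04-23 00:00:00+00';
```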