left join causes huge increase in time for query resolution - sql

The following query helps me to calculate the average of historical values distributed on even time intervals.
EXPLAIN ANALYZE SELECT start_date as date, AVG(hcv1.value::float) as value
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv
ON (
hv.variable_id = 3 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
Here the report of the query: https://explain.depesz.com/s/q29a
Now if I try to add an extra column value2 pointing to another variable_id the query time goes from 2 seconds to 150 seconds:
EXPLAIN ANALYZE SELECT start_date as date,
AVG(hv1.value::float) as value1,
AVG(hv2.value::float) as value2
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv1
ON (
hv1.variable_id = 2 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
LEFT JOIN history_values hv2
ON (
hv2.variable_id = 3 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
Here is the report: https://explain.depesz.com/s/V1sV
Could anybody tell me why? I was really expecting the time to be around 4 seconds, not almost 75 times more.
Also note that:
SELECT COUNT(*) FROM history_values WHERE variable_id = 2 -- ~25k records
SELECT COUNT(*) FROM history_values WHERE variable_id = 3 -- ~25k records

You're not adding an extra column, you're adding another join condition. And you don't need that extra join anyway..
Try instead, just filtering the avg()
EXPLAIN ANALYZE
SELECT start_date as date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
cast('2017-01-01' as abstime)
, cast('2017-12-01' as abstime),
, interval '86400 seconds'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at >= cast('2017-01-01' as abstime) AND
hv1.created_at <= cast('2017-12-01' as abstime) AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
As a side note, you should not ever be using abstime. That should be for internal use only. Instead, I would use
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01',
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at BETWEEN (
timestamp with time zone '2017-01-01'
AND timestamp with time zone '2017-12-01'
) AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '1 day' AND
hv1.variable_id IN (1,2)
)
GROUP BY start_date
ORDER BY start_date
I would also think you could collapse those ranges down..
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01' - interval '1 day'
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON hv1.created_at BETWEEN start_date AND (start_date + interval '1 day' )
AND hv1.variable_id IN (1,2)
GROUP BY start_date
ORDER BY start_date
In the future, please ask questions specific to PostgreSQL on http://dba.stackexchange.com. I would flag this for migration there. The admins will gladly move it.

Related

Rewrite PostgreSQL query using CTE:

I have the following code to pull records from a daterange in PostgreSQL, it works as intended. The "end date" is determined by the "date" column from the last record, and the "start date" is calculated by subtracting a 7-day interval from the "end date".
SELECT date
FROM files
WHERE daterange((
(SELECT date FROM files ORDER BY date DESC LIMIT 1) - interval '7 day')::date, -- "start date"
(SELECT date FROM files ORDER BY date DESC LIMIT 1)::date, -- "end date"
'(]') #> date::date
ORDER BY date ASC
I'm trying to rewrite this query using CTEs, so I can replace those subqueries with values such as end_date and start_date. Is this possible using this method or should I look for other alternatives like variables? I'm still learning SQL.
WITH end_date AS
(
SELECT date FROM files ORDER BY date DESC LIMIT 1
),
start_date AS
(
SELECT date FROM end_date - INTERVAL '7 day'
)
SELECT date
FROM files
WHERE daterange(
start_date::date,
end_date::date,
'(]') #> date::date
ORDER BY date ASC
Right now I'm getting the following error:
ERROR: syntax error at or near "-"
LINE 7: SELECT date FROM end_date - INTERVAL '7 day'
You do not need two CTEs, it's one just fine, which can be joined to filter data.
WITH RECURSIVE files AS (
SELECT CURRENT_DATE date, 1 some_value
UNION ALL
SELECT (date + interval '1 day')::date, some_value + 1 FROM files
WHERE date < (CURRENT_DATE + interval '1 month')::date
),
dates AS (
SELECT
(MAX(date) - interval '7 day')::date from_date,
MAX(date) to_date
FROM files
)
SELECT f.* FROM files f
JOIN dates d ON daterange(d.from_date, d.to_date, '(]') #> f.date
You even can make it to be a daterange initially in CTE and use it later like this
WITH dates AS (
SELECT
daterange((MAX(date) - interval '7 day')::date, MAX(date), '(]') range
FROM files
)
SELECT f.* FROM files f
JOIN dates d ON d.range #> f.date
Here the first CTE is used just to generate some data.
It will get all file lines for dates in the last week, excluding from_date and including to_date.
date
some_value
2022-09-26
25
2022-09-27
26
2022-09-28
27
2022-09-29
28
2022-09-30
29
2022-10-01
30
2022-10-02
31
I think this is what you want:
WITH end_date AS
(
SELECT date FROM files ORDER BY date DESC LIMIT 1
),
start_date AS
(
SELECT date - INTERVAL '7 day' as date
FROM end_date
)
SELECT F.date, S.date startDate, E.date endDate
FROM files F
JOIN start_date S on F.date >= S.date
JOIN end_date E on F.date <= E.date
ORDER BY date ASC;
I hope I'm not repeating anything, but if I understand your problem correctly I think this will work:
with cte as (
select max (date)::date as max_date from files
)
select date
from files
cross join cte
where date >= max_date - 7
Or perhaps even:
select date
from files
where date >= (select max (date)::date - 7 from files)
Since you have already determined that the CTE has the max date, there is really no need to further bound it with a between, <= or range. You can simply say anything after that date minus 7 days.
The error in your code above is because you want this:
SELECT date - INTERVAL '7 day' as date FROM end_date
And not this:
SELECT date FROM end_date - INTERVAL '7 day'
You are subtracting from the table, which doesn't make sense.

How to get the percentage change from same time 7 days ago?

I have a big PostgreSQL database with time series data.
I query the data with a resample to one hour. What I want is to compare the the mean value from the last hour to the value 7 days ago at the same time and I don't know how to do it.
This is what I use to get the latest value.
SELECT DATE_TRUNC('hour', datetime) AS time, AVG(value) as value, id FROM database
GROUP BY id, time
WHERE datetime > now()- '01:00:00'::interval
You can use a CTE to calculate last week's average in the same time period, then join on id and hour.
with last_week as
(
SELECT
id,
extract(hour from datetime) as time,
avg(value) as avg_value
FROM my_table
where DATE_TRUNC('hour', datetime) =
(date_trunc('hour', now() - interval '7 DAYS'))
group by 1,2
)
select n.id,
DATE_TRUNC('hour', n.datetime) AS time_now,
avg(n.value) as avg_now,
t.avg_value as avg_last_week
from my_table n
left join last_week t
on t.id = n.id
and t.time = extract(hour from n.datetime)
where datetime > now()- '01:00:00'::interval
group by 1,2,4
order by 1
I'm making a few assumptions on how your data appear.
**EDIT - JUST NOTICED YOU ASKED FOR PERCENT CHANGE
Showing change as decimal...
select id,
extract(hour from time_now) as hour_now,
avg_now,
avg_last_week,
coalesce(((avg_now - avg_last_week) / avg_last_week), 0) AS CHANGE
from (
with last_week as
(
SELECT
id,
extract(hour from datetime) as time,
avg(value) as avg_value
FROM my_table
where DATE_TRUNC('hour', datetime) =
(date_trunc('hour', now() - interval '7 DAYS'))
group by 1,2
)
select n.id,
DATE_TRUNC('hour', n.datetime) AS time_now,
avg(n.value) as avg_now,
t.avg_value as avg_last_week
from my_table n
left join last_week t
on t.id = n.id
and t.time = extract(hour from n.datetime)
where datetime > now()- '01:00:00'::interval
group by 1,2,4
)z
group by 1,2,3,4
order by 1,2
db-fiddle found here: https://www.db-fiddle.com/f/rWJATypGzHPZ8sG2vXAGXC/4

how to get date different in postgres using date_part option

How to get date time difference in PostgreSQL
I am using below syntax
select id, A_column,B_column,
(SELECT count(*) AS count_days_no_weekend
FROM generate_series(B_column ::timestamp , A_column ::timestamp, interval '1 day') the_day
WHERE extract('ISODOW' FROM the_day) < 5) * 24 + DATE_PART('hour', B_column::timestamp-A_column ::timestamp ) as hrs
FROM table req where id='123';
If A_column=2020-05-20 00:00:00 and B_column=2020-05-15 00:00:00 I want to get 72(in hours).
Is there any possibility to skip weekends(Saturday and Sunday) in first one, it means to get the result as 72 hours(exclude weekend hours)
i am getting 0
But i need to get 72 hours
And if If A_column=2020-08-15 12:00:00 and B_column=2020-08-15 00:00:00 I want to get 12(in hours).
One option uses a lateral join and generate_series() to enumerate each and every hour between the two timestamps, while filtering out week-ends:
select t.a_column, t.b_column, h.count_hours_no_weekend
from mytable t
cross join lateral (
select count(*) count_hours_no_weekend
from generate_series(t.b_column::timestamp, t.a_column::timestamp, interval '1 hour') s(col)
where extract('isodow' from s.col) < 5
) h
where id = 123
I would attack this by calculating the weekend hours to let the database deal with daylight savings time. I would then subtract the intervening weekend hours from the difference between the two date values.
with weekend_days as (
select *, date_part('isodow', ddate) as dow
from table1
cross join lateral
generate_series(
date_trunc('day', b_column),
date_trunc('day', a_column),
interval '1 day') as gs(ddate)
where date_part('isodow', ddate) in (6, 7)
), weekend_time as (
select id,
sum(
least(ddate + interval '1 day', a_column) -
greatest(ddate, b_column)
) as we_ival
from weekend_days
group by id
)
select t.id,
a_column - b_column as raw_difference,
coalesce(we_ival, interval '0') as adjustment,
a_column - b_column -
coalesce(we_ival, interval '0') as adj_difference
from weekend_time w
left join table1 t on t.id = w.id;
Working fiddle.

Need to split datetime ranges by each day

I have a table that needs to be split on the basis of datetime
Input Table
ID| Start | End
--------------------------------------------
A | 2019-03-04 23:18:04| 2019-03-04 23:21:25
--------------------------------------------
A | 2019-03-04 23:45:05| 2019-03-05 00:15:14
--------------------------------------------
Required Output
ID| Start | End
--------------------------------------------
A | 2019-03-04 23:18:04| 2019-03-04 23:21:25
--------------------------------------------
A | 2019-03-04 23:45:05| 2019-03-04 23:59:59
--------------------------------------------
A | 2019-03-05 00:00:00| 2019-03-05 00:15:14
--------------------------------------------
Thanks!!
Try this below code. This will only work if the start and end date fall in two consecutive day. Not if the start and end date difference is more than 1 day.
MSSQL:
SELECT ID,[Start],[End]
FROM Input_Table A
WHERE DATEDIFF(DD,[Start],[End]) = 0
UNION ALL
SELECT ID,[Start], CAST(CAST(CAST([Start] AS DATE) AS VARCHAR(MAX)) +' 23:59:59' AS DATETIME)
FROM Input_Table A
WHERE DATEDIFF(DD,[Start],[End]) > 0
UNION ALL
SELECT ID,CAST(CAST([End] AS DATE) AS DATETIME),[End]
FROM Input_Table A
WHERE DATEDIFF(DD,[Start],[End]) > 0
ORDER BY 1,2,3
PostgreSQL:
SELECT ID,
TO_TIMESTAMP(startDate,'YYYY-MM-DD HH24:MI:SS'),
TO_TIMESTAMP(endDate, 'YYYY-MM-DD HH24:MI:SS')
FROM mytemp A
WHERE DATE_PART('day', endDate::date) -
DATE_PART('day',startDate::date) = 0
UNION ALL
SELECT ID,
TO_TIMESTAMP(startDate,'YYYY-MM-DD HH24:MI:SS'),
TO_TIMESTAMP(CONCAT(CAST(CAST (startDate AS DATE) AS VARCHAR) ,
' 23:59:59') , 'YYYY-MM-DD HH24:MI:SS')
FROM mytemp A
WHERE DATE_PART('day', endDate::date) -
DATE_PART('day',startDate::date) > 0
UNION ALL
SELECT ID,
TO_TIMESTAMP(CAST(CAST (endDate AS DATE) AS VARCHAR) ,
'YYYY-MM-DD HH24:MI:SS') ,
TO_TIMESTAMP(endDate,'YYYY-MM-DD HH24:MI:SS')
FROM mytemp A
WHERE DATE_PART('day', endDate::date) -
DATE_PART('day',startDate::date) > 0;
PostgreSQL Demo Here
demo:db<>fiddle
This works even when range crosses more than one day
WITH cte AS (
SELECT
id,
start_time,
end_time,
gs,
lag(gs) over (PARTITION BY id ORDER BY gs) -- 2
FROM
a
LEFT JOIN LATERAL
generate_series(start_time::date + 1, end_time::date, interval '1 day') gs --1
ON TRUE
)
SELECT -- 3
id,
COALESCE(lag, start_time) AS start_time,
gs - interval '1 second'
FROM
cte
WHERE gs IS NOT NULL
UNION
SELECT DISTINCT ON (id) -- 4
id,
CASE WHEN start_time::date = end_time::date THEN start_time ELSE end_time::date END, -- 5
end_time
FROM
cte
CTE: the generate_series function generates one row per day new day. So, there is no value if there is no date change
CTE: the lag() window function allows to move the current date value into the next row (the current end is the next start)
With this data set you can calculate the new start and end values. If there is no gs value: There is no date change. This ignored at this point. For all cases with date changes: If there is no lag value, it is the beginning (so it cannot got a previous value). In this case, the normal start_time is taken, otherwise it is a new day which takes the date break time. The end_time is taken with the last second of the day (interval - '1 second')
The second part: Because of the date breaks there is always one additional record which need to be unioned. The last record is from the beginning of the end_time (so cast to date). The CASE clause combines this step with the case of no date change which has been ignored so far. So if start_time and end_time are at the same date, here the original start_time is taken.
Unfortunately, Redshift doesn't have a convenient way to generate a series of numbers. If you table is big enough, you can use it to generate numbers. "Big enough" means that the number of rows is greater than the longest span. Perhaps another table would work, if not this one.
Once you have that, you can use this logic:
with n as (
select row_number() over () - 1 as n
from t
)
select t.id,
greatest(t.s, date_trunc('day', t.s) + n.n * interval '1 day') as s,
least(t.e, date_trunc('day', t.s) + (n.n + 1) * interval '1 day' - interval '1 second') as e
from t join
n
on t.e >= date_trunc('day', t.s) + n.n * interval '1 day';
Here is a db<>fiddle. It uses an old version of Postgres, but not quite old enough for Redshift.
Simulate loop for interval generation using recursive CTE, i.e. take range from start to midnight in seed row, take another day in subsequent rows etc.
with recursive input as (
select 'A' as id, timestamp '2019-03-04 23:18:04' as s, timestamp '2019-03-04 23:21:25' as e union
select 'A' as id, timestamp '2019-03-04 23:45:05' as s, timestamp '2019-03-05 00:15:14' as e union
select 'B' as id, timestamp '2019-03-06 23:45:05' as s, timestamp '2019-03-08 00:15:14' as e union
select 'C' as id, timestamp '2019-03-10 23:45:05' as s, timestamp '2019-03-15 00:15:14' as e
), generate_id as (
select row_number() over () as unique_id, * from input
), rec (unique_id, id, s, e) as (
select unique_id, id, s, least(e, s::date::timestamp + interval '1 day')
from generate_id seed
union
select remaining.unique_id, remaining.id, previous.e, least(remaining.e, previous.e::date::timestamp + interval '1 day')
from rec as previous
join generate_id remaining on previous.unique_id = remaining.unique_id and previous.e < remaining.e
)
select id, s, e from rec
order by id,s,e
Note:
your id column appears not to be unique, so I added custom unique_id column. If id was unique, CTE generate_id was unnecessary. Uniqueness is unavoidable for recursive query to work.
close-open range is better for representation of such data, rather than close-close range. So end time in my query returns 00:00:00, not 23:59:59. If it's not suitable for you, modify query as an exercise.
UPDATE: query works on Postgres. OP originally tagged question postgres, then changed tag to redshift.

update table with dates with month

There's a table dates_calendar:
id | date
-------------------------
13 | 2016-10-23 00:00:00
14 | 2016-10-24 00:00:00
I need to update this table and insert dates until the next month counting from the last date in the table. E.g. last date is 2016-10-24 00:00:00 - I need to insert dates till 2016-10-31. After that (the last date now is 2016-10-31) next statement call should insert dates till 2016-11-30 and so on.
Example of my SQL code, but it inserts 30 days all the time.
INSERT INTO dates_calendar (date)
VALUES (
generate_series(
(SELECT date FROM dates_calendar ORDER BY date DESC LIMIT 1) + interval '1 day',
(SELECT date FROM dates_calendar ORDER BY date DESC LIMIT 1) + interval '1 month',
'1 day'
)
);
I'm using PostgreSQL. As well would be fine to get rid of a duplicated SELECT statement of the last date.
insert into dates_calendar (date)
select dates::date
from (
select max(date)::date+ 1 next_day, '1day'::interval one_day, '1month'::interval one_month
from dates_calendar
) s,
generate_series(
next_day,
date_trunc('month', next_day)+ one_month- one_day,
one_day) dates;
To calculate the first and last date you need to insert you can use this query:
select max(date) + interval '1' day as first_day,
date_trunc('month', max(date) + interval '1' month) - interval '1' day as last_day
from dates_calendar
The expression date_trunc('month', max(date) + interval '1' month) calculates the start date of the next month. Subtracting one day from that will give you the last day of that month.
This can then be used to generate the list of dates:
with from_to (first_day, last_day) as (
select max(date) + interval '1' day,
date_trunc('month', max(date) + interval '1' month) - interval '1' day
from dates_calendar
)
select dt
from generate_series( (select first_day from from_to), (select last_day from from_to), interval '1' day) as t(dt);
And finally this can be used to insert the generated rows into the table:
with from_to (first_day, last_day) as (
select max(date) + interval '1' day,
date_trunc('month', max(date) + interval '1' month) - interval '1' day
from dates_calendar
)
insert into dates_calendar (date)
select dt
from generate_series( (select first_day from from_to), (select last_day from from_to), interval '1' day) as t(dt);
with max_date (d) as (select max(date)::date from dates_calendar)
insert into dates_calendar (date)
select d
from generate_series (
(select d from max_date) + 1,
(select date_trunc('month', d + interval '1 month')::date - 1 from max_date),
'1 day'
) g(d)