I have a set of data for some tickets, with datetime of when they were opened and closed (or NULL if they are still open).
+------------------+------------------+
| opened_on | closed_on |
+------------------+------------------+
| 2019-09-01 17:00 | 2020-01-01 13:37 |
| 2020-04-14 11:00 | 2020-05-14 14:19 |
| 2020-03-09 10:00 | NULL |
+------------------+------------------+
We would like to generate a table of data showing the total count of tickets that were open through time, grouped by date. Something like the following:
+------------------+------------------+
| date | num_open |
+------------------+------------------+
| 2019-09-01 00:00 | 1 |
| 2019-09-02 00:00 | 1 |
| etc... | |
| 2020-01-01 00:00 | 0 |
| 2020-01-02 00:00 | 0 |
| etc... | |
| 2020-03-08 00:00 | 0 |
| 2020-03-09 00:00 | 1 |
| etc... | |
| 2020-04-14 00:00 | 2 |
+------------------+------------------+
Note that I am not sure how num_open should be considered for a given date - from the point of view of the start of the date or the end of it, i.e. if a ticket opened and closed on the same date, should that count as 0?
This is in Postgres, so I thought about using window functions for this, but trying to truncate by the date is making it complex. I have tried using a generate_series function to create the date series to join onto, but when I use the aggregate functions, I've "lost" access to the individual ticket datetimes.
You can use generate_series() to build the list of dates, and then a left join on inequality conditions to bring in the table:
select s.dt, count(t.opened_on) num_open
from generate_series(date '2019-09-01', date '2020-09-01', '1 day') s(dt)
left join mytable t
on s.dt >= t.opened_on and s.dt < coalesce(t.closed_on, 'infinity')
group by s.dt
Actually, this seems a bit closer to what you want:
select s.dt, count(t.opened_on) num_open
from generate_series(date '2019-09-01', date '2020-09-01', '1 day') s(dt)
left join mytable t
on s.dt >= t.opened_on::date and s.dt < coalesce(t.closed_on::date, 'infinity')
group by s.dt
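Regarding the open question about same-day tickets: with the strict upper bound above (<), a ticket is not counted on the date it closes (the end-of-day view), so a ticket opened and closed on the same date counts as 0. If the start-of-day view is wanted instead, a minimal variant (assuming the same mytable) makes the bound inclusive:
select s.dt, count(t.opened_on) num_open
from generate_series(date '2019-09-01', date '2020-09-01', '1 day') s(dt)
left join mytable t
    -- <= also counts a ticket as open on the date it closes
    on s.dt >= t.opened_on::date and s.dt <= coalesce(t.closed_on::date, 'infinity')
group by s.dt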
I have a table that keeps date data in insert_dttm as smalldatetime, and I have a column named count as integer in the same table. I need to get the max count for every day between two datetimes, day by day. How can I get it?
Your question is not clear. Is this what you want?
create table t(
insert_dttm smalldatetime,
name_count int);
insert into t values
('2022-01-01',1),
('2022-02-01',1),
('2022-03-01',1),
('2022-01-05',2),
('2022-02-20',2);
select
name_count,
min(insert_dttm) first_date,
max(insert_dttm) last_date,
datediff(day, min(insert_dttm), max(insert_dttm)) days_between
from t
group by name_count;
name_count | first_date | last_date | days_between
---------: | :--------------- | :--------------- | -----------:
1 | 2022-01-01 00:00 | 2022-03-01 00:00 | 59
2 | 2022-01-05 00:00 | 2022-02-20 00:00 | 46
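If the goal really is the max count for each day between two datetimes, a hedged sketch (same table, same T-SQL dialect as above; the date bounds are placeholders) might be closer:
select
    cast(insert_dttm as date) as dt,
    max(name_count) as max_count
from t
-- placeholder bounds; substitute your two datetimes
where insert_dttm >= '2022-01-01' and insert_dttm < '2022-03-02'
group by cast(insert_dttm as date)
order by dt;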
I'm using TimescaleDB in my PostgreSQL and I have the following two Tables:
windows_log
| windows_log_id | timestamp | computer_id | log_count |
------------------------------------------------------------------
| 1 | 2021-01-01 00:01:02 | 382 | 30 |
| 2 | 2021-01-02 14:59:55 | 382 | 20 |
| 3 | 2021-01-02 19:08:24 | 382 | 20 |
| 4 | 2021-01-03 13:05:36 | 382 | 10 |
| 5 | 2021-01-03 22:21:14 | 382 | 40 |
windows_reliability_score
| computer_id (FK) | timestamp | reliability_score |
--------------------------------------------------------------
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-02 22:21:14 | 1 |
| 382 | 2021-01-02 22:21:14 | 3 |
| 382 | 2021-01-03 22:21:14 | 7 |
| 382 | 2021-01-03 22:21:14 | 8 |
| 382 | 2021-01-03 22:21:14 | 9 |
Note: both tables are indexed on the timestamp column (hypertable)
So I'm trying to get the average reliability_score for each time bucket, but it just gives me the average for everything, instead of the average per specific bucket...
This is my query:
SELECT time_bucket_gapfill(CAST(1 * INTERVAL '1 day' AS INTERVAL), wl.timestamp) AS timestamp,
COALESCE(SUM(log_count), 0) AS log_count,
AVG(reliability_score) AS reliability_score
FROM windows_log wl
JOIN reliability_score USING (computer_id)
WHERE wl.timestamp >= '2021-01-01 00:00:00.0' AND wl.timestamp < '2021-01-04 00:00:00.0'
GROUP BY timestamp
ORDER BY timestamp asc
This is the result I'm looking for:
| timestamp | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 | 30 | 6 |
| 2021-01-02 00:00:00 | 20 | 2 |
| 2021-01-03 00:00:00 | 20 | 8 |
But this is what I get:
| timestamp | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 | 30 | 5.75 |
| 2021-01-02 00:00:00 | 20 | 5.75 |
| 2021-01-03 00:00:00 | 20 | 5.75 |
Given what we can glean from your example, there's no simple way to do a join between these two tables, with the given functions, and achieve the results you want. The schema, as presented, just makes that difficult.
If this is really what your data/schema look like, then one solution is to use multiple CTEs to get the two values from each distinct table and then join based on bucket and computer.
WITH wrs AS (
SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
computer_id,
AVG(reliability_score) AS reliability_score
FROM windows_reliability_score
WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, 2
),
wl AS (
SELECT time_bucket_gapfill('1 day', wl.timestamp) bucket, wl.computer_id,
sum(log_count) total_logs
FROM windows_log wl
WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, 2
)
SELECT wrs.bucket, wrs.computer_id, reliability_score, total_logs
FROM wrs LEFT JOIN wl ON wrs.bucket = wl.bucket AND wrs.computer_id = wl.computer_id;
The filtering has to be applied inside each CTE: predicate pushdown from the outer query likely wouldn't happen, so without it you would scan the entire hypertable before the date filter is applied (not what you want, I assume).
I tried to quickly re-create your sample schema, so I apologize if I got names wrong somewhere.
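As a side note, if the gapfilled buckets should carry values instead of NULLs, TimescaleDB's locf() (last observation carried forward) can wrap the aggregate inside a gapfill query; a sketch on the wrs CTE:
SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
       computer_id,
       -- carry the previous bucket's average into gapfilled buckets
       locf(AVG(reliability_score)) AS reliability_score
FROM windows_reliability_score
WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, 2;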
The main issue is that the join condition is on column computer_id, where both tables have the same value 382. Thus each row from table windows_log is joined with each row from table reliability_score (a Cartesian product of all rows). Also, the grouping is done on column timestamp, which is ambiguous and is resolved to timestamp from windows_log. As a result, the average uses all values of reliability_score for each value of the timestamp from windows_log, which explains the undesired result.
This resolution of the grouping ambiguity, decided in favor of the input column, i.e., the table column, is explained in the GROUP BY description in the SELECT documentation:
In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.
To avoid having groups that include all rows matching on computer id, windows_log_id can be used for grouping. This also makes it possible to bring log_count into the query result. And if it is desired to keep the output name timestamp, GROUP BY should use a reference to the position. For example:
SELECT time_bucket_gapfill('1 day'::INTERVAL, rs.timestamp) AS timestamp,
AVG(reliability_score) AS reliability_score,
log_count
FROM windows_log wl
JOIN reliability_score rs USING (computer_id)
WHERE rs.timestamp >= '2021-01-01 00:00:00.0' AND rs.timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, windows_log_id, log_count
ORDER BY timestamp asc
For ORDER BY it is not an issue, since the output name is used. From the same doc:
If an ORDER BY expression is a simple name that matches both an output column name and an input column name, ORDER BY will interpret it as the output column name.
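A minimal illustration of both rules, using a hypothetical table t(ts timestamptz, v int):
-- GROUP BY ts binds to the input column t.ts, so this groups by the
-- raw timestamps rather than by the truncated output column:
SELECT date_trunc('day', ts) AS ts, count(*)
FROM t
GROUP BY ts;

-- GROUP BY 1 groups by the truncated value, while ORDER BY ts
-- binds to the output column:
SELECT date_trunc('day', ts) AS ts, count(*)
FROM t
GROUP BY 1
ORDER BY ts;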
I have a pretty simple table with 2 columns. The first one shows time (timestamp), the second one shows the speed of a car at that time (float8).
| DATE_TIME | SPEED |
|---------------------|-------|
| 2018-11-09 00:00:00 | 256 |
| 2018-11-09 01:00:00 | 659 |
| 2018-11-09 02:00:00 | 256 |
| other dates | xxx |
| 2018-11-21 21:00:00 | 651 |
| 2018-11-21 22:00:00 | 515 |
| 2018-11-21 23:00:00 | 849 |
Let's say we have the period from 9 November to 21 November. How can I group that period by week? In fact I want this result:
| DATE_TIME | AVG_SPEED |
|---------------------|-----------|
| 9-11 November | XXX |
| 12-18 November | YYY |
| 19-21 November | ZZZ |
I use PostgreSQL 10.4.
I use this SQL statement to get the number of the week of a certain date:
SELECT EXTRACT(WEEK FROM TIMESTAMP '2018-11-09 00:00:00');
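(For reference, this returns 45: EXTRACT(WEEK ...) gives the ISO 8601 week number, where weeks start on Monday.)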
EDIT:
@tim-biegeleisen when I set the period from '2018-11-01' to '2018-11-13' your SQL statement returns 2 results:
In fact I need this result:
2018-11-01 00:00:00 | 2018-11-04 23:00:00
2018-11-05 00:00:00 | 2018-11-11 23:00:00
2018-11-12 00:00:00 | 2018-11-13 05:00:00
As you can see in the calendar there are 3 weeks in that period.
We can do this using a calendar table. This answer groups by ISO week (via EXTRACT(week ...)), with the first and last groups clamped to the bounds of your date range. You could also do this with a different definition of a week.
WITH dates AS (
SELECT date_trunc('day', dd)::date AS dt
FROM generate_series
( '2018-11-09'::timestamp
, '2018-11-21'::timestamp
, '1 day'::interval) dd
),
cte AS (
SELECT t1.dt, t2.DATE_TIME, t2.SPEED,
EXTRACT(week from t1.dt) week
FROM dates t1
LEFT JOIN yourTable t2
ON t1.dt = t2.DATE_TIME::date
)
SELECT
MIN(dt)::text || '-' || MAX(dt) AS DATE_TIME,
AVG(SPEED) AS AVG_SPEED
FROM cte
GROUP BY
week
ORDER BY
MIN(dt);
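Alternatively, if standard ISO weeks are acceptable, date_trunc can bucket directly; a sketch against the same hypothetical yourTable (note the first and last buckets show full week boundaries instead of being clamped to the data range):
SELECT
    date_trunc('week', DATE_TIME)::date AS week_start,
    (date_trunc('week', DATE_TIME) + interval '6 days')::date AS week_end,
    AVG(SPEED) AS AVG_SPEED
FROM yourTable
WHERE DATE_TIME >= '2018-11-09' AND DATE_TIME < '2018-11-22'
GROUP BY 1, 2
ORDER BY 1;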
I have a sales table:
SALES
|---------|-------------|-------------|
| order | ammount | date |
|---------|-------------|-------------|
| 001 | $2,000 | 2018-01-01 |
| 002 | $3,000 | 2018-01-01 |
| 003 | $1,500 | 2018-01-03 |
| 004 | $1,700 | 2018-01-04 |
| 005 | $1,800 | 2018-01-09 |
| 006 | $4,200 | 2018-01-11 |
|---------|-------------|-------------|
Additionally, I have a table that groups said sales according to arbitrary time periods:
BUDGET PERIODS
|---------|-------------|--------------|
| ID | start_date | end_date |
|---------|-------------|--------------|
| 1 | 2018-01-01 | 2018-01-02 | <- notice this is a 2 day period...
| 2 | 2018-01-03 | 2018-01-05 | <-- but this is 3 days
|---------|-------------|--------------|
So, my result table query looked like this:
GROUPED SALES
|--------------|---------------|-----------------|
| start_date | end_date | ammount |
|--------------|---------------|-----------------|
| 2018-01-01 | 2018-01-02 | $5,000 |
| 2018-01-03 | 2018-01-05 | $3,200 |
|--------------|---------------|-----------------|
I accomplished it with a query like this:
SELECT
bp.start_date,
bp.end_date,
SUM(s.ammount)
FROM
budget_periods bp
LEFT JOIN
sales s ON s.date >= bp.start_date AND s.date <= bp.end_date
GROUP BY
start_date,
end_date
Everything is awesome, then. BUT, I notice that, of course, some sales are not included because they don't fall within any budget period. Hence, I want to include them "somewhere". I decided that "somewhere" would be the week of the sale (using the week truncate function in Postgres). Hence, my grouped sales should now look like this:
GROUPED SALES
|--------------|---------------|-----------------|
| start_date | end_date | ammount |
|--------------|---------------|-----------------|
| 2018-01-01 | 2018-01-02 | $5,000 |
| 2018-01-03 | 2018-01-05 | $3,200 |
| 2018-01-08 | 2018-01-14 | $6,000 |
|--------------|---------------|-----------------|
Notice that if you truncate both 2018-01-09 and 2018-01-11 to the week, you get 2018-01-08. To calculate the end_date, the budget period is "defaulted" to seven days, so it's six days later than the start_date.
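For reference, a quick sanity check of that truncation and defaulting in Postgres:
SELECT DATE_TRUNC('WEEK', DATE '2018-01-09')::date;                      -- 2018-01-08 (a Monday)
SELECT (DATE_TRUNC('WEEK', DATE '2018-01-09') + INTERVAL '6 DAY')::date; -- 2018-01-14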
So, I modified the query into a FULL JOIN like this:
SELECT
COALESCE(bp.start_date, DATE_TRUNC('WEEK', s.date)) AS new_start_date,
COALESCE(bp.end_date, DATE_TRUNC('WEEK', s.date) + INTERVAL '6 DAY') AS new_end_date,
SUM(s.ammount)
FROM
budget_periods bp
FULL JOIN
sales s ON s.date >= bp.start_date AND s.date <= bp.end_date
GROUP BY
new_start_date,
new_end_date
But then, the result table is the same as when I had a LEFT JOIN. How should I approach this?
Thank you for your time in reading such a long explanation of the issue.
If you want all rows in sales then make that the first table in a LEFT JOIN. However, I think the FULL JOIN should work, as should this LEFT JOIN:
SELECT COALESCE(bp.start_date, DATE_TRUNC('WEEK', s.date)) as new_start_date,
COALESCE(bp.end_date, DATE_TRUNC('WEEK', s.date) + interval '6 day') as new_end_date,
SUM(s.ammount)
FROM sales s LEFT JOIN
budget_periods bp
ON s.date >= bp.start_date AND s.date <= bp.end_date
GROUP BY new_start_date, new_end_date;
The only reason things would be filtered out of the FULL JOIN is through a WHERE clause, but you don't have one.
I have a structure like this
+-----+-----+------------+----------+------+----------------------+---+
| Row | id | date | time | hour | description | |
+-----+-----+------------+----------+------+----------------------+---+
| 1 | foo | 2018-03-02 | 19:00:00 | 8 | across single day | |
| 2 | bar | 2018-03-02 | 23:00:00 | 1 | end at midnight | |
| 3 | qux | 2018-03-02 | 10:00:00 | 3 | inside single day | |
| 4 | quz | 2018-03-02 | 23:15:00 | 2 | with minutes | |
+-----+-----+------------+----------+------+----------------------+---+
(I added the description column only to give context; for analysis purposes it is useless)
Here is the statement to generate the table:
WITH table AS (
SELECT "foo" as id, CURRENT_DATE() AS date, TIME(19,0,0) AS time, 8 AS hour
UNION ALL
SELECT "bar", CURRENT_DATE(), TIME(23,0,0), 1
UNION ALL
SELECT "qux", CURRENT_DATE(), TIME(10,0,0), 3
UNION ALL
SELECT "quz", CURRENT_DATE(), TIME(23,15,0), 2
)
SELECT * FROM table
Adding the hour value to the given time, I need to split the row into multiple ones if the sum spills over into the next day.
Jumps over multiple days, like +27 hours, are NOT to be considered (this should simplify the scenario).
My initial idea was to start by adding the hours value to a datetime field, in order to obtain the start and end limits of the interval:
SELECT
id,
DATETIME(date, time) AS date_start,
DATETIME_ADD(DATETIME(date, time), INTERVAL hour HOUR) AS date_end
FROM table
Here is the result:
+-----+-----+---------------------+---------------------+---+
| Row | id | date_start | date_end | |
+-----+-----+---------------------+---------------------+---+
| 1 | foo | 2018-03-02T19:00:00 | 2018-03-03T03:00:00 | |
| 2 | bar | 2018-03-02T23:00:00 | 2018-03-03T00:00:00 | |
| 3 | qux | 2018-03-02T10:00:00 | 2018-03-02T13:00:00 | |
| 4 | quz | 2018-03-02T23:15:00 | 2018-03-03T01:15:00 | |
+-----+-----+---------------------+---------------------+---+
but now I'm stuck on how to proceed with the resulting interval.
Starting from this table, the rows should be split if the day changes, like:
+-----+-----+------------+-------------+----------+-------+--+
| Row | id | date | hour_start | hour_end | hours | |
+-----+-----+------------+-------------+----------+-------+--+
| 1 | foo | 2018-03-02 | 19:00:00 | 00:00:00 | 5 | |
| 2 | foo | 2018-03-03 | 00:00:00 | 03:00:00 | 3 | |
| 3 | bar | 2018-03-02 | 23:00:00 | 00:00:00 | 1 | |
| 4 | qux | 2018-03-02 | 10:00:00 | 13:00:00 | 3 | |
| 5 | quz | 2018-03-02 | 23:15:00 | 00:00:00 | 0.75 | |
| 6 | quz | 2018-03-03 | 00:00:00 | 01:15:00 | 1.25 | |
+-----+-----+------------+-------------+----------+-------+--+
I tried to adapt a similar, already-analyzed scenario, but I was unable to make it handle the day component as well.
My whole final scenario will involve both this approach and the one analyzed in the other question (split on single days, then split on given breaks of hours), but I can approach these 2 themes separately: first split by day (this question), then split on time breaks (the other question).
Interesting problem ... I tried the following:
Create a second table containing all the new rows starting at midnight
UNION ALL it with the source table while correcting the hours of the old rows accordingly
Commented Result:
WITH table AS (
SELECT "foo" as id, CURRENT_DATE() AS date, TIME(19,0,0) AS time, 8 AS hour
UNION ALL
SELECT "bar", CURRENT_DATE(), TIME(23,0,0), 1
UNION ALL
SELECT "qux", CURRENT_DATE(), TIME(10,0,0), 3
)
,table2 AS (
SELECT
id,
-- create datetime, add hours, then cast as date again
CAST( datetime_add( datetime(date, time), INTERVAL hour HOUR) AS date) date,
time(0,0,0) AS time -- losing minutes and seconds
-- subtract the hours that fall before midnight
,hour - (24-EXTRACT(HOUR FROM time)) hour
FROM
table
WHERE
date != CAST( datetime_add( datetime(date,time), INTERVAL hour HOUR) AS date) )
SELECT
id
,date
,time
-- correct hour if midnight split
,IF(EXTRACT(hour from time)+hour > 24,24-EXTRACT(hour from time),hour) hour
FROM
table
UNION ALL
SELECT
*
FROM
table2
Hope it makes sense.
Of course, if you need to consider jumps over multiple days, the correction fails :)
Here is a possible solution I came up with, starting from @Martin Weitzmann's approach.
I used 2 different ways:
ids where there is a "jump" to the next day
ids which stay within the same day
and a final UNION ALL of the two sets of data
I forgot to mention the first time that the input hours value can be a float (fractions of hours), so I added that too.
#standardSQL
WITH
input AS (
-- change of day
SELECT "bap" as id, CURRENT_dATE() AS date, TIME(19,0,0) AS time, 8.0 AS hour UNION ALL
-- end at midnight
SELECT "bar", CURRENT_dATE(), TIME(23,0,0), 1.0 UNION ALL
-- inside single day
SELECT "foo", CURRENT_dATE(), TIME(10,0,0), 3.0 UNION ALL
-- change of day with minutes and float hours
SELECT "qux", CURRENT_dATE(), TIME(23,15,0), 2.5 UNION ALL
-- start from midnight
SELECT "quz",CURRENT_dATE(), TIME(0,0,0), 4.5
),
-- Calculate end_date and end_time summing hours value
table AS (
SELECT
id,
date AS start_date,
time AS start_time,
EXTRACT(DATE FROM DATETIME_ADD(DATETIME(date,time), INTERVAL CAST(hour*3600 AS INT64) SECOND)) AS end_date,
EXTRACT(TIME FROM DATETIME_ADD(DATETIME(date,time), INTERVAL CAST(hour*3600 AS INT64) SECOND)) AS end_time
FROM input
),
-- portion that starts at start_time and ends at midnight
start_to_midnight AS (
SELECT
id,
start_time,
start_date,
TIME(23,59,59) as end_time,
start_date as end_date
FROM
table
WHERE end_date > start_date
),
-- portion that starts at midnight and ends at end_time
midnight_to_end AS (
SELECT
id,
TIME(0,0,0) as start_time,
end_date as start_date,
end_time,
end_date
FROM
table
WHERE
end_date > start_date
-- Avoid rows that start at 0:0:0 and end at 0:0:0 (the original row ends at 0:0:0)
AND end_time != TIME(0,0,0)
)
-- Union of the 3 tables
SELECT
id,
start_date,
start_time,
end_time
FROM (
SELECT id, start_time, end_time, start_date FROM table WHERE start_date = end_date
UNION ALL
SELECT id, start_time, end_time, start_date FROM start_to_midnight
UNION ALL
SELECT id, start_time, end_time, start_date FROM midnight_to_end
)
ORDER BY id,start_date,start_time
Here is the resulting output:
+-----+-----+------------+------------+----------+---+
| Row | id | start_date | start_time | end_time | |
+-----+-----+------------+------------+----------+---+
| 1 | bap | 2018-03-03 | 19:00:00 | 23:59:59 | |
| 2 | bap | 2018-03-04 | 00:00:00 | 03:00:00 | |
| 3 | bar | 2018-03-03 | 23:00:00 | 23:59:59 | |
| 4 | foo | 2018-03-03 | 10:00:00 | 13:00:00 | |
| 5 | qux | 2018-03-03 | 23:15:00 | 23:59:59 | |
| 6 | qux | 2018-03-04 | 00:00:00 | 01:45:00 | |
| 7 | quz | 2018-03-03 | 00:00:00 | 04:30:00 | |
+-----+-----+------------+------------+----------+---+
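The desired output earlier in the question also had an hours column per split. If that is needed, it can be derived from each row's time span; a hedged sketch (BigQuery), wrapping the final SELECT above as a hypothetical subquery named splits:
SELECT
  id, start_date, start_time, end_time,
  -- Note: rows capped at TIME(23,59,59) come out one second short
  -- (e.g. 0.9997 instead of 1.0) because of that end-of-day convention
  ROUND(TIME_DIFF(end_time, start_time, SECOND) / 3600.0, 4) AS hours
FROM splits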