postgres query to group the records by hourly interval with date field - sql

I have a table that has some file input data with file_id and file_input_date. I want to filter / group these file_ids depending on file_input_date. The problem is that my date is in the format YYYY-MM-DD HH:mm:ss, and I want to go further and group them by hour, not just by date.
Edit: some sample data
file_id | file_input_date
597872 | 2023-01-12 16:06:22.92879
497872 | 2023-01-11 16:06:22.92879
397872 | 2023-01-11 16:06:22.92879
297872 | 2023-01-11 17:06:22.92879
297872 | 2023-01-11 17:06:22.92879
297872 | 2023-01-11 17:06:22.92879
297872 | 2023-01-11 18:06:22.92879
what I want to see is
1 for 2023-01-12 16:06
2 for 2023-01-11 16:06
3 for 2023-01-11 17:06
1 for 2023-01-11 18:06
The output format will be different, but this gives an idea of what I want.

You could convert the dates to strings with the format you want and group by it:
SELECT TO_CHAR(file_input_date, 'YYYY-MM-DD HH24:MI'), COUNT(*)
FROM mytable
GROUP BY TO_CHAR(file_input_date, 'YYYY-MM-DD HH24:MI')
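If you only want hour granularity (as the question title asks), a minimal variant of the same idea, assuming the same mytable, is to drop the minutes from the format string:
SELECT TO_CHAR(file_input_date, 'YYYY-MM-DD HH24'), COUNT(*)
FROM mytable
GROUP BY TO_CHAR(file_input_date, 'YYYY-MM-DD HH24')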

To get to the hour rather than the minute:
create table date_grp (file_id integer, file_input_date timestamp);
INSERT INTO date_grp VALUES
(597872, '2023-01-12 16:06:22.92879'),
(497872, '2023-01-11 16:06:22.92879'),
(397872, '2023-01-11 16:06:22.92879'),
(297872, '2023-01-11 17:06:22.92879'),
(297872, '2023-01-11 17:06:22.92879'),
(297872, '2023-01-11 17:06:22.92879'),
(297872, '2023-01-11 18:06:22.92879');
SELECT
    date_trunc('hour', file_input_date),
    count(date_trunc('hour', file_input_date))
FROM
    date_grp
GROUP BY
    date_trunc('hour', file_input_date);
date_trunc | count
---------------------+-------
01/11/2023 18:00:00 | 1
01/11/2023 17:00:00 | 3
01/12/2023 16:00:00 | 1
01/11/2023 16:00:00 | 2
(4 rows)
Though if you want to go down to the minute:
SELECT
    date_trunc('minute', file_input_date),
    count(date_trunc('minute', file_input_date))
FROM
    date_grp
GROUP BY
    date_trunc('minute', file_input_date);
date_trunc | count
---------------------+-------
01/11/2023 18:06:00 | 1
01/11/2023 16:06:00 | 2
01/12/2023 16:06:00 | 1
01/11/2023 17:06:00 | 3
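Since date_trunc() returns a timestamp, you can still format the bucket as text afterwards. A small sketch against the same date_grp table, assuming you want labels in the question's YYYY-MM-DD HH24:MI style:
SELECT
    to_char(date_trunc('hour', file_input_date), 'YYYY-MM-DD HH24:MI') AS hour_bucket,
    count(*)
FROM
    date_grp
GROUP BY
    1;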

Related

SQL time-series resampling

I have clickhouse table with some rows like that
id | created_at
6962098097124188161 | 2022-07-01 00:00:00
6968111372399976448 | 2022-07-02 00:00:00
6968111483775524864 | 2022-07-03 00:00:00
6968465518567268352 | 2022-07-04 00:00:00
6968952917160271872 | 2022-07-07 00:00:00
6968952924479332352 | 2022-07-09 00:00:00
I need to resample the time series and get a running count by date, like this:
created_at | count
2022-07-01 00:00:00 | 1
2022-07-02 00:00:00 | 2
2022-07-03 00:00:00 | 3
2022-07-04 00:00:00 | 4
2022-07-05 00:00:00 | 4
2022-07-06 00:00:00 | 4
2022-07-07 00:00:00 | 5
2022-07-08 00:00:00 | 5
2022-07-09 00:00:00 | 6
I've tried this
SELECT
    arrayJoin(
        timeSlots(
            MIN(created_at),
            toUInt32(24 * 3600 * 10),
            24 * 3600
        )
    ) AS ts,
    SUM(COUNT(*)) OVER (ORDER BY ts)
FROM table
but it counts all rows.
How can I get the expected result?
Why not group by created_at? Something like:
select count(*) from table_name group by toDate(created_at)
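If you also need the running total and the gap-filled dates (2022-07-05, 2022-07-06, 2022-07-08) from the expected output, a rough sketch (untested; assumes a ClickHouse version with window functions and that the table is really called table_name) is to combine the daily GROUP BY with ORDER BY ... WITH FILL and a cumulative window sum:
SELECT
    d AS created_at,
    sum(cnt) OVER (ORDER BY d) AS running_count
FROM
(
    SELECT toDate(created_at) AS d, count(*) AS cnt
    FROM table_name
    GROUP BY d
    ORDER BY d WITH FILL
)
ORDER BY created_at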

Create table with 15 minutes interval on date time in Snowflake

I am trying to create a table in Snowflake with a 15-minute interval. I have tried the generator, but that doesn't give me the 15-minute interval. Is there any function I can use to generate and build this table for a couple of years' worth of data?
Such as:
Date | Hour
2022-03-29 | 02:00 AM
2022-03-29 | 02:15 AM
2022-03-29 | 02:30 AM
2022-03-29 | 02:45 AM
2022-03-29 | 03:00 AM
2022-03-29 | 03:15 AM
......... | ........
Thanks
Use the following as a time generator with a 15-minute interval, then use other date/time functions as needed to extract the date part or the time part into separate columns.
with CTE as (
    select timestampadd(min, seq4()*15, date_trunc(hour, current_timestamp())) as time_count
    from table(generator(rowcount => 4*24))
)
select time_count from CTE;
+-------------------------------+
| TIME_COUNT |
|-------------------------------|
| 2022-03-29 14:00:00.000 -0700 |
| 2022-03-29 14:15:00.000 -0700 |
| 2022-03-29 14:30:00.000 -0700 |
| 2022-03-29 14:45:00.000 -0700 |
| 2022-03-29 15:00:00.000 -0700 |
| 2022-03-29 15:15:00.000 -0700 |
.
.
.
....truncated output
| 2022-03-30 13:15:00.000 -0700 |
| 2022-03-30 13:30:00.000 -0700 |
| 2022-03-30 13:45:00.000 -0700 |
+-------------------------------+
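To split that into the Date and Hour columns from the question, one option (a sketch reusing the same CTE; the column aliases are just illustrative) is to cast the generated timestamp:
with CTE as (
    select timestampadd(min, seq4()*15, date_trunc(hour, current_timestamp())) as time_count
    from table(generator(rowcount => 4*24))
)
select time_count::date as date_part, time_count::time as time_part
from CTE;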
There are already many answers to this question here.
But the major point to note is that you must not use SEQx() as the number generator (you can use it in the ORDER BY, but that is not needed). As noted in the docs:
Important
This function uses sequences to produce a unique set of increasing integers, but does not necessarily produce a gap-free sequence. When operating on a large quantity of data, gaps can appear in a sequence. If a fully ordered, gap-free sequence is required, consider using the ROW_NUMBER window function.
CREATE TABLE table_of_2_years_date_times AS
SELECT
date_time::date as date,
date_time::time as time
FROM (
SELECT
row_number() over (order by null)-1 as rn
,dateadd('minute', 15 * rn, '2022-03-01'::date) as date_time
from table(generator(rowcount=>4*24*365*2))
)
ORDER BY rn;
then selecting the top/bottom:
(SELECT * FROM table_of_2_years_date_times ORDER BY date,time LIMIT 5)
UNION ALL
(SELECT * FROM table_of_2_years_date_times ORDER BY date desc,time desc LIMIT 5)
ORDER BY 1,2;
DATE | TIME
2022-03-01 | 00:00:00
2022-03-01 | 00:15:00
2022-03-01 | 00:30:00
2022-03-01 | 00:45:00
2022-03-01 | 01:00:00
2024-02-28 | 22:45:00
2024-02-28 | 23:00:00
2024-02-28 | 23:15:00
2024-02-28 | 23:30:00
2024-02-28 | 23:45:00

How to aggregate rows in the range of timestamp in vertica db (vsql)

Suppose I have a table with data like this:
ts | bandwidth_bytes
---------------------+-----------------
2021-08-27 22:00:00 | 3792
2021-08-27 21:45:00 | 1164
2021-08-27 21:30:00 | 7062
2021-08-27 21:15:00 | 3637
2021-08-27 21:00:00 | 2472
2021-08-27 20:45:00 | 1328
2021-08-27 20:30:00 | 1932
2021-08-27 20:15:00 | 1434
2021-08-27 20:00:00 | 1530
2021-08-27 19:45:00 | 1457
2021-08-27 19:30:00 | 1948
2021-08-27 19:15:00 | 1160
I need to output something like this:
ts | bandwidth_bytes
---------------------+-----------------
2021-08-27 22:00:00 | 15,655
2021-08-27 21:00:00 | 7166
2021-08-27 20:00:00 | 6095
I want to sum bandwidth_bytes over 1-hour windows of the timestamp.
I want to do this in vsql specifically.
More columns are present, but for simplicity I have shown only these two.
You can use date_trunc():
select date_trunc('hour', ts) as ts_hh, sum(bandwidth_bytes)
from t
group by ts_hh;
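Note that date_trunc() labels each bucket by the start of the hour, while the expected output labels it by the end (21:15 to 22:00 reported as 22:00). One hedged variation on the same query, shifting by a second before truncating and then adding an hour:
select date_trunc('hour', ts - interval '1 second') + interval '1 hour' as ts_hh,
       sum(bandwidth_bytes)
from t
group by ts_hh;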
Use Vertica's lovely function TIME_SLICE().
You can not only go by the hour, but also by slices of 2 or 3 hours, which DATE_TRUNC() does not offer.
You seem to want everything between 20:00:01 and 21:00:00 to belong to a time slice of 21:00:00. In both DATE_TRUNC() and TIME_SLICE(), however, it's 20:00:00 to 20:59:59 that belongs to the same time slice. So I subtracted one second before applying TIME_SLICE().
WITH
-- your in data ...
indata(ts,bandwidth_bytes) AS (
SELECT TIMESTAMP '2021-08-27 22:00:00',3792
UNION ALL SELECT TIMESTAMP '2021-08-27 21:45:00',1164
UNION ALL SELECT TIMESTAMP '2021-08-27 21:30:00',7062
UNION ALL SELECT TIMESTAMP '2021-08-27 21:15:00',3637
UNION ALL SELECT TIMESTAMP '2021-08-27 21:00:00',2472
UNION ALL SELECT TIMESTAMP '2021-08-27 20:45:00',1328
UNION ALL SELECT TIMESTAMP '2021-08-27 20:30:00',1932
UNION ALL SELECT TIMESTAMP '2021-08-27 20:15:00',1434
UNION ALL SELECT TIMESTAMP '2021-08-27 20:00:00',1530
UNION ALL SELECT TIMESTAMP '2021-08-27 19:45:00',1457
UNION ALL SELECT TIMESTAMP '2021-08-27 19:30:00',1948
UNION ALL SELECT TIMESTAMP '2021-08-27 19:15:00',1160
)
SELECT
TIME_SLICE(ts - INTERVAL '1 SECOND' ,1,'HOUR','END') AS ts
, SUM(bandwidth_bytes) AS bandwidth_bytes
FROM indata
GROUP BY 1
ORDER BY 1 DESC;
ts | bandwidth_bytes
---------------------+-----------------
2021-08-27 22:00:00 | 15655
2021-08-27 21:00:00 | 7166
2021-08-27 20:00:00 | 6095
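For example, to bucket into 2-hour slices, which DATE_TRUNC() cannot do, only the slice length needs to change (a sketch against the same indata CTE):
SELECT
  TIME_SLICE(ts - INTERVAL '1 SECOND', 2, 'HOUR', 'END') AS ts
, SUM(bandwidth_bytes) AS bandwidth_bytes
FROM indata
GROUP BY 1
ORDER BY 1 DESC;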

generate date range between min and max dates Athena presto SQL sequence error

I'm attempting to generate a series of dates in Presto SQL (Athena) using unnest and sequence, something similar to generate_series in Postgres.
my table looks like
job_name | run_date
A | '2021-08-21'
A | '2021-08-25'
B | '2021-08-07'
B | '2021-08-24'
SELECT d.job_name, d.run_date
FROM (
VALUES
('A', '2021-08-21'), ('A', '2021-08-25'),
('B', '2021-08-07'), ('B', '2021-08-24')
) d(job_name, run_date)
I'm aiming for an output as follows
job_name | run_date
A | 2021-08-21
A | 2021-08-22
A | 2021-08-23
A | 2021-08-24
A | 2021-08-25
B | 2021-08-07
B | 2021-08-08
B | 2021-08-09
B | 2021-08-10
B | 2021-08-11
B | 2021-08-12
B | 2021-08-13
B | 2021-08-14
B | 2021-08-15
B | 2021-08-16
B | 2021-08-17
B | 2021-08-18
B | 2021-08-19
B | 2021-08-20
B | 2021-08-21
B | 2021-08-22
B | 2021-08-23
B | 2021-08-24
I've attempted to use the following query to achieve this - however I get an error when trying to unnest my date sequence
SELECT t.job_name, d.dte
FROM (SELECT job_name
, min(run_date) as mind
, max(run_date) as maxd
, SEQUENCE(min(run_date), max(run_date)) as date_arr
FROM job_log_table t
GROUP BY job_name
) jd
CROSS JOIN
UNNEST(jd.date_arr) d(dte)
LEFT JOIN job_log_table t
ON t.job_name = jd.job_name
AND t.latest_date = d.dte;
which yields the following error :
[HY000][100071] [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. [ErrorCategory:USER_ERROR, ErrorCode:SYNTAX_ERROR], Detail:SYNTAX_ERROR: line 5:14: Unexpected parameters (date, date) for function sequence. Expected: sequence(bigint, bigint, bigint) , sequence(bigint, bigint) , sequence(timestamp, timestamp, interval day to second) , sequence(timestamp, timestamp, interval year to month)
Is this a limitation of Athena's flavour of Presto SQL or have I made a school boy error somewhere?
You need to provide an interval to generate a date sequence (in this case, interval '1' day):
WITH dataset AS (
SELECT *
FROM
( VALUES
('A', DATE '2021-08-21'), ('A', DATE '2021-08-25'),
('B', DATE '2021-08-07'), ('B', DATE '2021-08-24')
) AS d (job_name, run_date)
)
select job_name, sequence(min(run_date), max(run_date), interval '1' day) seq
from dataset
group by job_name
Output:
job_name | seq
A | [2021-08-21 00:00:00.000, 2021-08-22 00:00:00.000, 2021-08-23 00:00:00.000, 2021-08-24 00:00:00.000, 2021-08-25 00:00:00.000]
B | [2021-08-07 00:00:00.000, 2021-08-08 00:00:00.000, 2021-08-09 00:00:00.000, 2021-08-10 00:00:00.000, 2021-08-11 00:00:00.000, 2021-08-12 00:00:00.000, 2021-08-13 00:00:00.000, 2021-08-14 00:00:00.000, 2021-08-15 00:00:00.000, 2021-08-16 00:00:00.000, 2021-08-17 00:00:00.000, 2021-08-18 00:00:00.000, 2021-08-19 00:00:00.000, 2021-08-20 00:00:00.000, 2021-08-21 00:00:00.000, 2021-08-22 00:00:00.000, 2021-08-23 00:00:00.000, 2021-08-24 00:00:00.000]
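To turn those arrays back into one row per date, as in the expected output, you can CROSS JOIN UNNEST the sequence and cast the timestamps back to dates; a sketch reusing the same dataset CTE (dte is just an illustrative alias):
WITH dataset AS (
    SELECT *
    FROM ( VALUES
        ('A', DATE '2021-08-21'), ('A', DATE '2021-08-25'),
        ('B', DATE '2021-08-07'), ('B', DATE '2021-08-24')
    ) AS d (job_name, run_date)
)
SELECT j.job_name, date(dte) AS run_date
FROM (
    SELECT job_name, sequence(min(run_date), max(run_date), interval '1' day) AS seq
    FROM dataset
    GROUP BY job_name
) j
CROSS JOIN UNNEST(j.seq) AS t(dte)
ORDER BY job_name, run_date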

compare oracle row count between different dates hourly

I am using this SQL to query the count of rows per hour for the last three days ...
select trunc(sendtime ,'hh24') , count(*)
FROM t_sendedmsglog
where msgcontext like '%sm_%_tone_succ%' and sendtime > sysdate -3
group by trunc(sendtime ,'hh24')
order by trunc(sendtime ,'hh24') desc;
and the result looks like this, for example:
# | TRUNC(SENDTIME,'HH24') | COUNT(*)
1 | 10/15/2020 12:00:00 PM | 593
2 | 10/15/2020 11:00:00 AM | 889
3 | 10/15/2020 10:00:00 AM | 854
4 | 10/15/2020 9:00:00 AM | 1027
5 | 10/15/2020 8:00:00 AM | 8409
.
.
.
12 | 10/15/2020 1:00:00 AM | 101
13 | 10/15/2020 | 281
14 | 10/14/2020 11:00:00 PM | 722
15 | 10/14/2020 10:00:00 PM | 1381
16 | 10/14/2020 9:00:00 PM | 2123
.
.
25 | 10/14/2020 12:00:00 PM | 1195
26 | 10/14/2020 11:00:00 AM | 1699
27 | 10/14/2020 10:00:00 AM | 747
28 | 10/14/2020 9:00:00 AM | 827
.
.
40 | 10/13/2020 9:00:00 PM | 2058
41 | 10/13/2020 8:00:00 PM | 2800
But how can I make the result appear like below instead, so I can compare the counts between different days for the same hour?
hour|10/12/2020|10/13/2020|10/14/2020|count(*)
11:00:00 PM|618 |509 |722 |
10:00:00 PM|3181|1144|1381|
09:00:00 PM|3520|2058|2123|
08:00:00 PM|3688|2800|9347|
07:00:00 PM|3648|3166|3469|
06:00:00 PM|3628|2973|4518|
05:00:00 PM|3644|2429|3607|
04:00:00 PM|3652|3678|2291|
03:00:00 PM|1017|7711|819 |
02:00:00 PM|814 |7693|1310|
01:00:00 PM|856 |825 |848 |
12:00:00 PM|558 |1531|1195|
11:00:00 AM|0 |1132|1699|
10:00:00 AM|0 |732 |747 |
09:00:00 AM|0 |709 |827 |
08:00:00 AM|0 |1256|947 |
07:00:00 AM|0 |1465|1502|
06:00:00 AM|0 |749 |780 |
05:00:00 AM|0 |181 |169 |
04:00:00 AM|0 |46 |32 |
03:00:00 AM|0 |23 |34 |
02:00:00 AM|0 |46 |39 |
01:00:00 AM|0 |82 |81 |
00:00:00 AM|0 | |218 |
Use conditional aggregation:
select trunc(sendtime, 'hh24') , count(*) as total,
sum(case when trunc(sendtime) = trunc(sysdate) - interval '2' day then 1 else 0 end) as yester2day,
sum(case when trunc(sendtime) = trunc(sysdate) - interval '1' day then 1 else 0 end) as yesterday,
sum(case when trunc(sendtime) = trunc(sysdate) - interval '0' day then 1 else 0 end) as today
from t_sendedmsglog
where msgcontext like '%sm_%_tone_succ%' and
sendtime >= trunc(sysdate) - interval '2' day
group by trunc(sendtime, 'hh24')
order by trunc(sendtime, 'hh24') desc;
Note that I tweaked the date comparison in the where clause as well. In Oracle, sysdate has a time component, which you don't care about for filtering purposes.
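That query still groups on trunc(sendtime, 'hh24'), which keeps the date in the key. To get the hour-of-day pivot shown in the question (one row per hour, one column per day), a hedged sketch is to group on the hour alone, for example via TO_CHAR (the column aliases are only illustrative):
select to_char(sendtime, 'HH24') as hour_of_day,
       sum(case when trunc(sendtime) = trunc(sysdate) - 2 then 1 else 0 end) as two_days_ago,
       sum(case when trunc(sendtime) = trunc(sysdate) - 1 then 1 else 0 end) as yesterday,
       sum(case when trunc(sendtime) = trunc(sysdate) then 1 else 0 end) as today,
       count(*) as total
from t_sendedmsglog
where msgcontext like '%sm_%_tone_succ%'
  and sendtime >= trunc(sysdate) - 2
group by to_char(sendtime, 'HH24')
order by hour_of_day desc;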