I would like to subtract two timestamps to get the hours between them. I used the DAYS_BETWEEN function, but it returns the error Invalid operation: function days_between has no timezone setup. Below is a sample table with the timestamps that I want to subtract.
job_number | timestamp 1         | timestamp 2
-----------|---------------------|--------------------
123456789  | 2020-03-16 16:59:26 | 2020-03-17 10:58:25
134232125  | 2020-03-18 08:57:05 | 2020-03-19 01:47:26
The HOURS_BETWEEN function is the cleanest way to find the number of full hours between two timestamps in Db2:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061478.html
The HOURS_BETWEEN function returns the number of full hours between the specified arguments.
For example
VALUES
( HOURS_BETWEEN('2020-03-17-10.58.25', '2019-03-16-16.59.26')
, HOURS_BETWEEN('2020-03-19-01.47.26', '2019-03-18-08.57.05')
)
returns
1 |2
----|----
8801|8800
Note that the value is negative if the first argument is less than the second.
Also note that this function does not exist in versions of Db2 (for LUW) lower than 11.1.
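To see the sign behavior with the question's own values, a quick sketch (same Db2 11.1+ syntax as above):
VALUES HOURS_BETWEEN('2020-03-16-16.59.26', '2020-03-17-10.58.25')
-- returns -17: the first argument is roughly 17 hours 59 minutes earlier than the second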
You can use the TIMESTAMPDIFF() function with a numeric-expression of 8, which represents the difference in hours:
SELECT job_number, TIMESTAMPDIFF(8,CHAR(timestamp2 - timestamp1)) AS ts_diff
FROM T
Demo
The TIMESTAMPDIFF function has quite a specific implementation: it estimates the difference from the duration between the timestamps, assuming 365-day years and 30-day months, so its result can deviate from the real number of hours. See Table 3. TIMESTAMPDIFF computations.
I've set earlier year (2019) for timestamp 1 values deliberately.
WITH TAB (job_number, timestamp1, timestamp2) AS
(
VALUES
(123456789, TIMESTAMP('2019-03-16-16.59.26'), TIMESTAMP('2020-03-17-10.58.25'))
, (134232125, TIMESTAMP('2019-03-18-08.57.05'), TIMESTAMP('2020-03-19-01.47.26'))
)
SELECT job_number
, TIMESTAMPDIFF(8, CHAR(TIMESTAMP2 - TIMESTAMP1)) HOURS_TSDIFF
, ((DAYS(TIMESTAMP2) - DAYS(TIMESTAMP1)) * 86400 + MIDNIGHT_SECONDS(TIMESTAMP2) - MIDNIGHT_SECONDS(TIMESTAMP1)) / 3600 HOURS_REAL
FROM TAB;
The result is:
|JOB_NUMBER |HOURS_TSDIFF|HOURS_REAL |
|-----------|------------|-----------|
|123456789 |8777 |8801 |
|134232125 |8776 |8800 |
I have got a table as follows,
ID  | CreatedAt_1         | CreatedAt_2
----|---------------------|--------------------
ABC | 2022-06-10 20:28:37 |
CFR | 2022-06-13 10:00:12 | 2022-06-10 20:28:14
PFR | 2022-06-17 12:20:40 |
XYZ | 2022-06-15 11:00:12 | 2022-06-10 16:45:05
DFL | 2022-06-13 15:00:06 |
FGT | 2022-06-20 10:00:20 | 2022-06-10 13:34:55
I already used this query to count the number of rows on a specific date for each column separately:
SELECT
(CAST(datetrunc('day', 'createdAt_1' + (INTERVAL '1 day'))) AS timestamp) + (INTERVAL '-1 day')) AS 'new user',
count(*) AS 'count'
FROM Table
WHERE time_interval
GROUP BY 'new user'
And get something like:
Day        | Count
-----------|------
2022-06-10 | 1
2022-06-13 | 2
2022-06-15 | 1
I would like to be able to compare both columns and get the percentage for a specific day, count(createdAt_1) / count(createdAt_2) * 100, but I don't see how to do it easily.
I wasn't able to verify whether Metabase SQL supports the SQL standard coalesce() function, but the purpose of that function is to return the first non-null value it encounters among the parameters passed into it (left to right). If it is supported, I suggest the query below.
SELECT
datetrunc('day', coalesce("createdAt_1", "createdAt_2")) AS "new user"
, count(*) AS "count"
FROM TABLE
-- WHERE time_interval is true ??
GROUP BY
datetrunc('day', coalesce("createdAt_1", "createdAt_2"))
coalesce("createdAt_1", "createdAt_2") would return the value in createdAt_1 if that isn't NULL, but would return the value in createdAt_2 if createdAt_1 is NULL. If both columns are NULL then overall it returns NULL.
Note that mixed-case identifiers and labels are typically denoted by double quotes, e.g. "new user"; single quotes would make them string literals.
I don't believe you need to add or subtract intervals, or convert to timestamp. The datetrunc() function should be sufficient in itself.
I have not used the "new user" alias in the group by clause as this is my preference
It isn't clear to me what that where clause is achieving.
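As for the percentage you describe, count(column) counts only non-null values, so a sketch along the same lines (the "day" and "pct" aliases are mine, and this again assumes datetrunc() and standard aggregates are available):
SELECT
datetrunc('day', coalesce("createdAt_1", "createdAt_2")) AS "day"
, count("createdAt_1") * 100.0 / nullif(count("createdAt_2"), 0) AS "pct"
FROM TABLE
GROUP BY
datetrunc('day', coalesce("createdAt_1", "createdAt_2"))
nullif() avoids a division-by-zero error on days where createdAt_2 has no values.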
I have a table in PostgreSQL 13 that looks like this (modified for the purpose of this question):
SELECT * FROM visits.visitors_log;
visitor_id | day | source
--------------+------------------------+----------
9 | 2019-12-30 12:10:10-05 | Twitter
7 | 2019-12-14 22:10:26-04 | Netflix
5 | 2019-12-13 15:21:04-05 | Netflix
9 | 2019-12-22 23:34:47-05 | Twitter
7 | 2019-12-22 00:10:26-04 | Netflix
9 | 2019-12-22 13:20:42-04 | Twitter
After converting the times to another timezone, I want to calculate the percentage of visits on 2019-12-22 that came from a specific source.
There are 4 steps involved:
Convert timezones
Calculate how many total visits happened on that day
Calculate how many total visits happened on that day that came from source Netflix
Divide those 2 numbers to get the percentage.
I wrote this code which works but seems repetitive and not very clean:
SELECT (SELECT COUNT(*) FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST' FROM visits.visitors_log WHERE day::date = '2019-12-22') AS a
WHERE day::date = '2019-12-22' AND source = 'Netflix') * 100.0
/
(SELECT COUNT(*) FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST' FROM visits.visitors_log WHERE day::date = '2019-12-22') AS b
WHERE day::date = '2019-12-22')
AS visitors_percentage;
Can anyone suggest a neater way of answering this question?
Use an aggregate FILTER clause:
SELECT count(*) FILTER (WHERE source = 'Netflix') * 100.0
/ count(*) AS visitors_percentage
FROM visits.visitors_log
WHERE day >= timestamp '2019-12-22' AT TIME ZONE 'PST'
AND day < timestamp '2019-12-23' AT TIME ZONE 'PST';
See:
Aggregate columns with additional (distinct) filters
I rephrased the WHERE condition so it is "sargable" and can use an index on (day). A predicate with an expression on the column cannot use a plain index. So I moved the computation of inclusive lower and exclusive upper bound (day boundaries for the given time zone) to the right side of the expressions in the WHERE clause.
Makes a huge difference for performance with big tables.
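For reference, a minimal index that the rewritten predicate can use (the index name is my choice):
CREATE INDEX visitors_log_day_idx ON visits.visitors_log (day);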
If you use that query a lot, consider creating a function for it:
CREATE OR REPLACE FUNCTION my_func(_source text, _day date, _tz text)
RETURNS numeric
LANGUAGE sql STABLE PARALLEL SAFE AS
$func$
SELECT round(count(*) FILTER (WHERE source = _source) * 100.0 / count(*), 2) AS visitors_percentage
FROM visits.visitors_log
WHERE day >= _day::timestamp AT TIME ZONE _tz
AND day < (_day + 1)::timestamp AT TIME ZONE _tz;
$func$;
Call:
SELECT my_func('Netflix', '2019-12-22', 'PST');
I threw in round(), which is a totally optional addition.
db<>fiddle here
Aside: "day" is a rather misleading name for a timestamp with time zone column.
Hmmm . . . You can use window functions to calculate the total:
SELECT source, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as visitors_percentage
FROM visits.visitors_log
WHERE (day AT TIME ZONE 'PST')::date = '2019-12-22'
GROUP BY SOURCE
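That returns one row per source. To pull out just the Netflix row from that per-source breakdown, a sketch wrapping it in a subquery:
SELECT visitors_percentage
FROM (
SELECT source, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS visitors_percentage
FROM visits.visitors_log
WHERE (day AT TIME ZONE 'PST')::date = '2019-12-22'
GROUP BY source
) sub
WHERE source = 'Netflix';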
Good day, all. I wrote a question relating to this earlier, but now I have encountered another problem.
I have to calculate the timestamp difference between the install_time and contributor_time columns. HOWEVER, I have three contributor_time columns, and I need to select the latest time from those columns first, then subtract it from install_time.
Sample Data
users | install_time | contributor_time_1 | contributor_time_2 | contributor_time_3
------|--------------|--------------------|--------------------|-------------------
1     | 8:00         | 7:45               | 7:50               | 7:55
2     | 10:00        | 9:15               | 9:45               | 9:30
3     | 11:00        | 10:30              | null               | null
For example, in the table above I would need to select contributor_time_3 and subtract it from install_time for user 1. For user 2, I would do the same, but with contributor_time_2.
Sample Results
users | install_time | time_diff_min
------|--------------|--------------
1     | 8:00         | 5
2     | 10:00        | 15
3     | 11:00        | 30
The problem I am facing is that 1) the contributor_time columns are in string format, and 2) some of them have 'null' string values (which means that I cannot cast them to a timestamp).
I created a query, but I am facing an error stating that I cannot subtract a string from a timestamp. So I added safe_cast; however, the time_diff_min results only show when all three contributor_time columns cast to a timestamp. For example, from the sample table above, only the first two rows will pull.
The query I have so far is below:
SELECT
users,
install_time,
TIMESTAMP_DIFF(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), MINUTE) as ctct_min
FROM
(SELECT
users,
install_time,
safe_cast(contributor_time_1 as timestamp) as contributor_time_1,
safe_cast(contributor_time_2 as timestamp) as contributor_time_2,
safe_cast(contributor_time_3 as timestamp) as contributor_time_3,
FROM
(SELECT
users,
install_time,
case when contributor_time_1 = 'null' then '0' else contributor_time_1 end as contributor_time_1,
....
FROM datasource
Any help to point me in the right direction is appreciated! Thank you in advance!
Consider below
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
greatest(
parse_time('%H:%M',contributor_time_1),
parse_time('%H:%M',contributor_time_2),
parse_time('%H:%M',contributor_time_3)
),
minute) as time_diff_min
from `project.dataset.table`
If applied to the sample data in your question, the output matches the expected results above.
Above can be refactored slightly into below
create temp function latest_time(arr any type) as ((
select parse_time('%H:%M',val) time
from unnest(arr) val
order by time desc
limit 1
));
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
latest_time([contributor_time_1, contributor_time_2, contributor_time_3]),
minute) as time_diff_min
from `project.dataset.table`
Less verbose and no redundant parsing, with the same result, so it is just a matter of preference.
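For the 'null' strings the question mentions, the temp-function variant needs only one extra filter; a hedged sketch (the added WHERE clause skips the literal text 'null', and latest_time() returns NULL when all three values are missing):
create temp function latest_time(arr any type) as ((
select parse_time('%H:%M',val) time
from unnest(arr) val
where val != 'null'
order by time desc
limit 1
));
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
latest_time([contributor_time_1, contributor_time_2, contributor_time_3]),
minute) as time_diff_min
from `project.dataset.table`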
You can use greatest():
select t.*,
time_diff(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), minute) as diff_min
from t;
Note: this assumes that the values are never NULL and never the string 'null'; in BigQuery, GREATEST returns NULL if any argument is NULL.
I have a schema with the following fields:
Column                    | Type
--------------------------+--------
name | string
value1 | numeric
timestamp | bigint
The rows contain entries with a name, a numeric value, and a bigint value storing the Unix timestamp in nanoseconds. I am using TimescaleDB, and I would like to use time_bucket_gapfill() to retrieve the data. Given that the timestamps are stored as bigint, this is quite cumbersome.
I would like to get aggregated data for these intervals: 5 min, hour, day, week, month, quarter, year. I have managed to make it work using normal time_bucket(), but now I would like to fill the gaps as well. I am using the following query now:
SELECT COALESCE(COUNT(*), 0), COALESCE(SUM(value1), 0), time_bucket_gapfill('5 min', date_trunc('quarter', to_timestamp(timestamp/1000000000)), to_timestamp(1599100000), to_timestamp(1599300000)) AS bucket
FROM playground
WHERE name = 'test' AND timestamp >= 1599100000000000000 AND timestamp <= 1599300000000000000
GROUP BY bucket
ORDER BY bucket ASC
This returns the values correctly, but does not fill the empty spaces. If I modify my query to
time_bucket_gapfill('5 min',
date_trunc('quarter',
to_timestamp(timestamp/1000000000)),
to_timestamp(1599100000),
to_timestamp(1599200000))
I would get the first entry correctly and then empty rows every 5 minutes. How could I make it work? Thanks!
Here is a DB fiddle, but it doesn't work as the site doesn't support TimescaleDB. The query above returns the following:
coalesce | coalesce | bucket
---------+----------+------------------------
       3 |      300 | 2020-07-01 00:00:00+00
       0 |        0 | 2020-09-03 02:25:00+00
       0 |        0 | 2020-09-03 02:30:00+00
You should use data types in your time_bucket_gapfill() call that match the data types in your table. The following query should get you what you are looking for:
SELECT
COALESCE(count(*), 0),
COALESCE(SUM(value1), 0),
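-- 300E9 nanoseconds = 5 minutes, matching the bigint nanosecond timestamps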
time_bucket_gapfill(300E9::BIGINT, timestamp) AS bucket
FROM
t
WHERE
name = 'example'
AND timestamp >= 1599100000000000000
AND timestamp < 1599200000000000000
GROUP BY
bucket;
I have managed to solve it by building on Sven's answer. It first uses his function to fill the gaps, and then date_trunc() is called, eliminating the extra rows.
WITH gapfill AS (
SELECT
COALESCE(count(*), 0) as count,
COALESCE(SUM(value1), 0) as sum,
time_bucket_gapfill(300E9::BIGINT, timestamp) as bucket
FROM
playground
WHERE
name = 'test'
AND timestamp >= 1599100000000000000
AND timestamp < 1599300000000000000
GROUP BY
bucket
)
SELECT
SUM(count),
SUM(sum),
date_trunc('quarter', to_timestamp(bucket/1000000000)) as truncated
FROM
gapfill
GROUP BY truncated
ORDER BY truncated ASC
I'd like to use the generate_series function in Redshift, but have not been successful.
The Redshift documentation says it's not supported. The following code does work:
select *
from generate_series(1,10,1)
outputs:
1
2
3
...
10
I'd like to do the same with dates. I've tried a number of variations, including:
select *
from generate_series(date('2008-10-01'),date('2008-10-10 00:00:00'),1)
kicks out:
ERROR: function generate_series(date, date, integer) does not exist
Hint: No function matches the given name and argument types.
You may need to add explicit type casts. [SQL State=42883]
Also tried:
select *
from generate_series('2008-10-01 00:00:00'::timestamp,
'2008-10-10 00:00:00'::timestamp,'1 day')
And tried:
select *
from generate_series(cast('2008-10-01 00:00:00' as datetime),
cast('2008-10-10 00:00:00' as datetime),'1 day')
both kick out:
ERROR: function generate_series(timestamp without time zone, timestamp without time zone, "unknown") does not exist
Hint: No function matches the given name and argument types.
You may need to add explicit type casts. [SQL State=42883]
If not, it looks like I'll use this code from another post:
SELECT to_char(DATE '2008-01-01'
+ (interval '1 month' * generate_series(0,57)), 'YYYY-MM-DD') AS ym
PostgreSQL generate_series() with SQL function as arguments
Amazon Redshift seems to be based on PostgreSQL 8.0.2. The timestamp arguments to generate_series() were added in 8.4.
Something like this, which sidesteps that problem, might work in Redshift.
SELECT current_date + (n || ' days')::interval
from generate_series (1, 30) n
It works in PostgreSQL 8.3, which is the earliest version I can test. It's documented in 8.0.26.
Later . . .
It seems that generate_series() is unsupported in Redshift. But given that you've verified that select * from generate_series(1,10,1) does work, the syntax above at least gives you a fighting chance. (Although the interval data type is also documented as being unsupported on Redshift.)
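Since the interval type is iffy on Redshift, the same idea can also be expressed with Redshift's dateadd() instead (a sketch; I haven't run this on Redshift):
select dateadd(day, n, current_date)
from generate_series(1, 30) n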
Still later . . .
You could also create a table of integers.
create table integers (
n integer primary key
);
Populate it however you like. You might be able to use generate_series() locally, dump the table, and load it on Redshift. (I don't know; I don't use Redshift.)
Anyway, you can do simple date arithmetic with that table without referring directly to generate_series() or to interval data types.
select (current_date + n)
from integers
where n < 31;
That works in 8.3, at least.
Using Redshift today, you can generate a range of dates by using datetime functions and feeding in a number table.
select (getdate()::date - generate_series)::date from generate_series(1,30,1)
Generates this for me
date
2015-11-06
2015-11-05
2015-11-04
2015-11-03
2015-11-02
2015-11-01
2015-10-31
2015-10-30
2015-10-29
2015-10-28
2015-10-27
2015-10-26
2015-10-25
2015-10-24
2015-10-23
2015-10-22
2015-10-21
2015-10-20
2015-10-19
2015-10-18
2015-10-17
2015-10-16
2015-10-15
2015-10-14
2015-10-13
2015-10-12
2015-10-11
2015-10-10
2015-10-09
2015-10-08
The generate_series() function is not fully supported by Redshift. See the Unsupported PostgreSQL functions section of the developer guide.
UPDATE
generate_series is working with Redshift now.
SELECT CURRENT_DATE::TIMESTAMP - (i * interval '1 day') as date_datetime
FROM generate_series(1,31) i
ORDER BY 1
This will generate the previous 31 days of dates.
Ref: generate_series function in Amazon Redshift
As of writing this, generate_series() on our instance of Redshift (1.0.33426) could not be used to, for example, create a table:
# select generate_series(1,100,1);
1
2
...
# create table normal_series as select generate_series(1,100,1);
INFO: Function "generate_series(integer, integer, integer) not supported.
ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
However, WITH RECURSIVE works:
# create table recursive_series as with recursive t(n) as (select 1::integer union all select n+1 from t where n < 100) select n from t;
SELECT
-- modify as desired, here is a date series:
# select getdate()::date + n from recursive_series;
2021-12-18
2021-12-19
...
I needed to do something similar, but with 5-minute intervals over 7 days. So here's a CTE-based hack (ugly, but not too verbose):
INSERT INTO five_min_periods
WITH
periods AS (select 0 as num UNION select 1 as num UNION select 2 UNION select 3 UNION select 4 UNION select 5 UNION select 6 UNION select 7 UNION select 8 UNION select 9 UNION select 10 UNION select 11),
hours AS (select num from periods UNION ALL select num + 12 from periods),
days AS (select num from periods where num <= 6),
rightnow AS (select CAST( TO_CHAR(GETDATE(), 'yyyy-mm-dd hh24') || ':' || trim(TO_CHAR((ROUND((DATEPART (MINUTE, GETDATE()) / 5), 1) * 5 ),'09')) AS TIMESTAMP) as start)
select
ROW_NUMBER() OVER(ORDER BY d.num DESC, h.num DESC, p.num DESC) as idx
, DATEADD(minutes, -p.num * 5, DATEADD( hours, -h.num, DATEADD( days, -d.num, n.start ) ) ) AS period_date
from days d, hours h, periods p, rightnow n
Should be able to extend this to other generation schemes. The trick here is using the Cartesian product join (i.e. no JOIN/WHERE clause) to multiply the hand-crafted CTEs, producing the necessary increments to apply to an anchor date.
Redshift's generate_series() function is a leader-node-only function, and as such you cannot use it for downstream processing on the compute nodes. It can be replaced by a recursive CTE (or by keeping a "dates" table in your database). I have an example of such in a recent answer:
Cross join Redshift with sequence of dates
One caution I like to give in answers like this is to be careful with inequality joins (or cross joins, or any under-qualified joins) when working with VERY LARGE tables, which can happen often in Redshift. If you are joining with a moderate Redshift table of, say, 1M rows, then things will be fine. But if you are doing this on a table of 1B rows, the data explosion will likely cause massive performance issues as the query spills to disk.
I've written a couple of white papers on how to write this type of query in a data space sensitive way. This issue of massive intermediate results is not unique to Redshift and I first developed my approach solving a client's HIVE query issue. "First rule of writing SQL for Big Data - don't make more"
Per the comments of @Ryan Tuck and @Slobodan Pejic, generate_series() does not work on Redshift when joining to another table.
The workaround I used was to write out every value in the series in the query:
SELECT
'2019-01-01'::date AS date_month
UNION ALL
SELECT
'2019-02-01'::date AS date_month
Using a Python function like this:
import arrow
def generate_date_series(start, end):
start = arrow.get(start)
end = arrow.get(end)
months = list(
f"SELECT '{month.format('YYYY-MM-DD')}'::date AS date_month"
for month in arrow.Arrow.range('month', start, end)
)
return "\nUNION ALL\n".join(months)
Perhaps not as elegant as other solutions, but here's how I did it:
drop table if exists #dates;
create temporary table #dates as
with recursive cte(val_date) as
(select
cast('2020-07-01' as date) as val_date
union all
select
cast(dateadd(day, 1, val_date) as date) as val_date
from
cte
where
val_date <= getdate()
)
select
val_date as yyyymmdd
from
cte
order by
val_date
;
For five-minute buckets I would do the following:
select date_trunc('minute', getdate()) - (i || ' minutes')::interval
from generate_series(0, 60*5-1, 5) as i
You could replace 5 by any given interval, and 60 by the number of rows you want.
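The same pattern generalizes. For hourly rows covering the last 24 hours, for example (a sketch, with the same caveats about generate_series support):
select date_trunc('minute', getdate()) - (i || ' minutes')::interval
from generate_series(0, 24*60-1, 60) as i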
SELECT CURRENT_DATE::TIMESTAMP - (i * interval '1 day') as date_datetime
FROM generate_series(1,(select datediff(day,'01-Jan-2021',now()::date))) i
ORDER BY 1