Average timestamp in one column in BigQuery - sql

I need to find the average timestamp of the orders that came in:
Order_Date
2022-06-02 15:40:00 UTC
2022-06-07 11:01:00 UTC
2022-06-21 10:55:00 UTC
2022-06-23 14:44:00 UTC
Outcome:
the average Order_Date of the orders that came in

Just apply the AVG() average function over your entire table:
SELECT AVG(Order_Date) AS Avg_Order_Date
FROM yourTable;

Averaging timestamps is an unusual ask! But formally, you can do the following:
select
timestamp_seconds(cast(avg(unix_seconds(timestamp(Order_date))) as int64)) as average_Order_Date
from your_table
if applied to sample data in your question - output is
Note: Supported signatures for AVG: AVG(INT64); AVG(FLOAT64); AVG(NUMERIC); AVG(BIGNUMERIC); AVG(INTERVAL) - that is why you need all this back and forth "translations"
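The same round trip (timestamp to Unix seconds, average, back to timestamp) can be sketched outside BigQuery to check the arithmetic. This is only an illustration in Python; the function name average_timestamp is made up for this sketch, and the input is the sample data from the question.

```python
from datetime import datetime, timezone

def average_timestamp(stamps):
    """Average timezone-aware datetimes by going through whole Unix seconds."""
    seconds = [int(s.timestamp()) for s in stamps]
    # Round the mean to whole seconds before converting back to a timestamp.
    mean_seconds = round(sum(seconds) / len(seconds))
    return datetime.fromtimestamp(mean_seconds, tz=timezone.utc)

# Sample data from the question (all values are UTC).
orders = [
    datetime(2022, 6, 2, 15, 40, tzinfo=timezone.utc),
    datetime(2022, 6, 7, 11, 1, tzinfo=timezone.utc),
    datetime(2022, 6, 21, 10, 55, tzinfo=timezone.utc),
    datetime(2022, 6, 23, 14, 44, tzinfo=timezone.utc),
]

print(average_timestamp(orders))  # 2022-06-13 19:05:00+00:00
```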

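-- Alternative reading of the question: the average gap (in hours) between consecutive orders
WITH CTE as
(
SELECT Order_Date, LAG(Order_Date,1) OVER(ORDER BY Order_Date ASC) as Datelag
FROM your_table
),
CTE2 as
(
-- The sample values carry a UTC suffix, which suggests a TIMESTAMP column, hence timestamp_diff;
-- use datetime_diff instead if Order_Date is a DATETIME
SELECT Order_Date, timestamp_diff(Order_Date,Datelag,hour) as Datedif
FROM CTE
)
SELECT AVG(Datedif)
FROM CTE2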

Related

T-SQL, list of DATETIME, create <from> - <to> from it

What would I need to do to achieve the following? Somehow I can't seem to find a good solution.
I have a few CTEs, and the last one is producing just a list of DATETIME values, with a row number column, and those are ordered by the DATETIME.
For example
rn datetime
---------------------------
1 2023-01-07 01:00:00.000
2 2023-01-08 05:30:00.000
3 2023-01-08 08:00:00.000
4 2023-01-09 21:30:00.000
How do I have to join this CTE with each other in order to get the following result:
from to
---------------------------------------------------
2023-01-07 01:00:00.000 2023-01-08 05:30:00.000
2023-01-08 08:00:00.000 2023-01-09 21:30:00.000
Doing a regular inner join (with t1.rn = t2.rn - 1) gives me one row too much (the one from 05:30 to 08:00). So basically each date can only be "used" once.
Hope that makes sense... thanks!
What I tried: inner joining the CTE with itself, which didn't return the wanted result.
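You can pivot the outcome of your CTE and distribute rows into pairs using arithmetic. Note that modulo 2 would only split the rows into an odd set and an even set (rows 1 and 3 together, rows 2 and 4 together), which is not what you want here; integer division by 2 is what pairs consecutive rows.
Assuming that your CTE returns columns dt (a datetime field) and rn (an integer row number):
select min(dt) dt_from, max(dt) dt_to
from cte
group by ( rn - 1 ) / 2
In T-SQL, dividing two integers yields an integer, so ( rn - 1 ) / 2 maps rows 1 and 2 to 0, rows 3 and 4 to 1, and so on.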
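To see what the grouping key ( rn - 1 ) / 2 does, here is a small Python sketch of the same min/max-per-group logic, run on the sample rows from the question. The function name pair_rows is hypothetical and only used for this illustration.

```python
from datetime import datetime

def pair_rows(rows):
    """Group (rn, dt) rows into consecutive pairs keyed by (rn - 1) // 2."""
    groups = {}
    for rn, dt in rows:
        # Integer division: rows 1 and 2 share key 0, rows 3 and 4 share key 1, ...
        groups.setdefault((rn - 1) // 2, []).append(dt)
    # Equivalent of SELECT MIN(dt), MAX(dt) ... GROUP BY key
    return [(min(dts), max(dts)) for _, dts in sorted(groups.items())]

rows = [
    (1, datetime(2023, 1, 7, 1, 0)),
    (2, datetime(2023, 1, 8, 5, 30)),
    (3, datetime(2023, 1, 8, 8, 0)),
    (4, datetime(2023, 1, 9, 21, 30)),
]

for dt_from, dt_to in pair_rows(rows):
    print(dt_from, dt_to)
```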
You can avoid both the JOIN and the GROUP BY by using LAG to retrieve a previous column value in the result set. The server may be able to generate an execution plan that iterates over the data just once instead of joining or grouping :
with pairs as (
SELECT rn, lag(datetime) OVER(ORDER BY rn) as dt_from, datetime as dt_to
from another_cte
)
select dt_from,dt_to
from pairs
where rn%2=0
ORDER BY rn
The row number itself can be calculated from datetime :
with pairs as (
SELECT
ROW_NUMBER() OVER(ORDER BY datetime) AS rn,
lag(datetime) OVER(ORDER BY datetime) as dt_from,
datetime as dt_to
from another_cte
)
select dt_from,dt_to
from pairs
where rn%2=0
ORDER BY dt_to
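The LAG-based variant can be mimicked in Python as well: shift the ordered list by one position to get the "previous" value, number the rows from 1, and keep only the even-numbered rows. This is a rough sketch of the idea, not of how the server executes it; lag_pairs is a made-up name.

```python
from datetime import datetime

def lag_pairs(values):
    """Pair each even-numbered row with the value on the row before it."""
    ordered = sorted(values)
    lagged = [None] + ordered[:-1]  # what LAG(value) OVER (ORDER BY value) would return
    return [
        (prev, cur)
        for rn, (prev, cur) in enumerate(zip(lagged, ordered), start=1)
        if rn % 2 == 0  # same filter as "where rn%2=0"
    ]

values = [
    datetime(2023, 1, 8, 8, 0),
    datetime(2023, 1, 7, 1, 0),
    datetime(2023, 1, 9, 21, 30),
    datetime(2023, 1, 8, 5, 30),
]

for dt_from, dt_to in lag_pairs(values):
    print(dt_from, dt_to)
```

With an odd number of rows, the last value has no partner and is simply dropped by the even-row filter.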

Postgres SQL: select timestamp prior to max timestamp

I have a table in Postgres with timestamps:
timestamp
2022-01-01 00:52:53
2022-01-01 00:57:12
...
2022-02-13 11:00:31
2022-02-13 16:45:10
How can I select the timestamp closest to max timestamp? Meaning, I want the timestamp 2022-02-13 11:00:31.
I am looking for something like max(timestamp)-1, so I can do this on a recurring basis. Thank you
You can do:
select *
from (
select *,
rank() over(order by timestamp desc) as rk
from t
) x
where rk = 2
See running example at DB Fiddle.
I think the following query might meet your requirements:
SELECT MAX(date_col) FROM test WHERE date_col < (SELECT MAX(date_col) from test);
See DB Fiddle
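The subquery approach ("largest value strictly below the maximum") is easy to check with a few lines of Python. This sketch uses only the four timestamps visible in the question; second_latest is a hypothetical helper name.

```python
from datetime import datetime

def second_latest(stamps):
    """Return the largest timestamp strictly below the maximum, or None."""
    latest = max(stamps)
    earlier = [s for s in stamps if s < latest]
    return max(earlier) if earlier else None

stamps = [
    datetime(2022, 1, 1, 0, 52, 53),
    datetime(2022, 1, 1, 0, 57, 12),
    datetime(2022, 2, 13, 11, 0, 31),
    datetime(2022, 2, 13, 16, 45, 10),
]

print(second_latest(stamps))  # 2022-02-13 11:00:31
```

One difference between the two answers is worth keeping in mind: if the maximum timestamp occurs more than once, rank() assigns 1 to each of those rows and then skips to 3, so a filter on rk = 2 returns nothing, whereas the strict comparison still finds the next lower value.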

SQL to find sum of total days in a window for a series of changes

Following is the table:
start_date   recorded_date   id
2021-11-10   2021-11-01      1a
2021-11-08   2021-11-02      1a
2021-11-11   2021-11-03      1a
2021-11-10   2021-11-04      1a
2021-11-10   2021-11-05      1a
I need a query to find the total day changes in aggregate for a given id. In this case, it changed from 10th Nov to 8th Nov so 2 days, then again from 8th to 11th Nov so 3 days and again from 11th to 10th for a day, and finally from 10th to 10th, that is 0 days.
In total there is a change of 2+3+1+0 = 6 days for the id - '1a'.
Basically for each change there is a recorded_date, so we arrange that in ascending order and then calculate the aggregate change of days grouped by id. The final result should be like:
id   Agg_Change
1a   6
Is there a way to do this using SQL? I am using a Vertica database.
Thanks.
You can use the window function LEAD to get the difference between consecutive rows and then group by id:
select id, sum(daydiff) Agg_Change
from (
select id, abs(datediff(day, start_Date, lead(start_date,1,start_date) over (partition by id order by recorded_date))) as daydiff
from tablename
) t group by id
It's indeed the use of LAG() to get the previous date in an OLAP query, and an outer query getting the absolute date difference, and the sum of it, grouping by id:
WITH
-- your input - don't use in real query ...
indata(start_date,recorded_date,id) AS (
SELECT DATE '2021-11-10',DATE '2021-11-01','1a'
UNION ALL SELECT DATE '2021-11-08',DATE '2021-11-02','1a'
UNION ALL SELECT DATE '2021-11-11',DATE '2021-11-03','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-04','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-05','1a'
)
-- real query starts here, replace following comma with "WITH" ...
,
w_lag AS (
SELECT
id
, start_date
, LAG(start_date) OVER w AS prevdt
FROM indata
WINDOW w AS (PARTITION BY id ORDER BY recorded_date)
)
SELECT
id
, SUM(ABS(DATEDIFF(DAY,start_date,prevdt))) AS dtdiff
FROM w_lag
GROUP BY id
-- out id | dtdiff
-- out ----+--------
-- out 1a | 6
I thought the LAG function would give me the answer, but it kept returning the wrong result because my logic was wrong in one place. This is the query that gives me the answer I need:
with cte as(
select id, start_date, recorded_date,
row_number() over(partition by id order by recorded_date asc) as idrank,
lag(start_date,1) over(partition by id order by recorded_date asc) as prev
from table_temp
)
select id, sum(abs(date(start_date) - date(prev))) as Agg_Change
from cte
group by 1
If someone has a better solution please let me know.
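As a cross-check of the expected total, the "previous value per id, ordered by recorded_date" logic can be sketched in Python on the sample rows. The function name aggregate_change is invented for this illustration.

```python
from datetime import date

def aggregate_change(rows):
    """Sum absolute day differences between consecutive start_dates per id.

    rows: (start_date, recorded_date, id) tuples.
    """
    totals = {}
    previous = {}  # last start_date seen for each id, in recorded_date order
    for start, recorded, key in sorted(rows, key=lambda r: (r[2], r[1])):
        totals.setdefault(key, 0)
        if key in previous:
            totals[key] += abs((start - previous[key]).days)
        previous[key] = start
    return totals

rows = [
    (date(2021, 11, 10), date(2021, 11, 1), "1a"),
    (date(2021, 11, 8), date(2021, 11, 2), "1a"),
    (date(2021, 11, 11), date(2021, 11, 3), "1a"),
    (date(2021, 11, 10), date(2021, 11, 4), "1a"),
    (date(2021, 11, 10), date(2021, 11, 5), "1a"),
]

print(aggregate_change(rows))  # {'1a': 6}
```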

get count all with groupby timestamp into hourly intervals

I have a Hive table that has a timestamp in string format, as below:
20190516093836, 20190304125015, 20181115101358
I want to get the row count with the timestamp aggregated into hourly buckets, as below:
date_time count
-----------------------------
2019:05:16: 00:00:00 23
2019:05:16: 01:00:00 64
I followed several links like this but was unable to generate the desired results yet.
This is my final query:
SELECT
DATE_PART('day', b.date_time) AS date_prt,
DATE_PART('hour', b.date_time) AS hour_prt,
COUNT(*)
FROM
(SELECT
from_unixtime(unix_timestamp(`timestamp`, "yyyyMMddHHmmss")) AS date_time
FROM table_name
WHERE from_unixtime(unix_timestamp(`timestamp`, "yyyyMMddHHmmss"))
BETWEEN '2018-12-10 07:02:30' AND '2018-12-12 08:02:30') b
GROUP BY
date_prt, hour_prt
I hope for some guidance from you, thanks in advance
You can extract date_time already in required format 'yyyy-MM-dd HH:00:00'. I prefer using regexp_replace:
SELECT
date_time,
COUNT(*) as `count`
FROM
(SELECT
regexp_replace(`timestamp`, '^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$','$1-$2-$3 $4:00:00') AS date_time
FROM table_name
WHERE regexp_replace(`timestamp`, '^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$','$1-$2-$3 $4:$5:$6')
BETWEEN '2018-12-10 07:02:30' AND '2018-12-12 08:02:30') b
GROUP BY
date_time
This will also work:
from_unixtime(unix_timestamp('20190516093836', "yyyyMMddHHmmss"),'yyyy-MM-dd HH:00:00') AS date_time
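The bucketing itself (parse the yyyyMMddHHmmss string, zero out minutes and seconds, count per bucket) can be sketched in Python to check the expected shape of the result. The name hourly_counts is hypothetical, and the second value in the sample list is an invented extra row added so that one bucket has a count above 1.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(raw_stamps):
    """Count 'yyyyMMddHHmmss' strings per hour bucket ('yyyy-MM-dd HH:00:00')."""
    counts = Counter()
    for raw in raw_stamps:
        parsed = datetime.strptime(raw, "%Y%m%d%H%M%S")
        counts[parsed.strftime("%Y-%m-%d %H:00:00")] += 1
    return dict(sorted(counts.items()))

raw_stamps = [
    "20190516093836",
    "20190516095901",  # invented row, same hour as the one above
    "20190304125015",
    "20181115101358",
]

for bucket, count in hourly_counts(raw_stamps).items():
    print(bucket, count)
```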

Collapse multiple rows based on time values

I'm trying to collapse rows with consecutive timeline within the same day into one row but having an issue because of gap in time. For example, my dataset looks like this.
Date StartTime EndTime ID
2017-12-1 09:00:00 11:00:00 12345
2017-12-1 11:00:00 13:00:00 12345
2018-09-08 09:00:00 10:00:00 78465
2018-09-08 10:00:00 12:00:00 78465
2018-09-08 15:00:00 16:00:00 78465
2018-09-08 16:00:00 18:00:00 78465
As you can see, the first two rows can just be combined together without any issue because there's no time gap within that day. However, for the entries on 2018-09-08, there is a gap between 12:00 and 15:00. And I'd like to merge these four records into two different rows like this:
Date StartTime EndTime ID
2017-12-1 09:00:00 13:00:00 12345
2018-09-08 09:00:00 12:00:00 78465
2018-09-08 15:00:00 18:00:00 78465
In other words, I want to collapse the rows only when the time ranges are consecutive within the same day for the same ID.
Could anyone please help me with this? I tried to generate unique group using LAG and LEAD functions but it didn't work.
You can use a recursive CTE. Put a row in the same group as the previous row if the previous EndTime equals its StartTime, and then find the MIN() and MAX() per group:
with cte as
(
select rn = row_number() over (partition by [ID], [Date] order by [StartTime]),
*
from tbl
),
rcte as
(
-- anchor member
select rn, [ID], [Date], [StartTime], [EndTime], grp = 1
from cte
where rn = 1
union all
-- recursive member
select c.rn, c.[ID], c.[Date], c.[StartTime], c.[EndTime],
grp = case when r.[EndTime] = c.[StartTime]
then r.grp
else r.grp + 1
end
from rcte r
inner join cte c on r.[ID] = c.[ID]
and r.[Date] = c.[Date]
and r.rn = c.rn - 1
)
select [ID], [Date],
min([StartTime]) as StartTime,
max([EndTime]) as EndTime
from rcte
group by [ID], [Date], grp
db<>fiddle demo
Unless you have a particular objection to collapsing rows that are not consecutive in time but belong to the same ID and date, you can just use GROUP BY:
SELECT
Date,
StartTime = MIN(StartTime),
EndTime = MAX(EndTime),
ID
FROM table
GROUP BY ID, Date
Otherwise you can use a solution based on ROW_NUMBER:
SELECT
Date,
StartTime,
EndTime,
ID
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY Date, ID ORDER BY StartTime)
FROM table
) t
WHERE rn = 1
This is an example of a gaps-and-islands problem -- actually a pretty simple example. The idea is to assign an "island" grouping to each row specifying that they should be combined because they overlap. Then aggregate.
How do you assign the island? In this case, look at the previous endtime and if it is different from the starttime, then the row starts a new island. Voila! A cumulative sum of the start flag identifies each island.
As SQL:
select id, date, min(starttime), max(endtime)
from (select t.*,
sum(case when prev_endtime = starttime then 0 else 1 end) over (partition by id, date order by starttime) as grp
from (select t.*,
lag(endtime) over (partition by id, date order by starttime) as prev_endtime
from t
) t
) t
group by id, date, grp;
Here is a db<>fiddle.
Note: This assumes that the time periods never span multiple days. The code can be very easily modified to handle that . . . but with a caveat. The start and end times should be stored as datetime (or a related timestamp) rather than separating the date and times into different columns. Why? SQL Server doesn't support '24:00:00' as a valid time.
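The island idea from this answer can also be sketched as a single pass in Python: sort by id, date and start time, then either extend the current island (previous end equals this start) or open a new one. This is a simplified illustration on the sample data, with times kept as zero-padded strings; collapse is a made-up function name.

```python
def collapse(rows):
    """Merge rows whose start time equals the previous row's end time.

    rows: (date, start, end, id) tuples with zero-padded 'HH:MM:SS' strings.
    """
    merged = []
    for day, start, end, key in sorted(rows, key=lambda r: (r[3], r[0], r[1])):
        last = merged[-1] if merged else None
        if last and last[3] == key and last[0] == day and last[2] == start:
            # Same id and date, no gap: extend the current island.
            merged[-1] = (day, last[1], end, key)
        else:
            # First row, or a gap: start a new island.
            merged.append((day, start, end, key))
    return merged

rows = [
    ("2017-12-1", "09:00:00", "11:00:00", 12345),
    ("2017-12-1", "11:00:00", "13:00:00", 12345),
    ("2018-09-08", "09:00:00", "10:00:00", 78465),
    ("2018-09-08", "10:00:00", "12:00:00", 78465),
    ("2018-09-08", "15:00:00", "16:00:00", 78465),
    ("2018-09-08", "16:00:00", "18:00:00", 78465),
]

for row in collapse(rows):
    print(row)
```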