How to scale timestamps from one range to another in SQL?

I've got a table with timestamps of item changes:
ID: INT
item: INT - foreign key
created: TIMESTAMP
For a certain item, I need to squeeze the created timestamps into some range, i.e.:
src range: 1.10.2015 00:00 - 4.10.2015 23:59
dst range: 1.10.2015 00:00 - 1.10.2015 23:59
so a sample mapping of timestamps would look like:
2.10.2015 00:00 -> 1.10.2015 6:00
3.10.2015 00:00 -> 1.10.2015 12:00
4.10.2015 00:00 -> 1.10.2015 18:00
4.10.2015 12:00 -> 1.10.2015 21:00
I need to keep the year, month, day, hour, minute, and second parts of the resulting timestamp; seconds and minutes must not be rounded to zero.
Scaling precision doesn't matter much, but the order of the changes by created timestamp must remain the same as before scaling. I don't care much about other scaling details: whether the start/end times are inclusive or exclusive, or whether it scales to a range of 6:00 vs. 5:59 hours.
I could do it with an external application that transforms the values on its own and then updates the timestamps. However, I now need to do it using SQL only. Is that possible? The solution may be Postgres specific.
You may assume there won't be collisions after the scaling is applied.

something like this?
select id, item,
       -- range from each change up to just before the next one
       -- (lead() is null for the last row, giving an unbounded upper bound)
       tsrange(created, created_l - interval '1 microsecond', '[)') as some_range
from (
    select id, item, created,
           lead(created) over (order by created) as created_l
    from my_table
) q1
order by created;

with r (src, dst) as ( values
  (
    tsrange('2015-10-01', '2015-10-05', '[)'),
    tsrange('2015-10-01', '2015-10-02', '[)')
  ),
  (
    tsrange('2015-10-06', '2015-10-10', '[)'),
    tsrange('2015-10-02', '2015-10-03', '[)')
  )
), t (id, src) as ( values
  (1, '2015-10-02 00:00'::timestamp),
  (2, '2015-10-03 00:00'),
  (3, '2015-10-04 00:00'),
  (4, '2015-10-04 12:00'),
  (5, '2015-10-06 11:00'),
  (6, '2015-10-07 15:00')
)
select
  t.id, t.src,
  -- place each timestamp N seconds after the start of its destination range,
  -- where N is its rank (by original timestamp) within the source range
  lower(r.dst) + rank() over (
    partition by r.dst order by t.src
  ) * interval '1 second' as squeezed
from
  t
  inner join r on t.src <@ r.src  -- <@: element is contained in range
;
id | src | squeezed
----+---------------------+---------------------
1 | 2015-10-02 00:00:00 | 2015-10-01 00:00:01
2 | 2015-10-03 00:00:00 | 2015-10-01 00:00:02
3 | 2015-10-04 00:00:00 | 2015-10-01 00:00:03
4 | 2015-10-04 12:00:00 | 2015-10-01 00:00:04
5 | 2015-10-06 11:00:00 | 2015-10-02 00:00:01
6 | 2015-10-07 15:00:00 | 2015-10-02 00:00:02
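
The rank-based approach above keeps the original order but collapses the spacing to one second per row. If a proportional (linear) mapping is wanted, as the examples in the question suggest, the same range join can scale by epoch fractions. A minimal sketch, not part of the original answer, reusing the first (src, dst) pair:

with r (src, dst) as ( values
  ( tsrange('2015-10-01', '2015-10-05', '[)'),
    tsrange('2015-10-01', '2015-10-02', '[)') )
), t (id, src) as ( values
  (1, '2015-10-02 00:00'::timestamp),
  (4, '2015-10-04 12:00')
)
select
  t.id, t.src,
  -- linear map: dst_start + (src_ts - src_start) * (dst_len / src_len)
  lower(r.dst)
    + extract(epoch from t.src - lower(r.src))
      * ( extract(epoch from upper(r.dst) - lower(r.dst))
        / extract(epoch from upper(r.src) - lower(r.src)) )
      * interval '1 second' as scaled
from t
inner join r on t.src <@ r.src;
-- id 1 -> 2015-10-01 06:00, id 4 -> 2015-10-01 21:00 (the mappings from the question)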

Related

Finding total session time of a user in postgres

I am trying to create a query that will give me a column of total time logged in for each month for each user.
username | auth_event_type | time | credential_id
Joe | 1 | 2021-11-01 09:00:00 | 44
Joe | 2 | 2021-11-01 10:00:00 | 44
Jeff | 1 | 2021-11-01 11:00:00 | 45
Jeff | 2 | 2021-11-01 12:00:00 | 45
Joe | 1 | 2021-11-01 12:00:00 | 46
Joe | 2 | 2021-11-01 12:30:00 | 46
Joe | 1 | 2021-12-06 14:30:00 | 47
Joe | 2 | 2021-12-06 15:30:00 | 47
The auth_event_type column specifies whether the event was a login (1) or logout (2) and the credential_id indicates the session.
I'm trying to create a query that would have an output like this:
username | year_month | total_time
Joe | 2021-11 | 1:30
Jeff | 2021-11 | 1:00
Joe | 2021-12 | 1:00
How would I go about doing this in postgres? I am thinking it would involve a window function? If someone could point me in the right direction that would be great. Thank you.
Solution 1 partially working
Not sure that window functions will help you in your case, but aggregate functions will:
WITH list AS
(
  SELECT username
       , date_trunc('month', time) AS year_month
       , max(time) - min(time) AS session_duration
  FROM your_table
  GROUP BY username, date_trunc('month', time), credential_id
)
SELECT username
     , to_char(year_month, 'YYYY-MM') AS year_month
     , sum(session_duration) AS total_time
FROM list
GROUP BY username, year_month
The first part of the query aggregates the login/logout times for the same username and credential_id; the second part sums, per year_month, the differences between the login and logout times. This query works as long as the login time and logout time fall in the same month, but it fails when they don't.
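A minimal illustration of the failure mode, on hypothetical data (a single two-hour session crossing the month boundary):

-- date_trunc('month', time) splits the login/logout pair into two one-row groups,
-- so max(time) - min(time) is 00:00:00 in both months instead of 02:00:00 total.
SELECT date_trunc('month', time) AS year_month
     , max(time) - min(time) AS session_duration
FROM (VALUES ('2021-11-30 23:00'::timestamp)
           , ('2021-12-01 01:00')) AS v(time)
GROUP BY 1;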
Solution 2 fully working
In order to calculate the total_time per username and per month, whatever the login and logout times are, we can use a time-range approach which intersects the session ranges [login_time, logout_time) with the monthly ranges [month_start_time, month_end_time):
WITH monthly_range AS
(
  SELECT to_char(m.month_start_date, 'YYYY-MM') AS month
       , tsrange(m.month_start_date, m.month_start_date + interval '1 month') AS monthly_range
  FROM
  ( SELECT generate_series(min(date_trunc('month', time)), max(date_trunc('month', time)), '1 month') AS month_start_date
    FROM your_table
  ) AS m
), session_range AS
(
  SELECT username
       , tsrange(min(time), max(time)) AS session_range
  FROM your_table
  GROUP BY username, credential_id
)
SELECT s.username
     , m.month
     , sum(upper(p.period) - lower(p.period)) AS total_time
FROM monthly_range AS m
INNER JOIN session_range AS s
        ON s.session_range && m.monthly_range
CROSS JOIN LATERAL (SELECT s.session_range * m.monthly_range AS period) AS p
GROUP BY s.username, m.month
see the result in dbfiddle
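For reference, the two range operators doing the work here are && (overlaps) and * (intersection); a quick standalone check of their behavior:

-- tsrange overlap (&&) and intersection (*) on two literal ranges
SELECT r1 && r2 AS overlaps
     , r1 * r2 AS intersection
FROM (SELECT tsrange('2021-11-30 23:00', '2021-12-01 01:00') AS r1
           , tsrange('2021-12-01 00:00', '2022-01-01 00:00') AS r2) AS x;
-- overlaps = t, intersection = ["2021-12-01 00:00:00","2021-12-01 01:00:00")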
Use the window function lag(), partitioned by credential_id and ordered by time, e.g.
WITH j AS (
  SELECT username, time, age(time, LAG(time) OVER w)
  FROM t
  WINDOW w AS (PARTITION BY credential_id ORDER BY time
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
SELECT username, to_char(time, 'yyyy-mm'), sum(age)
FROM j
GROUP BY 1, 2;
Note: the frame ROWS BETWEEN 1 PRECEDING AND CURRENT ROW is optional in this case (lag() is not affected by the frame clause), but it is considered good practice to keep window definitions as explicit as possible, so that in the future you don't have to read the docs to figure out what your query is doing.
Demo: db<>fiddle

Summing counts based on overlapping intervals in postgres

I want to sum the count column over every two-minute interval (so it would be the sum of minutes 1 and 2, then 2 and 3, then 3 and 4, etc.), but I'm not exactly sure how to go about doing that.
My data looks something like:
minute | source | count
2018-01-01 10:00 | a | 7
2018-01-01 10:01 | a | 5
2018-01-01 10:02 | a | 10
2018-01-01 10:00 | b | 20
2018-01-01 10:05 | a | 12
What I want
(i.e. row1+row2, row2+row3, row3, row4, row5)
minute | source | count
2018-01-01 10:00 | a | 12
2018-01-01 10:01 | a | 15
2018-01-01 10:02 | a | 10
2018-01-01 10:00 | b | 20
2018-01-01 10:05 | a | 12
You can use a correlated subquery selecting the sum of the counts for the records in the interval that share the source (I assume matching source is a requirement; if not, just remove that comparison from the WHERE clause).
SELECT "t1"."minute",
"t1"."source",
(SELECT sum("t2"."count")
FROM "elbat" "t2"
WHERE "t2"."source" = "t1"."source"
AND "t2"."minute" >= "t1"."minute"
AND "t2"."minute" <= "t1"."minute" + INTERVAL '1 MINUTE') "count"
FROM "elbat" "t1";
SQL Fiddle
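As a side note (not part of the original answer): on PostgreSQL 11 or later, the same sliding two-minute sum can be written with a window frame using RANGE and an interval offset, which avoids the correlated subquery. A sketch, assuming the same table and column names:

-- sliding sum per source: this row's count plus any rows up to 1 minute later
SELECT "minute", "source",
       sum("count") OVER (PARTITION BY "source"
                          ORDER BY "minute"
                          RANGE BETWEEN CURRENT ROW
                                AND '1 minute' FOLLOWING) AS "count"
FROM "elbat";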
The post above assumes all the timestamps are to the minute. If you want to check every 2 minutes throughout the day, you can use the generate_series function. The issue with including both the beginning minute and the ending time in each interval is that source b ends up with 2 rows in the results.
ie.
select begintime,
       endtime,
       source,
       sum(count)
from mytable
inner join (
    select begintime, endtime
    from (
        select lag(time, 1) over (order by time) as begintime,
               time as endtime
        from (
            select *
            from generate_series('2018-01-01 00:00:00'::timestamp,
                                 '2018-01-02 00:00:00'::timestamp,
                                 interval '2 minutes') time
        ) q
    ) q2
    where begintime is not null
) times on minute between begintime and endtime
group by begintime, endtime, source
order by begintime, endtime, source
You can change the minute between begintime and endtime to minute > begintime and minute <= endtime if you don't want that overlap.

Postgres query for calendar

I am trying to write a query to retrieve data from an events table for a simple calendar app.
The table structure is as follows:
table name: events
Column | Type
---------+-----------
id | integer
start | timestamp
end | timestamp
the data inside of the table
id| start | end
--+---------------------+--------------------
1 | 2017-09-01 12:00:00 | 2017-09-01 12:00:00
2 | 2017-09-03 10:00:00 | 2017-09-03 12:00:00
3 | 2017-09-08 12:00:00 | 2017-09-11 12:00:00
4 | 2017-09-11 12:00:00 | 2017-09-11 12:00:00
the expected result is
date | event.id
-----------+---------
2017-09-01 | 1
2017-09-03 | 2
2017-09-08 | 3
2017-09-09 | 3
2017-09-10 | 3
2017-09-11 | 3
2017-09-11 | 4
As you can see, only days with an event (not just start and end, but also the days in between) are retrieved; days without an event are not retrieved at all.
In a second step I would like to be able to limit the number of distinct days, e.g. "get 4 days with events", which might be more than 4 rows.
Right now I am able to retrieve the events based on start date only using the following query:
SELECT start::date, id FROM events WHERE events.start::date >= '2017-09-01' LIMIT 3
Things I already thought about are DENSE_RANK and generate_series, but so far I haven't found a way to fill the gaps between start and end without also producing rows for days where there is no data.
So in short:
What I want to get is: the next X days where there is an event. A date with an event is a day where start <= date <= end.
Any ideas?
Edit
Thanks to Tim, I now have the following query (modified to use generate_series instead of a table, with a limit added using dense_rank):
SELECT date, id FROM (
    SELECT
        DENSE_RANK() OVER (ORDER BY t1.date) as rank,
        t1.date,
        events.id
    FROM
        generate_series([DATE]::date, [DATE]::date + interval '365 day', '1 day') as t1(date)
    INNER JOIN
        events
        ON t1.date BETWEEN events.start::date AND events."end"::date
) as t
WHERE rank <= [LIMIT]
This works really well, even though I am not 100% sure about the performance hit of this kind of limit.
I think you really need a calendar table here to cover the full range of dates in which your data may appear. In the first CTE below, I generate a table covering the month of September 2017. Then all we need to do is inner join this calendar table with the events table on the criteria of a given day appearing within a given range.
WITH cte AS (
SELECT CAST('2017-09-01' AS DATE) + (n || ' day')::INTERVAL AS date
FROM generate_series(0, 29) n
)
SELECT
t1.date,
t2.id
FROM cte t1
INNER JOIN events t2
ON t1.date BETWEEN CAST(t2.start AS DATE) AND CAST(t2.end AS DATE);
Output:
date id
1 01.09.2017 00:00:00 1
2 03.09.2017 00:00:00 2
3 08.09.2017 00:00:00 3
4 09.09.2017 00:00:00 3
5 10.09.2017 00:00:00 3
6 11.09.2017 00:00:00 3
7 11.09.2017 00:00:00 4
Demo here:
Rextester

In a table containing rows of date ranges, from each row, generate one row per day containing hours of utilization

Given a table with rows like:
+----+-------------------------+------------------------+
| ID | StartDate | EndDate |
+----+-------------------------+------------------------+
| 1 | 2016-02-05 20:00:00.000 | 2016-02-07 5:00:00.000 |
+----+-------------------------+------------------------+
I want to produce a table like this:
+----+------------+----------+
| ID | Date | Duration |
+----+------------+----------+
| 1 | 2016-02-05 | 4 |
| 1 | 2016-02-06 | 24 |
| 1 | 2016-02-07 | 5 |
+----+------------+----------+
This is an interview-style question. I am wondering how I can go about tackling this. Is it possible to do this with just standard SQL query syntax? Or is a procedural language like pl/pgSQL required to do a query like this?
The basic idea is this:
SELECT date_trunc('day', dayhour) as dd, count(*)
FROM (VALUES (1, '2016-02-05 20:00:00.000'::timestamp, '2016-02-07 5:00:00.000'::timestamp)
) v(ID, StartDate, EndDate), lateral
generate_series(StartDate, EndDate, interval '1 hour') g(dayhour)
GROUP BY dd
ORDER BY dd;
That adds an extra hour, so this is more accurate:
SELECT date_trunc('day', dayhour) as dd, count(*)
FROM (VALUES (1, '2016-02-05 20:00:00.000'::timestamp, '2016-02-07 5:00:00.000'::timestamp)
) v(ID, StartDate, EndDate), lateral
generate_series(StartDate, EndDate - interval '1 hour', interval '1 hour') g(dayhour)
GROUP BY dd
ORDER BY dd;
Technically, the lateral is not needed (and in that case, I would replace the comma with cross join). However, this is an example of a lateral join, so being explicit is good.
I should also note that the above is the simplest method. However, the group by does slow down the query. There are other methods that don't require generating a series for every hour.
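One such method, sketched here on the same sample data (an illustration, not the answer's original code): generate one row per day instead of per hour, and clamp the range endpoints to each day's boundaries:

-- one row per calendar day touched by the range; the duration is the part of
-- [StartDate, EndDate) falling inside that day, clamped with GREATEST/LEAST
SELECT v.ID,
       d::date AS Date,
       EXTRACT(epoch FROM least(v.EndDate, d + interval '1 day')
                        - greatest(v.StartDate, d)) / 3600 AS Duration
FROM (VALUES (1, '2016-02-05 20:00'::timestamp, '2016-02-07 05:00'::timestamp)
     ) v(ID, StartDate, EndDate)
CROSS JOIN LATERAL generate_series(date_trunc('day', v.StartDate),
                                   date_trunc('day', v.EndDate),
                                   interval '1 day') g(d)
ORDER BY d;
-- yields 4, 24 and 5 hours for the three days, as in the expected output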

Filling Out & Filtering Irregular Time Series Data

Using Postgresql 9.4, I am trying to craft a query on time series log data that logs new values whenever the value updates (not on a schedule). The log can update anywhere from several times a minute to once a day.
I need the query to accomplish the following:
Thin out overly dense data by selecting just the first entry in each bucket of the timestamp range
Fill in sparse data by carrying the last log value forward. For example, if I am grouping the data by hour and there was an entry at 8am with a log value of 10, and the next entry isn't until 11am with a log value of 15, I would want the query to return something like this:
Timestamp | Value
2015-07-01 08:00 | 10
2015-07-01 09:00 | 10
2015-07-01 10:00 | 10
2015-07-01 11:00 | 15
I have got a query that accomplishes the first of these goals:
with time_range as (
select hour
from generate_series('2015-07-01 00:00'::timestamp, '2015-07-02 00:00'::timestamp, '1 hour') as hour
),
ranked_logs as (
select
date_trunc('hour', time_stamp) as log_hour,
log_val,
rank() over (partition by date_trunc('hour', time_stamp) order by time_stamp asc)
from time_series
)
select
time_range.hour,
ranked_logs.log_val
from time_range
left outer join ranked_logs on ranked_logs.log_hour = time_range.hour and ranked_logs.rank = 1;
But I can't figure out how to fill in the nulls where there is no value. I tried using the lag() feature of Postgresql's Window functions, but it didn't work when there were multiple nulls in a row.
Here's a SQLFiddle that demonstrates the issue:
http://sqlfiddle.com/#!15/f4d13/5/0
The columns you want are log_hour and first_value:
with time_range as (
    select hour
    from generate_series('2015-07-01 00:00'::timestamp, '2015-07-02 00:00'::timestamp, '1 hour') as hour
),
ranked_logs as (
    select
        date_trunc('hour', time_stamp) as log_hour,
        log_val,
        rank() over (partition by date_trunc('hour', time_stamp) order by time_stamp asc)
    from time_series
),
base as (
    select
        time_range.hour as lh,
        ranked_logs.log_val
    from time_range
    left outer join ranked_logs on ranked_logs.log_hour = time_range.hour and ranked_logs.rank = 1
)
select
    log_hour, log_val, value_partition,
    first_value(log_val) over (partition by value_partition order by log_hour)
from (
    select
        date_trunc('hour', base.lh) as log_hour,
        log_val,
        sum(case when log_val is null then 0 else 1 end) over (order by base.lh) as value_partition
    from base
) as q
UPDATE
This is what your query returns:
Timestamp | Value
2015-07-01 01:00 | 10
2015-07-01 02:00 | null
2015-07-01 03:00 | null
2015-07-01 04:00 | 15
2015-07-01 05:00 | null
2015-07-01 06:00 | 19
2015-07-01 08:00 | 13
I want this result set to be split into groups like this:
2015-07-01 01:00 | 10
2015-07-01 02:00 | null
2015-07-01 03:00 | null
-----------------------
2015-07-01 04:00 | 15
2015-07-01 05:00 | null
-----------------------
2015-07-01 06:00 | 19
-----------------------
2015-07-01 08:00 | 13
and to assign to every row in a group the value of the first row of that group (done by the last select).
In this case, a way to obtain the grouping is to create a column which holds the count of not-null values seen up to the current row, and to split by this value (the sum(case) expression):
value | sum(case)
------+----------
 10   | 1
 null | 1
 null | 1
 15   | 2   <-- new not-null value, increment
 null | 2
 19   | 3   <-- new not-null value, increment
 13   | 4   <-- new not-null value, increment
and now I can partition by sum(case).
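
For reference, here is the same carry-forward trick as a self-contained query on hypothetical inline data; count(val) skips NULLs, so it is equivalent to the sum(case) above:

SELECT ts, val,
       first_value(val) OVER (PARTITION BY grp ORDER BY ts) AS filled
FROM (
    SELECT ts, val,
           count(val) OVER (ORDER BY ts) AS grp  -- running count of not-null values
    FROM (VALUES ('2015-07-01 01:00'::timestamp, 10),
                 ('2015-07-01 02:00', NULL),
                 ('2015-07-01 03:00', NULL),
                 ('2015-07-01 04:00', 15)) AS v(ts, val)
) s
ORDER BY ts;
-- rows 2 and 3 get filled = 10; row 4 starts a new group with filled = 15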