PostgreSQL - Select split rows based on a column value - sql

Could someone please suggest a query which splits items by working minutes per hour?
Source table:

start_timestamp  | item_id | total_working_minutes
-----------------+---------+----------------------
2021-02-01 14:10 | A       | 120
2021-02-01 14:30 | B       | 20
2021-02-01 16:30 | A       | 10
Expected result:

timestamp_by_hour | item_id | working_minutes
------------------+---------+----------------
2021-02-01 14:00  | A       | 50
2021-02-01 14:00  | B       | 20
2021-02-01 15:00  | A       | 60
2021-02-01 16:00  | A       | 20
Thanks in advance!

You can accomplish this with a recursive query, which works in both Redshift and PostgreSQL. First, for each source row extract:
- the truncated hour, and the number of minutes worked during that first hour
- the total minutes worked
Then repeat by recursion for each row where the minutes worked in the current hour are less than the total minutes worked: increase the starting hour by 1 and reduce the total minutes worked by the minutes worked in the preceding hour.
Finally, aggregate the results by hour and item ID.
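For reference, a minimal setup for the work_time table the query below assumes (the table and column names come from the query; the types are an assumption):

create table work_time (
  start_timestamp timestamp,
  item_id text,
  total_working_minutes int
);

insert into work_time values
  ('2021-02-01 14:10', 'A', 120),
  ('2021-02-01 14:30', 'B', 20),
  ('2021-02-01 16:30', 'A', 10);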
with recursive
split_times(timestamp_by_hour, item_id, working_minutes, total_working_minutes) as
(
  -- anchor: minutes worked during the starting hour
  select
    date_trunc('hour', start_timestamp),
    item_id,
    least(total_working_minutes, 60 - extract(minute from start_timestamp)),
    total_working_minutes
  from work_time
  union all
  -- recursion: move to the next hour with the remaining minutes
  select
    timestamp_by_hour + interval '1 hour',
    item_id,
    least(total_working_minutes - working_minutes, 60),
    total_working_minutes - working_minutes
  from split_times
  where total_working_minutes > working_minutes
)
select timestamp_by_hour, item_id, sum(working_minutes) as working_minutes
from split_times
group by timestamp_by_hour, item_id
order by timestamp_by_hour, item_id;
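To see how the recursion unrolls, trace item A's first source row (14:10, 120 minutes): the anchor member emits (2021-02-01 14:00, 50, 120), the first recursive step emits (15:00, 60, 70), and the second emits (16:00, 10, 10), where the recursion stops. A's second source row (16:30, 10 minutes) emits only (16:00, 10, 10), so the final aggregation returns 20 minutes for A at 16:00, matching the expected result.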

Related

SQL - Fuzzy JOIN on Timestamp columns within X amount of time

Say I have two tables:
a:

timestamp               | precipitation
------------------------+--------------
2015-08-03 21:00:00 UTC | 3
2015-08-03 22:00:00 UTC | 3
2015-08-04 3:00:00 UTC  | 4
2016-02-04 18:00:00 UTC | 4
and b:

timestamp               | loc
------------------------+--------------
2015-08-03 21:23:00 UTC | San Francisco
2016-02-04 16:04:00 UTC | New York
I want to join these to get a table with fuzzy-joined entries, where every row in b tries to get joined to a row in a. Criteria:
- The timestamps are within 60 minutes of each other. If no match exists within 60 minutes, do not include that row in the output.
- In the case of a tie, where some row in b could join onto two rows in a, pick the closest one in terms of time.
Example output:

timestamp               | loc           | precipitation
------------------------+---------------+--------------
2015-08-03 21:00:00 UTC | San Francisco | 3
What you need is an ASOF join. I don't think there is an easy way to do this in BigQuery. Other databases like Kinetica (and, I think, ClickHouse) support ASOF functions that can be used to perform 'fuzzy' joins.
The syntax for Kinetica would be something like the following:
SELECT *
FROM a
LEFT JOIN b
ON ASOF(a.timestamp, b.timestamp, INTERVAL '0' MINUTES, INTERVAL '60' MINUTES, MIN)
The ASOF function above sets up a 60-minute interval within which to look for matches in the right-side table. When there are multiple matches, MIN selects the closest one (MAX would pick the one that is farthest away).
As per my understanding, and based on the data you provided, the query below should work for your use case in BigQuery:
create temporary table a as (
  select TIMESTAMP('2015-08-03 21:00:00 UTC') as ts, 3 as precipitation union all
  select TIMESTAMP('2015-08-03 22:00:00 UTC'), 3 union all
  select TIMESTAMP('2015-08-04 3:00:00 UTC'), 4 union all
  select TIMESTAMP('2016-02-04 18:00:00 UTC'), 4
);
create temporary table b as (
  select TIMESTAMP('2015-08-03 21:23:00 UTC') as ts, 'San Francisco' as loc union all
  select TIMESTAMP('2016-02-04 16:04:00 UTC') as ts, 'New York' as loc
);
select b_ts, a_ts, loc, precipitation, diff_time_sec
from (
  select b.ts as b_ts, a.ts as a_ts,
         ABS(TIMESTAMP_DIFF(b.ts, a.ts, SECOND)) as diff_time_sec,
         b.loc, a.precipitation  -- select columns explicitly to avoid duplicate "ts" names
  from b
  inner join a
    on b.ts between timestamp_sub(a.ts, interval 60 MINUTE)
                and timestamp_add(a.ts, interval 60 MINUTE)
)
where true  -- BigQuery requires a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
qualify RANK() OVER (partition by b_ts ORDER BY diff_time_sec) = 1
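With this data, only the San Francisco row survives: 21:23 is 23 minutes from the 21:00 reading, while New York's 16:04 is 116 minutes from the nearest reading (18:00), outside the 60-minute window, so the inner join drops it.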

Combine rows by consecutive timestamp

I have an input table as below:
name | time                    | price
-----+-------------------------+------
one  | 2022-11-22 19:00:00 UTC | 12
one  | 2022-11-23 7:00:00 UTC  | 24
one  | 2022-11-23 19:00:00 UTC | 10
one  | 2022-11-24 7:00:00 UTC  | 20
My expected output is:

name | time       | price
-----+------------+------
one  | 2022-11-22 | 36
one  | 2022-11-23 | 30
Explanation:
- I have to group every 2 consecutive timestamps (the previous date's 19:00:00 UTC row and the next date's 7:00:00 UTC row) and label the group with the previous date.
- Sum the price for each pair of consecutive rows.
Approach:
As I understand it, I have to use partition by on the time column, but I cannot figure out how to combine it with exact timestamps.
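For reference, a minimal setup for the consec_data table the query below assumes (the name comes from the query; the types are an assumption):

create table consec_data (name text, time timestamptz, price int);

insert into consec_data values
  ('one', '2022-11-22 19:00:00 UTC', 12),
  ('one', '2022-11-23 07:00:00 UTC', 24),
  ('one', '2022-11-23 19:00:00 UTC', 10),
  ('one', '2022-11-24 07:00:00 UTC', 20);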
with cte as (
  select name,
         time,
         price,
         -- integer division pairs rows 1&2, 3&4, ... ("div" is MySQL; PostgreSQL uses "/")
         (row_number() over (partition by name order by time) + 1) / 2 as group_no
  from consec_data)
select name,
       min(time)::date as time,  -- the earlier (19:00) timestamp, truncated to its date
       sum(price) as price
from cte
group by name, group_no;
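To see why this pairs consecutive rows: row_number() assigns 1, 2, 3, 4 in time order, and (rn + 1) / 2 with integer division yields 1, 1, 2, 2, so each pair of consecutive rows shares a group_no, and min(time) picks the earlier (19:00) timestamp as the group label.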

How do I find all records with a timestamp more than 120 days ago

I need a query to find all records with a timestamp more than 120 days old, something like:
select * from table where timestamp > 120 days
How would I compare the timestamp to 120 days? I've only compared dates, which seems to be a lot easier.
If your table name is "table_name" and the timestamp column is "date_col", then you can use the below query (MySQL syntax):
select * from table_name where date_col > DATE_SUB(CURDATE(), INTERVAL 120 DAY);
I suspect you want more than 120 days AGO. That would be:
where timestamp < current_date - interval 120 day
If you want the most recent 120 days' worth of data, reverse the comparison.
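In PostgreSQL, the equivalent with the same hypothetical names would be:

select * from table_name
where date_col < current_date - interval '120 days';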
If you are working with Db2, and if I got you correctly, you want rows older than 120 days. The syntax would be:
select * from table where timestamp < (current date - 120 days)
Sample:
DDATE
----------
2020-12-01
2020-11-01
2020-10-01
2020-09-01
4 record(s) selected.
the query would return only:
DDATE
----------
2020-09-01
1 record(s) selected.

postgresql query to get counts between 12:00 and 12:00

I have the following query that works fine, but it is giving me counts for a single, whole day (00:00 to 23:59 UTC). For example, it's giving me counts for all of January 1 2017 (00:00 to 23:59 UTC).
My dataset lends itself to be queried from 12:00 UTC to 12:00 UTC. For example, I'm looking for all counts from Jan 1 2017 12:00 UTC to Jan 2 2017 12:00 UTC.
Here is my query:
SELECT count(DISTINCT ltg_data.lat), cwa, to_char(time, 'MM/DD/YYYY')
FROM counties
JOIN ltg_data ON ST_Contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR'
  AND time BETWEEN '1987-06-01' AND '1992-08-01'
GROUP BY cwa, to_char(time, 'MM/DD/YYYY');
FYI, I'm changing the format of the time so I can use the results more readily in JavaScript.
A description of the dataset: thousands of points occur within various polygons every second. I'm determining whether the points occur within the polygon "cwa = MFR" and then counting them.
Thanks for any help!
I see two approaches here.
First, join against generate_series(start_date::timestamp, end_date, '12 hours'::interval) and count rows per generated bucket; a sketch follows. This would be more correct, I believe, but it has a major minus: you have to lateral join it against the existing data set to use min(time) and max(time)...
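A minimal sketch of that first approach, assuming the ltg_data table and time column from the question and one 24-hour bucket per day starting at 12:00 UTC:

SELECT bucket_start, count(DISTINCT ltg_data.lat)
FROM generate_series('1987-06-01 12:00'::timestamptz,
                     '1992-08-01 12:00'::timestamptz,
                     '24 hours'::interval) AS bucket_start
JOIN ltg_data ON time >= bucket_start
             AND time <  bucket_start + interval '24 hours'
GROUP BY bucket_start
ORDER BY bucket_start;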
Second, a monkey hack, but much less coding and less querying: use a different time zone to make 12:00 the start of the day, e.g. (you did not give sample data, so I generate the content of counties with generate_series at a 2-hour interval):
t=# with counties as (
  select generate_series('2017-09-01'::timestamptz, '2017-09-04'::timestamptz, '2 hours'::interval) g
)
select count(1), to_char(g, 'MM/DD/YYYY') from counties
group by to_char(g, 'MM/DD/YYYY')
order by 2;
 count |  to_char
-------+------------
    12 | 09/01/2017
    12 | 09/02/2017
    12 | 09/03/2017
     1 | 09/04/2017
(4 rows)
So for the UTC time zone there are 12 two-hour rows for each full day above and, due to the inclusive upper bound of generate_series in my sample, 1 row for the last day: 37 rows in total.
Now a monkey hack:
t=# with counties as (
  select generate_series('2017-09-01'::timestamptz, '2017-09-04'::timestamptz, '2 hours'::interval) g
)
select count(1), to_char(g at time zone 'utc+12', 'MM/DD/YYYY') from counties
group by to_char(g at time zone 'utc+12', 'MM/DD/YYYY')
order by 2;
 count |  to_char
-------+------------
     6 | 08/31/2017
    12 | 09/01/2017
    12 | 09/02/2017
     7 | 09/03/2017
(4 rows)
I select the same dates in a different time zone, shifted by exactly 12 hours, so the first day starts at midday on 31 Aug instead of midnight on 1 Sep. The counts change, still totalling 37 rows, but grouped the way you requested. (Note that POSIX-style zone names reverse the sign: 'utc+12' is 12 hours behind UTC, which is what moves the day boundary to the previous midday.)
Update: for your query I'd try something like:
SELECT count(DISTINCT ltg_data.lat), cwa, to_char(time at time zone 'utc+12', 'MM/DD/YYYY')
FROM counties
JOIN ltg_data ON ST_Contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR'
  AND time BETWEEN '1987-06-01' AND '1992-08-01'
GROUP BY cwa, to_char(time at time zone 'utc+12', 'MM/DD/YYYY');
Also, if you want to apply the +12 hours logic to the WHERE clause, add at time zone 'utc+12' to the "time" comparison as well.
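For instance, a sketch against the query above (dates unchanged):

WHERE cwa = 'MFR'
  AND time at time zone 'utc+12' BETWEEN '1987-06-01' AND '1992-08-01'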

postgresql group by sequential time stamps

I'm about two weeks old in SQL years, so if you could humor me a little it would be very helpful.
I'm having trouble figuring out how to group by a series of sequential timestamps (hour steps in this case).
For example:
ID | time
---+--------------------
 1 | 2008-11-11 01:00:00
 2 | 2008-11-11 02:00:00
 3 | 2008-11-11 04:00:00
 4 | 2008-11-11 05:00:00
 5 | 2008-11-11 06:00:00
 6 | 2008-11-11 08:00:00
I'd like to end up with a grouping like so:

Group | above_table_IDs
------+----------------
    1 | 1,2
    2 | 3,4,5
    3 | 6
This would be easy to express in a Python loop or something, but I really don't understand how to express this type of logic in SQL/PostgreSQL.
If anyone could help explain this process to me it would be greatly appreciated.
Thank you!
You can do this by subtracting an increasing number of hours from the timestamps. Rows that are sequential will end up with the same value.
select row_number() over (order by grp) as GroupId,
       string_agg(id::text, ',' order by id) as ids  -- cast needed: string_agg takes text in PostgreSQL
from (select t.*,
             -- consecutive hourly rows collapse to the same grp value
             (time - row_number() over (order by time) * interval '1 hour') as grp
      from my_table t  -- "my_table" stands in for your table name ("table" is a reserved word)
     ) t
group by grp;
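Tracing the sample data: subtracting row_number() hours gives 01:00 - 1h = 00:00, 02:00 - 2h = 00:00, 04:00 - 3h = 01:00, 05:00 - 4h = 01:00, 06:00 - 5h = 01:00, and 08:00 - 6h = 02:00. The three distinct grp values (00:00, 01:00, 02:00) produce exactly the groups 1,2 / 3,4,5 / 6 from the expected output.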