Hive: group records by time interval

Trying to figure out a way to group records by time interval in a Hive table. Time is in hh:mm:ss string format. I want to group by the date column and count occurrences per one-hour interval.
Below is the sample table:
txn_dt     | txn_tm   | txn_amt
2022-05-01 | 00:00:15 | $50
2022-05-01 | 02:00:05 | $150
2022-05-01 | 02:00:15 | $510
Output should be:
txn_dt     | interval | txn_count
2022-05-01 | 00:00:00 | 1
2022-05-01 | 01:00:00 | 0
2022-05-01 | 02:00:00 | 2
I am using this query, but no luck:
select txn_dt, (unix_timestamp(txn_tm, 'HH:MM:SS')/3600), COUNT(*) as txn_count FROM TABLE
GROUP BY txn_dt , FLOOR(UNIX_TIMESTAMP(txn_tm,'HH:mm:ss')/3600)
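For reference, a minimal sketch of a query that should produce the counted intervals (txn_table is a placeholder name, not from the question). Since txn_tm is already an hh:mm:ss string, the hour can be taken directly with substr, which sidesteps the format-pattern pitfall in the attempt above: 'HH:MM:SS' is wrong (MM means months and SS is not a valid pattern letter), and the SELECT expression does not match the GROUP BY expression.
-- a sketch, assuming the table is named txn_table
SELECT txn_dt,
       concat(substr(txn_tm, 1, 2), ':00:00') AS hour_interval,  -- e.g. '02:00:00'
       count(*) AS txn_count
FROM txn_table
GROUP BY txn_dt, concat(substr(txn_tm, 1, 2), ':00:00');
Note this only returns hours that have at least one transaction; producing the 0-count row for 01:00:00 would additionally require joining against a generated list of all 24 hours.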

Related

PostgreSQL - Select split rows based on a column value

Could someone please suggest a query which splits items by working minutes per hour?
Source table
start_timestamp  | item_id | total_working_minutes
2021-02-01 14:10 | A       | 120
2021-02-01 14:30 | B       | 20
2021-02-01 16:30 | A       | 10
Expected result
timestamp_by_hour | item_id | working_minutes
2021-02-01 14:00  | A       | 50
2021-02-01 14:00  | B       | 20
2021-02-01 15:00  | A       | 60
2021-02-01 16:00  | A       | 20
Thanks in advance!
You can accomplish this using a recursive query, which works in both Redshift and PostgreSQL. First, for each row extract:
- the hour, and the number of minutes worked during that first hour
- the total minutes worked
Then recurse on each row where the minutes worked so far are less than the total minutes worked: in the recursion, increase the hour by 1 and reduce the remaining total by the minutes worked in the preceding hour.
Finally, aggregate the results by hour and ID.
with recursive
split_times(timestamp_by_hour, item_id, working_minutes, total_working_minutes) as
(
    -- anchor: the starting hour and the minutes worked within it
    select
        date_trunc('hour', start_timestamp),
        item_id,
        least(total_working_minutes, 60 - extract(minutes from start_timestamp)),
        total_working_minutes
    from work_time
    union all
    -- recursive step: move to the next hour with the minutes still remaining
    select
        timestamp_by_hour + interval '1 hour',
        item_id,
        least(total_working_minutes - working_minutes, 60),
        total_working_minutes - working_minutes
    from split_times
    where total_working_minutes > working_minutes
)
select timestamp_by_hour, item_id, sum(working_minutes) working_minutes
from split_times
group by timestamp_by_hour, item_id
order by timestamp_by_hour, item_id;

How to calculate total worktime per week [SQL]

I have a table of EMPLOYEES that contains information about the DATE and WORKTIME for each day. E.g.:
ID | DATE | WORKTIME |
----------------------------------------
1 | 1-Sep-2014 | 4 |
2 | 2-Sep-2014 | 6 |
1 | 3-Sep-2014 | 5.5 |
1 | 4-Sep-2014 | 7 |
2 | 4-Sep-2014 | 4 |
1 | 9-Sep-2014 | 8 |
and so on.
Question: How can I create a query that would allow me to calculate the amount of time worked per week (HOURS_PERWEEK)? I understand that I need a summation of WORKTIME grouped by both ID and week, but so far my attempts as well as googling didn't yield any results. Any ideas on this? Thank you in advance!
Edit:
Got a solution:
select id, sum (worktime), trunc(date, 'IW') week
from employees
group by id, TRUNC(date, 'IW');
But I will need to somehow connect that output back to the DATE column by populating a newly created column such as WEEKLY_TIME. Any hints on that?
You can find the start of the ISO week, which will always be a Monday, using TRUNC("DATE", 'IW').
So if, in the query, you GROUP BY the id and the start of the week TRUNC("DATE", 'IW'), then you can SELECT the id and aggregate to find the SUM of the WORKTIME column for each id.
Since this appears to be a homework question and you haven't attempted a query, I'll leave it at this to point you in the correct direction and you can complete the query.
Update
Now I need to create another column (let's call it WEEKLY_TIME) and populate it with values from the current output, so that Sep 1, 3, 4 (for ID=1) would all contain the value 16.5, specifying that on each of those days (that is, within that week) the person worked 16.5 hours in total. For ID=2 it would then be the value 10 for both Sep 2 and 4.
For this, if I understand correctly, you appear not to want an aggregate function but rather the analytic version of SUM:
select id,
       "DATE",
       trunc("DATE", 'IW') week,
       worktime,
       sum(worktime) over (partition by id, trunc("DATE", 'IW')) as weekly_time
from employees;
Which, for the sample data:
CREATE TABLE employees (ID, "DATE", WORKTIME) AS
SELECT 1, DATE '2014-09-01', 4 FROM DUAL UNION ALL
SELECT 2, DATE '2014-09-02', 6 FROM DUAL UNION ALL
SELECT 1, DATE '2014-09-03', 5.5 FROM DUAL UNION ALL
SELECT 1, DATE '2014-09-04', 7 FROM DUAL UNION ALL
SELECT 2, DATE '2014-09-04', 4 FROM DUAL UNION ALL
SELECT 1, DATE '2014-09-09', 8 FROM DUAL;
Outputs:
ID | DATE                | WEEK                | WORKTIME | WEEKLY_TIME
1  | 2014-09-01 00:00:00 | 2014-09-01 00:00:00 | 4        | 16.5
1  | 2014-09-03 00:00:00 | 2014-09-01 00:00:00 | 5.5      | 16.5
1  | 2014-09-04 00:00:00 | 2014-09-01 00:00:00 | 7        | 16.5
1  | 2014-09-09 00:00:00 | 2014-09-08 00:00:00 | 8        | 8
2  | 2014-09-04 00:00:00 | 2014-09-01 00:00:00 | 4        | 10
2  | 2014-09-02 00:00:00 | 2014-09-01 00:00:00 | 6        | 10
Edit: this answer was submitted without noticing the "Oracle" tag. Otherwise, the question is answered here: Oracle SQL - Sum and group data by week
Select employee_Id,
       DATEPART(week, workday) as [Week],
       sum(worktime) as [Weekly Hours]
from WORK
group by employee_id, DATEPART(week, workday)
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=238b229156a383fa3c466b6c3c2dee1e

PGSQL query to get a list of sequential dates from today

I have a calendar table to which I have added the list of dates on which no action should be performed.
The table is as follows and the date format is YYYY-MM-DD
date
2021-01-01
2021-04-05
2021-04-06
2021-04-07
2021-08-10
2021-11-22
2021-11-23
2021-11-24
2021-12-25
2021-12-31
Considering today is 2021-11-24.
The expected output is
date
2021-11-24
2021-11-23
2021-11-22
And considering today is 2021-12-25,
the expected output is
date
2021-12-25
And considering today is 2021-12-27,
the output should contain no data.
date
It should return the sequence of consecutive dates ending with today's date, in descending order, with no break in the sequence.
I searched various posts and found some related to my question, but the queries were a little complex, with nested subqueries. Is there a way to achieve this output in a more optimized way? I am new to pgsql.
Create example table:
CREATE TABLE calendar (d date);
INSERT INTO calendar VALUES ('2021-11-23'),('2021-11-20');
Query:
SELECT * FROM
(SELECT CURRENT_DATE - '1 day'::interval * generate_series(0,10) AS d) a
LEFT JOIN calendar c ON (c.d=a.d);
a.d | c.d
---------------------+------------
2021-11-14 00:00:00 | Null
2021-11-15 00:00:00 | Null
2021-11-16 00:00:00 | Null
2021-11-17 00:00:00 | Null
2021-11-18 00:00:00 | Null
2021-11-19 00:00:00 | Null
2021-11-20 00:00:00 | 2021-11-20
2021-11-21 00:00:00 | Null
2021-11-22 00:00:00 | Null
2021-11-23 00:00:00 | 2021-11-23
2021-11-24 00:00:00 | Null
Subquery "a" generates a date series, and then we join it to the table.
You can add conditions , for example "WHERE calendar.d IS NULL", or "IS NOT NULL" depending on the filtering you want.
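To get the unbroken run of dates ending at today specifically, here is a hedged sketch building on the same generate_series idea (assuming the calendar table created above): a date qualifies only if every day between it and today is also present in the table.
-- a sketch, assuming the calendar(d date) table above
SELECT c.d
FROM calendar c
WHERE c.d <= CURRENT_DATE
  AND NOT EXISTS (
        SELECT 1
        FROM generate_series(0, CURRENT_DATE - c.d) AS g(n)
        LEFT JOIN calendar c2 ON c2.d = CURRENT_DATE - g.n
        WHERE c2.d IS NULL)   -- some day in the run is missing
ORDER BY c.d DESC;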
You can simply filter by a date range, building it by subtracting 2 days from today:
select "date"
from maintenance_dates_70099898
where "date" <= now()::date --you want to see today and 2 days prior; Last 3 days total
and "date" >= now()::date - '2 days'::interval
order by 1 desc;
With a runnable test:
drop table if exists maintenance_dates_70099898;
create table maintenance_dates_70099898 ("date" date);
insert into maintenance_dates_70099898
("date")
values
('2021-01-01'),
('2021-04-05'),
('2021-04-06'),
('2021-04-07'),
('2021-08-10'),
('2021-11-22'),
('2021-11-23'),
('2021-11-24'),
('2021-12-25'),
('2021-12-31');
select "date"
from maintenance_dates_70099898
where "date" <= now()::date --you want to see today and 2 days prior; Last 3 days total
and "date" >= now()::date - '2 days'::interval
order by 1 desc;
-- date
--------------
-- 2021-11-24
-- 2021-11-23
-- 2021-11-22
--(3 rows)
select "date"
from maintenance_dates_70099898
where "date" >= '2021-12-25'::date - '2 days'::interval
and "date" <= '2021-12-25'::date
order by 1 desc;
-- date
--------------
-- 2021-12-25
--(1 row)
I assume that for 2021-12-27 you do want to see 2021-12-25, as it's within the 3 day range prior.
select "date"
from maintenance_dates_70099898
where "date" >= '2021-12-28'::date - '2 days'::interval
and "date" <= '2021-12-28'::date
order by 1 desc;
-- date
--------
--(0 rows)
The main issue is that the number of days is not known in advance, which rules out a simple range check. However, a RECURSIVE CTE comes to the rescue: it plucks off each previous date that is exactly one day before the last one, and terminates when that no longer holds.
with recursive no_action(no_act_dt) as
( select no_act_dt
from no_action_calendar
where no_act_dt = :parm_date::date
union all
select c.no_act_dt
from no_action_calendar c
join no_action a
on (c.no_act_dt = a.no_act_dt - 1)
)
select *
from no_action
order by no_act_dt desc;
If you use this often or from several places, you can parameterize it with a SQL function (see the demo for both).
create or replace
function consective_no_action_dates (date_in date)
returns setof date
language sql
as $$
with recursive no_action(no_act_dt) as
( select no_act_dt
from no_action_calendar
where no_act_dt = date_in
union all
select c.no_act_dt
from no_action_calendar c
join no_action a
on (c.no_act_dt = a.no_act_dt - 1)
)
select *
from no_action
order by no_act_dt desc;
$$;
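A hypothetical call of the function above (the date literal is only an example):
select * from consective_no_action_dates('2021-11-24'::date);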

How to average values in one table based on the condition involving another table in SQL?

I have two tables. One defines time intervals (beginning and end). Time intervals are not equal in length. Another contains product ID, start and end date of the product.
TableOne:
Interval StartDateTime EndDateTime
202020201 2020-01-01 00:00:00 2020-02-10 00:00:00
202020202 2020-02-10 00:00:00 2020-02-20 00:00:00
TableTwo
ProductID ProductStartDateTime ProductEndDateTime
ASSDWE1 2018-01-04 00:12:00 2020-04-10 20:00:30
ADFGHER 2020-01-05 00:11:30 2020-01-19 00:00:00
ASDFVBN 2017-10-10 00:12:10 2020-02-23 00:23:23
I need to compute the average length of the products from TableTwo that existed during the time intervals defined in TableOne. If a product extends past the end of a time interval, its length for that interval is measured from the product's start date to the end of the interval.
I tried the following
select
    a.*,
    (select
         AVG(datediff(day, b.ProductStartDateTime,
             IIF(b.ProductEndDateTime > a.EndDateTime, a.EndDateTime, b.ProductEndDateTime)))
         -- compute average length of the products
     FROM #TableTwo b
     WHERE not (b.ProductEndDateTime <= a.StartDateTime)
       and not (b.ProductStartDateTime >= a.EndDateTime)
         -- select products that existed during the interval from #TableOne
    ) as AverageProductLength
from #TableOne a
I get the error "Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression."
The result I want:
Interval StartDateTime EndDateTime AverageProductLength
202020201 2020-01-01 00:00:00 2020-02-10 00:00:00 23
202020202 2020-02-10 00:00:00 2020-02-20 00:00:00 34.5
Is there a way I can do the averaging?
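The error occurs because the aggregated expression in the correlated subquery references both an outer column (a.EndDateTime) and inner columns, which SQL Server disallows. A hedged sketch of one possible rewrite, using a join plus GROUP BY with the same overlap condition (names taken from the question; not a verified answer):
select a.Interval, a.StartDateTime, a.EndDateTime,
       avg(datediff(day, b.ProductStartDateTime,
                    iif(b.ProductEndDateTime > a.EndDateTime,
                        a.EndDateTime, b.ProductEndDateTime)) * 1.0) as AverageProductLength
from #TableOne a
left join #TableTwo b
  on b.ProductEndDateTime > a.StartDateTime   -- product overlaps the interval
 and b.ProductStartDateTime < a.EndDateTime
group by a.Interval, a.StartDateTime, a.EndDateTime;
The * 1.0 keeps the integer datediff from being averaged with integer division (the expected output shows 34.5).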

How can I extract the values of the last aggregation date in sql

I have the following table.
id user time_stamp
1 Mike 2020-02-13 00:00:00 UTC
2 John 2020-02-13 00:00:00 UTC
3 Levy 2020-02-12 00:00:00 UTC
4 Sam 2020-02-12 00:00:00 UTC
5 Frodo 2020-02-11 00:00:00 UTC
Let's say 2020-02-13 00:00:00 UTC is the last day, and I would like to query this table to display only that day's results. I want to create a view in BigQuery so that I only and always get the last day's results.
So that in the end I get something like this (For last day which is 2020-02-13 00:00:00 UTC )
id user time_stamp
1 Mike 2020-02-13 00:00:00 UTC
2 John 2020-02-13 00:00:00 UTC
You can use window functions (note the descending order, so that seqnum = 1 is the latest day):
select t.* except (seqnum)
from (select t.*,
             dense_rank() over (order by time_stamp desc) as seqnum
      from t
     ) t
where seqnum = 1;
This may not work well on a large amount of data, because of the way that BigQuery implements window functions with no partitioning. So, you might find that this works better (especially if the above runs out of resources):
select t.*
from t join
     (select max(time_stamp) as max_time_stamp
      from t
     ) tt
     on t.time_stamp = tt.max_time_stamp;
Also, if the timestamps actually have date components, then you will want to convert to a date or remove the time component somehow.
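For instance, a minimal sketch of that variant, comparing by calendar date via BigQuery's DATE():
select t.*
from t
where date(t.time_stamp) = (select max(date(time_stamp)) from t);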