Impala: Split single row into multiple rows based on Date and time

I want to split a single row into multiple rows based on time.
SrNo Employee StartDate EndDate
---------------------------------------------------------------------------
1 emp1 30/03/2020 09:00:00 31/03/2020 07:15:00
2 emp2 01/04/2020 09:00:00 02/04/2020 08:00:00
Expected output is below:
SrNo Employee StartDate EndDate
---------------------------------------------------------------------------
1 emp1 30/03/2020 09:00:00 30/03/2020 11:59:00
1 emp1 31/03/2020 00:00:00 31/03/2020 07:15:00
2 emp2 01/04/2020 09:00:00 01/04/2020 11:59:00
2 emp2 02/04/2020 00:00:00 02/04/2020 08:00:00
A day runs from 00:00 (midnight) to 00:00 of the next day. When the EndDate falls after the midnight that follows the StartDate, the row should be split in two: the first row ends at 30/03/2020 11:59:00 (11:59 PM) and the next row starts at 31/03/2020 00:00:00.
Please help me get this solved.

This would be a good spot for a recursive CTE, but unfortunately Hive does not support those. Here is another approach, which uses a derived table of numbers to split the periods:
select
t.SrNo,
t.Employee,
greatest(t.startDate, date_add(to_date(t.startDate), x.n)) startDate,
least(t.endDate, date_add(to_date(t.startDate), x.n + 1)) endDate
from mytable t
inner join (select 0 n union all select 1 union all select 2) x
on date_add(to_date(t.startDate), x.n) <= t.endDate
You can expand the subquery to handle more possible periods per row (as written, it covers periods that touch up to three calendar days).
Also note that this generates half-open intervals, where the end of the preceding interval equals the start of the next one (whereas in your result set there is a one-minute gap). Each interval is inclusive of its lower bound and exclusive of its upper bound, which guarantees that no gap is left between consecutive intervals.
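To make the half-open splitting concrete, here is a minimal Python sketch of the same idea outside SQL. The function name `split_by_day` is made up for illustration and is not part of the answer above; it cuts a period at every midnight it crosses, so each piece ends exactly where the next one starts.

```python
from datetime import datetime, time, timedelta


def split_by_day(start, end):
    """Split [start, end) into half-open pieces that never cross midnight."""
    pieces = []
    current = start
    while current < end:
        # Midnight at the start of the day after `current`.
        next_midnight = datetime.combine(current.date(), time.min) + timedelta(days=1)
        piece_end = min(end, next_midnight)
        pieces.append((current, piece_end))
        current = piece_end
    return pieces


# First sample row from the question (emp1).
for piece_start, piece_end in split_by_day(
    datetime(2020, 3, 30, 9, 0), datetime(2020, 3, 31, 7, 15)
):
    print(piece_start, piece_end)
```

Unlike the fixed three-row numbers table in the query, the loop handles a period of any length, at the cost of not being expressible as a single join.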

Related

How can i create a new column count in SQL table where count=1 if hours column >=6 else count=0

I aim to first achieve this
id  employee  Datelog     TimeIn    TimeOut   Hours     Count
5   Two       2022-08-10  09:00:00  16:00:00  07:00:00  1
4   Two       2022-08-09  09:00:00  16:00:00  07:00:00  1
3   Two       2022-08-08  09:00:00  16:00:00  07:00:00  1
2   One       2022-08-05  09:00:00  16:00:00  07:00:00  1
1   Two       2022-08-04  09:00:00  10:00:00  01:00:00  0
and now my main objective here is to give a bonus of 2k to employees whose Totalcount per month >=3.
employee  Month   TotalCount  Bonus
Two       August  3           2000
One       August  1           0
Here's an answer using Postgres. It's pretty much generic, apart from extracting the month from datelog, which might have a slightly different syntax in other databases.
select employee
,max(date_part('month', datelog ))
,count(*)
,case when count(*) >= 3 then 2000 else 0 end as bonus
from t
where hours >= time '06:00:00'
group by employee
employee  max  count  bonus
Two       8    3      2000
One       8    1      0
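The counting rule can also be sketched in plain Python, which may help when checking the query against sample data. Everything here (the `logs` list, `monthly_bonus`, and its parameter names) is hypothetical and only mirrors the sample rows from the question. One deliberate difference from the query above: the sketch groups by employee and calendar month, so data spanning several months would get one row per month.

```python
from collections import Counter
from datetime import date, timedelta

# Sample rows from the question: (employee, datelog, hours worked).
logs = [
    ("Two", date(2022, 8, 10), timedelta(hours=7)),
    ("Two", date(2022, 8, 9), timedelta(hours=7)),
    ("Two", date(2022, 8, 8), timedelta(hours=7)),
    ("One", date(2022, 8, 5), timedelta(hours=7)),
    ("Two", date(2022, 8, 4), timedelta(hours=1)),
]


def monthly_bonus(logs, min_hours=timedelta(hours=6), min_days=3, bonus=2000):
    """Count qualifying days per (employee, year, month) and attach the bonus."""
    counts = Counter()
    for employee, day, hours in logs:
        # A day counts as 1 when at least `min_hours` were worked, else 0.
        counts[(employee, day.year, day.month)] += 1 if hours >= min_hours else 0
    return {
        key: (total, bonus if total >= min_days else 0)
        for key, total in counts.items()
    }


for key, (total, amount) in monthly_bonus(logs).items():
    print(key, total, amount)
```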

Rolling Sum Calculation Based on 2 Date Fields

Giving up after a few hours of failed attempts.
My data is in the following format; create_date can never be later than event_date.
I need to calculate, on a rolling n-day basis (let's say 3), the sum of units where both the create_date and the event_date fall within the same 3-day window. The data is illustrative: each event_date can have 500+ different create_dates associated with it, and that number isn't constant. Some event_dates may be missing.
So let's say for 2022-02-03, I only want to sum units where both the event_date and create_date values were between 2022-02-01 and 2022-02-03.
event_date  create_date  rowid  units
2022-02-01  2022-01-20   1      100
2022-02-01  2022-02-01   2      100
2022-02-02  2022-01-21   3      100
2022-02-02  2022-01-23   4      100
2022-02-02  2022-01-31   5      100
2022-02-02  2022-02-02   6      100
2022-02-03  2022-01-30   7      100
2022-02-03  2022-02-01   8      100
2022-02-03  2022-02-03   9      100
2022-02-05  2022-02-01   10     100
2022-02-05  2022-02-03   11     100
The output I need is shown below (in brackets I added the rows that should be included in the calculation for each date, but my result only needs to include the numerical sum). I tried calculating using either date, but neither returned the results I needed.
date        units
2022-02-01  100 (Row 2)
2022-02-02  300 (Row 2,5,6)
2022-02-03  300 (Row 2,6,8,9)
2022-02-04  200 (Row 6,9)
2022-02-05  200 (Row 9,11)
In Python I solved the above with a function that loops over the dates and filters a dataframe for each one, but I am struggling to do the same in SQL.
Thank you!
Consider the approach below:
with events_dates as (
select date from (
select min(event_date) min_date, max(event_date) max_date
from your_table
), unnest(generate_date_array(min_date, max_date)) date
)
select date, sum(units) as units, string_agg('' || rowid) rows_included
from events_dates
left join your_table
on create_date between date - 2 and date
and event_date between date - 2 and date
group by date
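Applied to the sample data in your question, this returns one row per calendar date between the first and last event_date, with the summed units and the ids of the rows included.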
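As a cross-check of the join condition, the same windowed filter can be sketched in Python over the sample rows. The names `rows` and `rolling_units` are made up for this sketch. Note that for 2022-02-03 four rows qualify (2, 6, 8 and 9), so the sum comes out as 400 rather than the 300 listed in the question.

```python
from datetime import date, timedelta

# Sample rows from the question: (event_date, create_date, rowid, units).
rows = [
    (date(2022, 2, 1), date(2022, 1, 20), 1, 100),
    (date(2022, 2, 1), date(2022, 2, 1), 2, 100),
    (date(2022, 2, 2), date(2022, 1, 21), 3, 100),
    (date(2022, 2, 2), date(2022, 1, 23), 4, 100),
    (date(2022, 2, 2), date(2022, 1, 31), 5, 100),
    (date(2022, 2, 2), date(2022, 2, 2), 6, 100),
    (date(2022, 2, 3), date(2022, 1, 30), 7, 100),
    (date(2022, 2, 3), date(2022, 2, 1), 8, 100),
    (date(2022, 2, 3), date(2022, 2, 3), 9, 100),
    (date(2022, 2, 5), date(2022, 2, 1), 10, 100),
    (date(2022, 2, 5), date(2022, 2, 3), 11, 100),
]


def rolling_units(rows, window_days=3):
    """For each calendar day, sum units of rows whose two dates both lie in the window."""
    first = min(r[0] for r in rows)
    last = max(r[0] for r in rows)
    result = {}
    day = first
    while day <= last:  # walk every calendar day, including missing event dates
        window_start = day - timedelta(days=window_days - 1)
        hits = [
            r for r in rows
            if window_start <= r[0] <= day and window_start <= r[1] <= day
        ]
        result[day] = (sum(r[3] for r in hits), [r[2] for r in hits])
        day += timedelta(days=1)
    return result


for day, (units, rowids) in rolling_units(rows).items():
    print(day, units, rowids)
```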

How to find entry that is between two dates?

I have a table as:
Id start_timestamp end_timestamp
1 2021-07-12 03:00:00 2021-07-13 11:58:05
2 2021-07-13 04:00:00 2021-07-13 05:00:00
3 2021-07-13 04:00:00 2021-07-13 09:00:00
4 2021-07-13 04:00:00 NULL
5 2020-04-10 04:00:00 2020-04-10 04:01:00
....
I want to find all records that overlap a window between two specific timestamps. Basically, I'm trying to understand which processes ran during a peak time of the day (it doesn't matter whether they spent one second or several hours in the window; any occurrence in the window is enough).
So if the timestamps are 2021-07-13 00:00:00 to 2021-07-13 04:30:00
The query will return
1
2
3
4
How can I do that with SQL? (Preferably Presto)
This is the overlapping range problem. You may use:
SELECT *
FROM yourTable
WHERE
(end_timestamp > '2021-07-13 00:00:00' OR end_timestamp IS NULL) AND
(start_timestamp < '2021-07-13 04:30:00' OR start_timestamp IS NULL);
My answer assumes that a missing start/end timestamp value in the table logically means that this value should not be considered. This seems to be the logic you want here.
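The overlap test itself is small enough to sketch in Python, which makes the NULL handling explicit. The names `overlaps`, `rows` and `matching_ids` are illustrative only, and the rows are the five visible sample records from the question. A missing bound (`None`) is treated as open-ended, matching the assumption stated above.

```python
from datetime import datetime


def overlaps(start, end, window_start, window_end):
    """True when [start, end] touches the window; None means an open bound."""
    ends_after_window_opens = end is None or end > window_start
    starts_before_window_closes = start is None or start < window_end
    return ends_after_window_opens and starts_before_window_closes


# Sample rows from the question: (id, start_timestamp, end_timestamp).
rows = [
    (1, datetime(2021, 7, 12, 3, 0), datetime(2021, 7, 13, 11, 58, 5)),
    (2, datetime(2021, 7, 13, 4, 0), datetime(2021, 7, 13, 5, 0)),
    (3, datetime(2021, 7, 13, 4, 0), datetime(2021, 7, 13, 9, 0)),
    (4, datetime(2021, 7, 13, 4, 0), None),
    (5, datetime(2020, 4, 10, 4, 0), datetime(2020, 4, 10, 4, 1)),
]

window_start = datetime(2021, 7, 13, 0, 0)
window_end = datetime(2021, 7, 13, 4, 30)

matching_ids = [
    row_id for row_id, start, end in rows
    if overlaps(start, end, window_start, window_end)
]
print(matching_ids)
```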

selecting a date n days ago excluding weekends

I have a table with daily dates starting from 31st December 1999 up to 31st December 2050, excluding weekends.
Given a particular date (for this example, let's use 2019-03-14), I want to pick the date that was 30 days previous, ignoring weekends, which in this case would be 2019-02-01. The number of days needs to be flexible, as it won't always be 30.
How can I do this?
I wrote the query below, and it does list the 30 days previous to the specified date.
select top 30 Date
from DateDimension
where IsWeekend = 0 and Date <= '2019-03-14'
order by Date desc
So I thought I could use the query below to get the correct answer of 2019-02-01
;with ds as
(
select top 30 Date
from DateDimension
where IsWeekend = 0 and Date <= '2019-03-14'
)
select min(Date) from ds
However, this doesn't work: it returns the first date in my table, 1999-12-31. For reference, these are the 30 dates returned by the first query:
2019-03-14
2019-03-13
2019-03-12
2019-03-11
2019-03-08
2019-03-07
2019-03-06
2019-03-05
2019-03-04
2019-03-01
2019-02-28
2019-02-27
2019-02-26
2019-02-25
2019-02-22
2019-02-21
2019-02-20
2019-02-19
2019-02-18
2019-02-15
2019-02-14
2019-02-13
2019-02-12
2019-02-11
2019-02-08
2019-02-07
2019-02-06
2019-02-05
2019-02-04
2019-02-01
TOP is meaningless without an ORDER BY, so you could do something like
;with ds as
(
select top 30 Date
from DateDimension
where IsWeekend = 0 and Date <= '2019-03-14'
order by Date DESC
)
select min(Date) from ds;
Even better would be to use the ANSI syntax instead of TOP. Note that OFFSET skips rows, so the 30th row is reached by skipping 29:
select Date
from DateDimension
where IsWeekend = 0 and Date <= '2019-03-14'
order by Date DESC
OFFSET 29 ROWS FETCH NEXT 1 ROW ONLY;
DISCLAIMER - code not tested since you did not provide DDL and sample data
HTH
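For a quick sanity check of which row number is wanted, the same count can be sketched in Python without the date dimension table. The function `nth_previous_weekday` is hypothetical, and it assumes that weekends are the only days excluded (no holidays) and that the given date itself counts as the first day, as in the question's list.

```python
from datetime import date, timedelta


def nth_previous_weekday(anchor, n):
    """Return the n-th weekday counting backwards, with `anchor` itself as number 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    day = anchor
    seen = 0
    while True:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            seen += 1
            if seen == n:
                return day
        day -= timedelta(days=1)


print(nth_previous_weekday(date(2019, 3, 14), 30))
```

With the sample date this lands on 2019-02-01, the 30th entry in the list from the question.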

SQL SELECT Difference between two days greater than 1 day

I have table T1
ID SCHEDULESTART SCHEDULEFINISH
1 2018-05-12 14:00:00 2018-05-14 11:00:00
2 2018-05-30 14:00:00 2018-06-01 11:00:00
3 2018-02-28 14:00:00 2018-03-02 11:00:00
4 2018-02-28 14:00:00 2018-03-01 11:00:00
5 2018-05-30 14:00:00 2018-05-31 11:00:00
I want to select all rows where the difference in days (the difference in hours is not important) is greater than 1 day.
If SCHEDULESTART and SCHEDULEFINISH are on the same day, or SCHEDULEFINISH is on the next day, then the row should NOT be selected.
So the result should return the rows with IDs 1, 2 and 3:
the first row has a difference of two days, as do the second row (1 June is 2 days after 30 May) and the third row (2 March is 2 days after 28 February).
Is this possible somehow?
I know the DAY function, but it only returns the day number within its month.
I must begin my query with
SELECT ID FROM T1 WHERE ...
Thanks in advance
In DB2, this should work:
select t1.*
from t1
where date(schedulestart) < date(schedulefinish) - 1 day;
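The comparison works because it drops the time of day before subtracting, so month boundaries need no special handling. A minimal Python sketch of the same test on the sample rows follows; `spans_more_than_one_day`, `rows` and `selected_ids` are illustrative names, not part of the DB2 answer.

```python
from datetime import datetime, timedelta


def spans_more_than_one_day(start, finish):
    """Compare calendar dates only, ignoring the time of day."""
    return finish.date() - start.date() > timedelta(days=1)


# Sample rows from the question: (ID, SCHEDULESTART, SCHEDULEFINISH).
rows = [
    (1, datetime(2018, 5, 12, 14, 0), datetime(2018, 5, 14, 11, 0)),
    (2, datetime(2018, 5, 30, 14, 0), datetime(2018, 6, 1, 11, 0)),
    (3, datetime(2018, 2, 28, 14, 0), datetime(2018, 3, 2, 11, 0)),
    (4, datetime(2018, 2, 28, 14, 0), datetime(2018, 3, 1, 11, 0)),
    (5, datetime(2018, 5, 30, 14, 0), datetime(2018, 5, 31, 11, 0)),
]

selected_ids = [
    row_id for row_id, start, finish in rows
    if spans_more_than_one_day(start, finish)
]
print(selected_ids)
```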