hive server pushing parquet timestamp values by one hour in BST timezone - hive

HDP 3.1.4: the table contains a Parquet timestamp column with hourly data. HiveServer pushes the last hour of each day into the next date, as shown below. Compare the rows before and after 29 Mar 2020, the date on which BST (daylight saving time) begins.
| 2020-03-22 | 2020-03-22 00:00:59.0 | 2020-03-22 23:59:59.0 |
| 2020-03-23 | 2020-03-23 00:00:59.0 | 2020-03-23 23:59:59.0 |
| 2020-03-24 | 2020-03-24 00:00:59.0 | 2020-03-24 23:59:59.0 |
| 2020-03-25 | 2020-03-25 00:00:59.0 | 2020-03-25 23:59:59.0 |
| 2020-03-26 | 2020-03-26 00:00:59.0 | 2020-03-26 23:59:59.0 |
| 2020-03-27 | 2020-03-27 00:00:59.0 | 2020-03-27 23:59:59.0 |
| 2020-03-28 | 2020-03-28 00:00:59.0 | 2020-03-28 23:59:59.0 |
| 2020-03-29 | 2020-03-29 00:00:59.0 | 2020-03-30 00:59:59.0 |
| 2020-03-30 | 2020-03-30 01:00:59.0 | 2020-03-31 00:59:59.0 |
| 2020-03-31 | 2020-03-31 01:00:59.0 | 2020-04-01 00:59:59.0 |
| 2020-04-01 | 2020-04-01 01:00:59.0 | 2020-04-02 00:59:59.0 |
| 2020-04-02 | 2020-04-02 01:00:59.0 | 2020-04-03 00:59:59.0 |

When writing to a Parquet table in Hive, make sure the timestamp values are in UTC, and set the time zone in Hive to match the local timezone:
set time zone LOCAL;
or
set time zone '+1:00';
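
To see why the last hour of each day spills into the next date once BST starts, here is a minimal Python sketch (not Hive code). It assumes the stored values are UTC instants and that the reader displays them at a fixed offset; the GMT and BST constants and the render helper are names made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for UK time before and after 29 Mar 2020 (assumption).
GMT = timezone(timedelta(hours=0), "GMT")
BST = timezone(timedelta(hours=1), "BST")

def render(utc_instant, local_zone):
    """Format a UTC instant the way a reader using local_zone would display it."""
    return utc_instant.astimezone(local_zone).strftime("%Y-%m-%d %H:%M:%S")

# Last record of each day, stored as a UTC instant.
last_of_28th = datetime(2020, 3, 28, 23, 59, 59, tzinfo=timezone.utc)
last_of_29th = datetime(2020, 3, 29, 23, 59, 59, tzinfo=timezone.utc)

shown_28th = render(last_of_28th, GMT)  # still on 2020-03-28
shown_29th = render(last_of_29th, BST)  # shifted onto 2020-03-30
```

The second value lands on 2020-03-30 00:59:59, which matches the pattern in the table above from 29 Mar onward.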


SQL - Split open & Close time Into intervals of 30 minutes

Purpose: I work in the hospitality industry. I want to understand at what times the restaurant is full and at what times it is less busy. I have the opening and closing times of each check, and I want to split them into 30-minute intervals.
I would really appreciate it if you could please help me.
Thank you in advance.
Table
Check# Open CloseTime
25484 17:34 18:06
25488 18:04 21:22
Output
Check# Open Close Duration
25484 17:34 18:00 0:25
25484 18:00 18:30 0:30
25488 18:08 18:30 0:21
25488 18:30 19:00 0:30
25488 19:00 19:30 0:30
25488 19:30 20:00 0:30
25488 20:00 20:30 0:30
25488 20:30 21:00 0:30
25488 21:00 21:30 0:30
I am new to SQL. I am good at Excel, but due to its limitations I want to use SQL. I only know the basics of SQL.
I have searched on Google but could not find a solution. All I can see are examples that use date keywords rather than field names in the code, so I am unable to adapt them.
Could you try this, it works in MySQL 8.0:
WITH RECURSIVE times AS (
SELECT time '0:00' AS `Open`, time '0:30' as `Close`
UNION ALL
SELECT addtime(`Open`, '0:30'), addtime(`Close`, '0:30')
FROM times
WHERE `Open` < time '23:30'
)
SELECT c.`Check`,
greatest(t.`Open`, c.`Open`) `Open`,
least(t.`Close`, c.`CloseTime`) `Close`,
timediff(least(t.`Close`, c.`CloseTime`), greatest(t.`Open`, c.`Open`)) `Duration`
FROM times t
JOIN checks c ON (c.`Open` < t.`Close` AND c.`CloseTime` > t.`Open`);
| Check | Open | Close | Duration |
| ----- | -------- | -------- | -------- |
| 25484 | 17:34:00 | 18:00:00 | 00:26:00 |
| 25484 | 18:00:00 | 18:06:00 | 00:06:00 |
| 25488 | 18:04:00 | 18:30:00 | 00:26:00 |
| 25488 | 18:30:00 | 19:00:00 | 00:30:00 |
| 25488 | 19:00:00 | 19:30:00 | 00:30:00 |
| 25488 | 19:30:00 | 20:00:00 | 00:30:00 |
| 25488 | 20:00:00 | 20:30:00 | 00:30:00 |
| 25488 | 20:30:00 | 21:00:00 | 00:30:00 |
| 25488 | 21:00:00 | 21:22:00 | 00:22:00 |
This works for SQL Server 2019:
WITH times([Open], [Close]) AS (
SELECT cast({t'00:00:00'} as time) as "Open",
cast({t'00:30:00'} as time) as "Close"
UNION ALL
SELECT dateadd(minute, 30, [Open]), dateadd(minute, 30, [Close])
FROM times
WHERE [Open] < cast({t'23:30:00'} as time)
)
SELECT c.[Check],
iif(t.[Open] > c.[Open], t.[Open], c.[Open]) as [Open],
iif(t.[Close] < c.[CloseTime], t.[Close], c.[CloseTime]) as [Close],
datediff(minute,
iif(t.[Open] > c.[Open], t.[Open], c.[Open]),
iif(t.[Close] < c.[CloseTime], t.[Close], c.[CloseTime])) Duration
FROM times t
JOIN checks c ON (c.[Open] < t.[Close] AND c.[CloseTime] > t.[Open]);
Check | Open | Close | Duration
25484 | 17:34:00.0000000 | 18:00:00.0000000 | 26
25484 | 18:00:00.0000000 | 18:06:00.0000000 | 6
25488 | 18:04:00.0000000 | 18:30:00.0000000 | 26
25488 | 18:30:00.0000000 | 19:00:00.0000000 | 30
25488 | 19:00:00.0000000 | 19:30:00.0000000 | 30
25488 | 19:30:00.0000000 | 20:00:00.0000000 | 30
25488 | 20:00:00.0000000 | 20:30:00.0000000 | 30
25488 | 20:30:00.0000000 | 21:00:00.0000000 | 30
25488 | 21:00:00.0000000 | 21:22:00.0000000 | 22
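
Both queries rely on the same idea: generate fixed 30-minute slots, keep the slots that overlap a check, and clip each slot to the check's own open and close times. A minimal Python sketch of that logic follows, assuming times are HH:MM strings within a single day; the function name split_into_slots is made up for this illustration.

```python
from datetime import datetime, timedelta

def split_into_slots(open_time, close_time, minutes=30):
    """Split [open_time, close_time) into pieces aligned to fixed slots.

    Returns (piece_open, piece_close, duration_in_minutes) tuples.
    Assumes both times fall on the same day (no wrap past midnight).
    """
    fmt = "%H:%M"
    start = datetime.strptime(open_time, fmt)
    end = datetime.strptime(close_time, fmt)
    step = timedelta(minutes=minutes)

    # Align the first slot to the slot boundary at or before the opening time.
    midnight = start.replace(hour=0, minute=0)
    slot_start = midnight + ((start - midnight) // step) * step

    pieces = []
    while slot_start < end:
        slot_end = slot_start + step
        # Same role as greatest(...) and least(...) in the MySQL query.
        piece_start = max(slot_start, start)
        piece_end = min(slot_end, end)
        duration = int((piece_end - piece_start).total_seconds() // 60)
        pieces.append((piece_start.strftime(fmt), piece_end.strftime(fmt), duration))
        slot_start = slot_end
    return pieces

first_check = split_into_slots("17:34", "18:06")
second_check = split_into_slots("18:04", "21:22")
```

For the two sample checks this gives the same pieces as the query results above: two pieces for check 25484 and seven for check 25488.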

How to resample pandas to hydrologic year (Sep 1 - Aug 31)

I'd like to analyze some daily data by hydrologic year: From 1 September to 31 August. I've created a synthetic data set with:
import pandas as pd
t = pd.date_range(start='2015-01-01', freq='D', end='2021-09-03')
df = pd.DataFrame(index = t)
df['hydro_year'] = df.index.year
df.loc[df.index.month >= 9, 'hydro_year'] += 1
df['id'] = df['hydro_year'] - df.index.year[0]
df['count'] = 1
Note that in reality I do not have a hydro_year column so I do not use groupby. I would expect the following to resample by hydrologic year:
print(df['2015-09-01':].resample('12M').agg({'hydro_year':'mean','id':'mean','count':'sum'}))
But the output does not align:
| | hydro_year | id | count |
|---------------------+------------+---------+-------|
| 2015-09-30 00:00:00 | 2016 | 1 | 30 |
| 2016-09-30 00:00:00 | 2016.08 | 1.08197 | 366 |
| 2017-09-30 00:00:00 | 2017.08 | 2.08219 | 365 |
| 2018-09-30 00:00:00 | 2018.08 | 3.08219 | 365 |
| 2019-09-30 00:00:00 | 2019.08 | 4.08219 | 365 |
| 2020-09-30 00:00:00 | 2020.08 | 5.08197 | 366 |
| 2021-09-30 00:00:00 | 2021.01 | 6.00888 | 338 |
However, if I start a day earlier, then things do align, except the first day is 'early' and dangling alone...
| | hydro_year | id | count |
|---------------------+------------+----+-------|
| 2015-08-31 00:00:00 | 2015 | 0 | 1 |
| 2016-08-31 00:00:00 | 2016 | 1 | 366 |
| 2017-08-31 00:00:00 | 2017 | 2 | 365 |
| 2018-08-31 00:00:00 | 2018 | 3 | 365 |
| 2019-08-31 00:00:00 | 2019 | 4 | 365 |
| 2020-08-31 00:00:00 | 2020 | 5 | 366 |
| 2021-08-31 00:00:00 | 2021 | 6 | 365 |
| 2022-08-31 00:00:00 | 2022 | 7 | 3 |
IIUC, you can use 12MS (Start) instead of 12M:
>>> df['2015-09-01':].resample('12MS') \
.agg({'hydro_year':'mean','id':'mean','count':'sum'})
hydro_year id count
2015-09-01 2016.0 1.0 366
2016-09-01 2017.0 2.0 365
2017-09-01 2018.0 3.0 365
2018-09-01 2019.0 4.0 365
2019-09-01 2020.0 5.0 366
2020-09-01 2021.0 6.0 365
2021-09-01 2022.0 7.0 3
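We can try an anchored offset, an annual frequency with years starting in September (AS-SEP; newer pandas releases deprecate the AS alias in favor of YS, so use YS-SEP there):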
resampled_df = df['2015-09-01':].resample('AS-SEP').agg({
'hydro_year': 'mean', 'id': 'mean', 'count': 'sum'
})
hydro_year id count
2015-09-01 2016.0 1.0 366
2016-09-01 2017.0 2.0 365
2017-09-01 2018.0 3.0 365
2018-09-01 2019.0 4.0 365
2019-09-01 2020.0 5.0 366
2020-09-01 2021.0 6.0 365
2021-09-01 2022.0 7.0 3
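
If the hydrologic year label itself is wanted as the index, grouping on a key derived from the dates is another option. This is a minimal sketch, assuming the same kind of daily index as in the question; the hydro_year array and the totals name are illustrative and do not come from the answers above.

```python
import pandas as pd

t = pd.date_range(start="2015-09-01", end="2021-09-03", freq="D")
df = pd.DataFrame({"count": 1}, index=t)

# Days from September onward belong to the hydrologic year that ends the
# following August, so they get the next calendar year as their label.
hydro_year = df.index.year.to_numpy() + (df.index.month.to_numpy() >= 9)

# Group on the derived key; no hydro_year column has to exist in df.
totals = df.groupby(hydro_year)["count"].sum()
```

The per-year day counts should agree with the count column of the resampled output above, with the index holding the hydrologic year (2016 to 2022) instead of the period start date.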

hql split time into intervals

I have a Hive table with some data, and I would like to split each row into 15-minute intervals and return the total call duration for every interval.
Hive Table example :
ID Start End Total Duration
1 1502296261 1502325061 28800
My output should be shown as :
ID Interval Duration
1 2017-08-09 18:30:00 839
1 2017-08-09 18:45:00 900
1 2017-08-09 19:00:00 900
...
1 2017-08-10 02:15:00 900
1 2017-08-10 02:30:00 61
What is the best way to do that efficiently?
Thanks.
This is the basic solution.
The displayed timestamp (Interval) depends on your system timezone.
with t as (select stack(1,1,1502296261,1502325061) as (`ID`,`Start`,`End`))
select t.`ID` as `ID`
,from_unixtime((t.`Start` div (15*60) + pe.pos)*(15*60)) as `Interval`
, case
when pe.pos = t.`End` div (15*60) - t.`Start` div (15*60)
then t.`End`
else (t.`Start` div (15*60) + pe.pos + 1)*(15*60)
end
- case
when pe.pos = 0
then t.`Start`
else (t.`Start` div (15*60) + pe.pos)*(15*60)
end as `Duration`
from t
lateral view
posexplode(split(space(int(t.`End` div (15*60) - t.`Start` div (15*60))),' ')) pe
;
+----+---------------------+----------+
| id | interval | duration |
+----+---------------------+----------+
| 1 | 2017-08-09 09:30:00 | 839 |
| 1 | 2017-08-09 09:45:00 | 900 |
| 1 | 2017-08-09 10:00:00 | 900 |
| 1 | 2017-08-09 10:15:00 | 900 |
| 1 | 2017-08-09 10:30:00 | 900 |
| 1 | 2017-08-09 10:45:00 | 900 |
| 1 | 2017-08-09 11:00:00 | 900 |
| 1 | 2017-08-09 11:15:00 | 900 |
| 1 | 2017-08-09 11:30:00 | 900 |
| 1 | 2017-08-09 11:45:00 | 900 |
| 1 | 2017-08-09 12:00:00 | 900 |
| 1 | 2017-08-09 12:15:00 | 900 |
| 1 | 2017-08-09 12:30:00 | 900 |
| 1 | 2017-08-09 12:45:00 | 900 |
| 1 | 2017-08-09 13:00:00 | 900 |
| 1 | 2017-08-09 13:15:00 | 900 |
| 1 | 2017-08-09 13:30:00 | 900 |
| 1 | 2017-08-09 13:45:00 | 900 |
| 1 | 2017-08-09 14:00:00 | 900 |
| 1 | 2017-08-09 14:15:00 | 900 |
| 1 | 2017-08-09 14:30:00 | 900 |
| 1 | 2017-08-09 14:45:00 | 900 |
| 1 | 2017-08-09 15:00:00 | 900 |
| 1 | 2017-08-09 15:15:00 | 900 |
| 1 | 2017-08-09 15:30:00 | 900 |
| 1 | 2017-08-09 15:45:00 | 900 |
| 1 | 2017-08-09 16:00:00 | 900 |
| 1 | 2017-08-09 16:15:00 | 900 |
| 1 | 2017-08-09 16:30:00 | 900 |
| 1 | 2017-08-09 16:45:00 | 900 |
| 1 | 2017-08-09 17:00:00 | 900 |
| 1 | 2017-08-09 17:15:00 | 900 |
| 1 | 2017-08-09 17:30:00 | 61 |
+----+---------------------+----------+
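
The arithmetic in the query can be checked outside Hive. Below is a minimal Python sketch of the same bucketing, assuming Start and End are epoch seconds; the function name split_epoch_range is made up for this illustration. It returns the bucket start as epoch seconds and leaves formatting aside, since the displayed timestamp depends on the timezone. Unlike the query, it emits no zero-length piece when End falls exactly on a bucket boundary.

```python
def split_epoch_range(start, end, step=15 * 60):
    """Split [start, end) into pieces aligned to step-second buckets.

    Returns (bucket_start_epoch, seconds_inside_bucket) tuples.
    """
    pieces = []
    # Floor the start to its bucket, like `Start` div (15*60) in the query.
    bucket = start // step * step
    while bucket < end:
        piece_start = max(bucket, start)      # first piece begins at start
        piece_end = min(bucket + step, end)   # last piece stops at end
        pieces.append((bucket, piece_end - piece_start))
        bucket += step
    return pieces

pieces = split_epoch_range(1502296261, 1502325061)
```

For the sample row this yields 33 pieces, the first lasting 839 seconds and the last 61 seconds, and the durations add up to the Total Duration of 28800.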

Postgres, Update TIMESTAMP to current date but preserve time of day

In my Postgres database, I have the following table:
SELECT start_at, end_at FROM schedules;
+---------------------+---------------------+
| start_at | end_at |
|---------------------+---------------------|
| 2016-09-05 16:30:00 | 2016-09-05 17:30:00 |
| 2016-09-05 17:30:00 | 2016-09-05 18:30:00 |
| 2017-08-13 03:00:00 | 2017-08-13 07:00:00 |
| 2017-08-13 03:00:00 | 2017-08-13 07:00:00 |
| 2017-08-13 18:42:26 | 2017-08-13 21:30:46 |
| 2017-08-10 00:00:00 | 2017-08-10 03:30:00 |
| 2017-08-09 18:00:00 | 2017-08-10 03:00:00 |
| 2017-08-06 23:00:00 | 2017-08-07 03:00:00 |
| 2017-08-07 01:00:00 | 2017-08-07 03:48:20 |
| 2017-08-07 01:00:00 | 2017-08-07 03:48:20 |
| 2017-08-07 18:05:00 | 2017-08-07 20:53:20 |
| 2017-08-07 14:00:00 | 2017-08-08 01:00:00 |
| 2017-08-07 18:00:00 | 2017-08-07 20:48:20 |
| 2017-08-08 08:00:00 | 2017-08-09 00:00:00 |
| 2017-08-09 21:30:00 | 2017-08-10 00:18:20 |
| 2017-08-13 03:53:26 | 2017-08-13 06:41:46 |
+---------------------+---------------------+
Assume I also have an ID column. I want to update all the start and end times so that they fall on today's date while keeping their time of day. What is the most efficient SQL to accomplish this? My table could have millions of rows.
The best I can think of is this:
update schedules
set start_at = current_date + start_at::time
, end_at = current_date + end_at::time
WHERE start_at::date <> current_date
or end_at::date <> current_date;
The arithmetic is fast compared to accessing the rows.
If not all rows need updating, the WHERE clause will help efficiency, because updates are expensive.
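
The expression current_date + start_at::time keeps the time of day and swaps the date. A minimal Python sketch of the same operation follows; the helper name move_to_day and the fixed target date are assumptions for illustration, standing in for current_date. It uses one row from the table above whose end_at falls on the following day.

```python
from datetime import date, datetime

def move_to_day(ts, target_day):
    """Keep the time of day of ts but place it on target_day."""
    return datetime.combine(target_day, ts.time())

today = date(2024, 1, 15)  # hypothetical stand-in for current_date

# A row from the sample data that crosses midnight.
start_at = datetime(2017, 8, 9, 18, 0, 0)
end_at = datetime(2017, 8, 10, 3, 0, 0)

new_start = move_to_day(start_at, today)
new_end = move_to_day(end_at, today)

# Both values now share one date, so the end precedes the start.
ends_before_start = new_end < new_start
```

For rows like this one, the moved end_at lands before the moved start_at, so schedules that span midnight may need separate handling, depending on what the application expects.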

Oracle: Select parallel entries

I am looking for the most efficient way to run a relatively complicated query on a relatively large table.
The concept is this:
I have a table that holds records of phases that can run parallel to each other
The number of records exceeds 5 million (and is increasing)
The time period starts about 5 years ago
For performance reasons, this select could be restricted to the last 3 months, about 300.000 records (only if it is not physically possible to run it on the whole table)
Oracle version: 11g
The data sample looks as follows
Table Phases (ID, START_TS, END_TS, PRIO)
1 10:00:00 10:20:10 10
2 10:05:00 10:10:00 11
3 10:05:20 10:15:00 9
4 10:16:00 10:25:00 8
5 10:24:00 10:45:15 1
6 10:26:00 10:30:00 10
7 10:27:00 10:35:00 15
8 10:34:00 10:50:00 5
9 10:50:00 10:55:00 20
10 10:55:00 11:00:00 15
Above you can see how the information is currently stored (of course there are several other columns with irrelevant information).
There are two requirements (or problems to be solved):
If we sum the durations of all the phases, the result is MUCH more than the one hour that the above data represents. (There could be gaps between the phases, so taking the first start_ts and the last end_ts would not be sufficient.)
The data should be displayed so that it is visible which phases run parallel with which, and which phase had the highest priority at each time, as shown in the expected view below.
Here it is easy to distinguish the highest priority phase at each time (HIGHEST_PRIO), and adding up their durations would give the actual total duration.
View V_Parallel_Phases (ID, START_TS, END_TS, PRIO, HIGHEST_PRIO)
-> Optional Columns: Part_of_ID / Runs_Parallel
1 10:00:00 10:05:20 10 True (--> Part_1 / False)
1 10:05:20 10:15:00 10 False (--> Part_2 / True)
2 10:05:00 10:10:00 11 False (--> Part_1 / True)
3 10:05:20 10:15:00 9 True (--> Part_1 / True)
1 10:15:00 10:16:00 10 True (--> Part_3 / True)
1 10:16:00 10:20:10 10 False (--> Part_4 / True)
4 10:16:00 10:24:00 8 True (--> Part_1 / True)
4 10:24:00 10:25:00 8 False (--> Part_2 / True)
5 10:24:00 10:45:15 1 True (--> Part_1 / True)
6 10:26:00 10:30:00 10 False (--> Part_1 / True)
7 10:27:00 10:35:00 15 False (--> Part_1 / True)
8 10:34:00 10:45:15 5 False (--> Part_1 / True)
8 10:45:15 10:50:00 5 True (--> Part_2 / True)
9 10:50:00 10:55:00 20 True (--> Part_2 / False)
10 10:55:00 11:00:00 15 True (--> Part_2 / False)
Unfortunately I am not aware of an efficient way to write this query. The current solution was to do the above calculations programmatically in the tool that generates a large report, but it was a total failure. The report needed 30 seconds before these calculations were added; now it needs over 10 minutes, without even taking the priorities of the phases into consideration.
Then I thought of translating this code into SQL as either: a) a view, b) a materialized view, or c) a table that I would fill with a procedure once in a while (depending on the required duration).
PS: I am aware that Oracle has some analytical functions that can handle complicated queries, but I do not know which of them could actually help with the current problem.
Thank you in advance!
This is an incomplete answer, but I need to know if this approach is viable before going on. I believe it is possible to do completely in SQL, but I am not sure how the performance will be.
First find out all points in time where there is a transition:
CREATE VIEW Events AS
SELECT START_TS AS TS
FROM Phases
UNION
SELECT END_TS AS TS
FROM Phases
;
Then create (start, end) tuples from those points in time:
CREATE VIEW Segments AS
-- pair every point in time with the next later point in time
SELECT s.TS AS START_TS,
MIN(e.TS) AS END_TS
FROM Events s
JOIN Events e
ON s.TS < e.TS
GROUP BY s.TS
;
From here on, doing the rest should be fairly straightforward. Here is a query that lists the segments and all the phases that are active in the given segment:
SELECT *
FROM Segments
JOIN Phases
ON Segments.START_TS BETWEEN Phases.START_TS AND Phases.END_TS
AND Segments.END_TS BETWEEN Phases.START_TS AND Phases.END_TS
ORDER BY Segments.START_TS
;
The rest can be done with subselects and some aggregates.
| START_TS | END_TS | ID | START_TS | END_TS | PRIO |
|----------|----------|----|----------|----------|------|
| 10:00:00 | 10:05:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:00 | 10:05:20 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:00 | 10:05:20 | 2 | 10:05:00 | 10:10:00 | 11 |
| 10:05:20 | 10:10:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:20 | 10:10:00 | 2 | 10:05:00 | 10:10:00 | 11 |
| 10:05:20 | 10:10:00 | 3 | 10:05:20 | 10:15:00 | 9 |
| 10:10:00 | 10:15:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:10:00 | 10:15:00 | 3 | 10:05:20 | 10:15:00 | 9 |
| 10:15:00 | 10:16:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:16:00 | 10:20:10 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:16:00 | 10:20:10 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:20:10 | 10:24:00 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:24:00 | 10:25:00 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:24:00 | 10:25:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:25:00 | 10:26:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:26:00 | 10:27:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:26:00 | 10:27:00 | 6 | 10:26:00 | 10:30:00 | 10 |
| 10:27:00 | 10:30:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:27:00 | 10:30:00 | 6 | 10:26:00 | 10:30:00 | 10 |
| 10:27:00 | 10:30:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:30:00 | 10:34:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:30:00 | 10:34:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:34:00 | 10:35:00 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:34:00 | 10:35:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:34:00 | 10:35:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:35:00 | 10:45:15 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:35:00 | 10:45:15 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:45:15 | 10:50:00 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:50:00 | 10:55:00 | 9 | 10:50:00 | 10:55:00 | 20 |
| 10:55:00 | 11:00:00 | 10 | 10:55:00 | 11:00:00 | 15 |
There is a SQL fiddle demonstrating the whole thing here:
http://sqlfiddle.com/#!9/d801b/2
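
The remaining step, picking the highest priority phase per segment, can be sketched outside SQL. The Python below is a minimal illustration of the events, segments and active-phases approach described above, not the fiddle's code. It assumes zero-padded HH:MM:SS strings (so plain string comparison orders them correctly) and that the smallest PRIO value is the highest priority, which is how the expected view in the question reads. The name split_phases is made up for this sketch.

```python
def split_phases(phases):
    """phases: list of (id, start_ts, end_ts, prio) tuples.

    Returns (id, seg_start, seg_end, prio, highest_prio) rows, one per
    phase per segment in which that phase is active.
    """
    # Every start or end is a transition point (the Events view).
    points = sorted({ts for _, start, end, _ in phases for ts in (start, end)})
    rows = []
    # Consecutive transition points form the segments (the Segments view).
    for seg_start, seg_end in zip(points, points[1:]):
        active = [p for p in phases if p[1] <= seg_start and seg_end <= p[2]]
        if not active:
            continue  # a gap between phases
        best = min(active, key=lambda p: p[3])  # assumption: lowest PRIO wins
        for p in active:
            rows.append((p[0], seg_start, seg_end, p[3], p is best))
    return rows

# First three phases from the sample data in the question.
sample = [
    (1, "10:00:00", "10:20:10", 10),
    (2, "10:05:00", "10:10:00", 11),
    (3, "10:05:20", "10:15:00", 9),
]
rows = split_phases(sample)
winners = [(r[0], r[1], r[2]) for r in rows if r[4]]
```

Summing the durations of the rows flagged as highest priority gives the covered time without double counting overlaps, since exactly one phase is flagged per segment.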