Pandas time shift from UTC to local

I am trying to convert UTC time to local time. This is what I had before:
df_combined_features['timestamp'][1:10]
2013-01-24 2013-01-24 11:00:00
2013-04-25 2013-04-25 10:00:00
2013-07-25 2013-07-25 10:00:00
2013-10-24 2013-10-24 10:00:00
2014-01-30 2014-01-30 11:00:00
2014-04-24 2014-04-24 10:00:00
2014-07-24 2014-07-24 10:00:00
2014-10-23 2014-10-23 10:00:00
2015-01-27 2015-01-27 11:00:00
This is what I did:
df_combined_features['timestamp'].tz_localize('US/Central')[1:10]
2013-01-24 00:00:00-06:00 2013-01-24 11:00:00
2013-04-25 00:00:00-05:00 2013-04-25 10:00:00
2013-07-25 00:00:00-05:00 2013-07-25 10:00:00
2013-10-24 00:00:00-05:00 2013-10-24 10:00:00
2014-01-30 00:00:00-06:00 2014-01-30 11:00:00
2014-04-24 00:00:00-05:00 2014-04-24 10:00:00
2014-07-24 00:00:00-05:00 2014-07-24 10:00:00
2014-10-23 00:00:00-05:00 2014-10-23 10:00:00
2015-01-27 00:00:00-06:00 2015-01-27 11:00:00
I think it did the right thing, but I don't understand the output format. In particular:
1) Why do the converted columns appear as the new index?
2) I understand that -06:00 (in the last row) is an hour shift, so the time is 6:00 am; how do I retrieve that information, i.e. the exact local time?
Desired output: I want the exact time to be posted, including the offset from UTC.
local time utc time
2013-01-24 05:00:00 2013-01-24 11:00:00
2013-04-25 05:00:00 2013-04-25 10:00:00
2013-07-25 05:00:00 2013-07-25 10:00:00
2013-10-24 05:00:00 2013-10-24 10:00:00
2014-01-30 05:00:00 2014-01-30 11:00:00
2014-04-24 05:00:00 2014-04-24 10:00:00
2014-07-24 05:00:00 2014-07-24 10:00:00
2014-10-23 05:00:00 2014-10-23 10:00:00
2015-01-27 05:00:00 2015-01-27 11:00:00

When you call tz_localize on a Series, you localize the index. If you want to modify the column values instead, you need the .dt accessor: call dt.tz_localize('utc') to mark the naive times as UTC, then dt.tz_convert('US/Central') to convert them to local time:
In [125]:
df['timestamp'].dt.tz_localize('utc').dt.tz_convert('US/Central')
Out[125]:
index
2013-01-24 2013-01-24 05:00:00-06:00
2013-04-25 2013-04-25 05:00:00-05:00
2013-07-25 2013-07-25 05:00:00-05:00
2013-10-24 2013-10-24 05:00:00-05:00
2014-01-30 2014-01-30 05:00:00-06:00
2014-04-24 2014-04-24 05:00:00-05:00
2014-07-24 2014-07-24 05:00:00-05:00
2014-10-23 2014-10-23 05:00:00-05:00
2015-01-27 2015-01-27 05:00:00-06:00
Name: timestamp, dtype: datetime64[ns, US/Central]
Compare with the same calls without the .dt accessor, which localize and convert the index instead:
In [126]:
df['timestamp'].tz_localize('utc').tz_convert('US/Central')
Out[126]:
index
2013-01-23 18:00:00-06:00 2013-01-24 11:00:00
2013-04-24 19:00:00-05:00 2013-04-25 10:00:00
2013-07-24 19:00:00-05:00 2013-07-25 10:00:00
2013-10-23 19:00:00-05:00 2013-10-24 10:00:00
2014-01-29 18:00:00-06:00 2014-01-30 11:00:00
2014-04-23 19:00:00-05:00 2014-04-24 10:00:00
2014-07-23 19:00:00-05:00 2014-07-24 10:00:00
2014-10-22 19:00:00-05:00 2014-10-23 10:00:00
2015-01-26 18:00:00-06:00 2015-01-27 11:00:00
Name: timestamp, dtype: datetime64[ns]
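To get the two-column layout from your desired output, you can assign the converted values back as a new column. A minimal sketch, assuming (as in your post) that 'timestamp' holds naive datetimes that are actually UTC:
import pandas as pd

# two sample rows standing in for df_combined_features
df = pd.DataFrame({'timestamp': pd.to_datetime(['2013-01-24 11:00:00',
                                                '2013-04-25 10:00:00'])})
# operate on the values, not the index: mark as UTC, then convert
df['local_time'] = df['timestamp'].dt.tz_localize('UTC').dt.tz_convert('US/Central')
print(df[['local_time', 'timestamp']])  # local time (with offset) next to UTC time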

Related

Analyze a Time Series

I am inserting data into a table with a date/time column.
I want to find the speed of inserts during particular durations, as follows:
Duration           # of Records
1:00pm - 2:00pm    1000
2:00pm - 3:00pm    1400
.......................
11:00pm - 12:00am  1100
I can find the above by repeatedly executing queries like:
select count(*) from table_A where insert_date between 1:00pm and 2:00pm
Is there an Oracle-supplied package/function that can produce the above report without having to execute separate statements?
Here are a couple of examples. To get "sparse" results, i.e. only the hours for which data exists in the table, you simply use TRUNC:
SQL> create table data ( d date );
Table created.
SQL>
SQL> insert into data
2 select date '2022-02-10' + dbms_random.normal/10
3 from dual
4 connect by level <= 10000;
10000 rows created.
SQL>
SQL> select trunc(d,'HH24'), count(*)
2 from data
3 group by trunc(d,'HH24')
4 order by 1;
TRUNC(D,'HH24') COUNT(*)
------------------- ----------
09/02/2022 13:00:00 1
09/02/2022 15:00:00 4
09/02/2022 16:00:00 10
09/02/2022 17:00:00 40
09/02/2022 18:00:00 126
09/02/2022 19:00:00 282
09/02/2022 20:00:00 595
09/02/2022 21:00:00 948
09/02/2022 22:00:00 1389
09/02/2022 23:00:00 1577
10/02/2022 00:00:00 1609
10/02/2022 01:00:00 1362
10/02/2022 02:00:00 956
10/02/2022 03:00:00 624
10/02/2022 04:00:00 281
10/02/2022 05:00:00 134
10/02/2022 06:00:00 43
10/02/2022 07:00:00 16
10/02/2022 08:00:00 2
10/02/2022 10:00:00 1
20 rows selected.
If you need to get ALL hours, even if there was no data for a given hour, you can OUTER JOIN the raw data to a synthetic list of rows covering every hour in the desired range, e.g.
SQL> with full_range as
2 ( select date '2022-02-09' + rownum/24 hr
3 from dual
4 connect by level <= 48
5 ),
6 raw_data as
7 ( select trunc(d,'HH24') dhr, count(*) cnt
8 from data
9 group by trunc(d,'HH24')
10 )
11 select full_range.hr, raw_data.cnt
12 from raw_data, full_range
13 where full_range.hr = raw_data.dhr(+)
14 order by 1;
HR CNT
------------------- ----------
09/02/2022 01:00:00
09/02/2022 02:00:00
09/02/2022 03:00:00
09/02/2022 04:00:00
09/02/2022 05:00:00
09/02/2022 06:00:00
09/02/2022 07:00:00
09/02/2022 08:00:00
09/02/2022 09:00:00
09/02/2022 10:00:00
09/02/2022 11:00:00
09/02/2022 12:00:00
09/02/2022 13:00:00 1
09/02/2022 14:00:00
09/02/2022 15:00:00 4
09/02/2022 16:00:00 10
09/02/2022 17:00:00 40
09/02/2022 18:00:00 126
09/02/2022 19:00:00 282
09/02/2022 20:00:00 595
09/02/2022 21:00:00 948
09/02/2022 22:00:00 1389
09/02/2022 23:00:00 1577
10/02/2022 00:00:00 1609
10/02/2022 01:00:00 1362
10/02/2022 02:00:00 956
10/02/2022 03:00:00 624
10/02/2022 04:00:00 281
10/02/2022 05:00:00 134
10/02/2022 06:00:00 43
10/02/2022 07:00:00 16
10/02/2022 08:00:00 2
10/02/2022 09:00:00
10/02/2022 10:00:00 1
10/02/2022 11:00:00
10/02/2022 12:00:00
10/02/2022 13:00:00
10/02/2022 14:00:00
10/02/2022 15:00:00
10/02/2022 16:00:00
10/02/2022 17:00:00
10/02/2022 18:00:00
10/02/2022 19:00:00
10/02/2022 20:00:00
10/02/2022 21:00:00
10/02/2022 22:00:00
10/02/2022 23:00:00
11/02/2022 00:00:00
48 rows selected.
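As a cross-reference to the pandas question above, roughly the same bucketing can be sketched in pandas. This is an untested sketch; the sample data merely mimics dbms_random.normal/10 (normal noise measured in days around 2022-02-10):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'d': pd.Timestamp('2022-02-10')
                        + pd.to_timedelta(rng.normal(size=10000) / 10, unit='D')})
# sparse counts, like GROUP BY TRUNC(d,'HH24'): only hours containing rows appear
sparse = df.groupby(df['d'].dt.floor('h')).size()
# dense counts: resample emits every hour in the range, zero where no rows fall
dense = df.set_index('d').sort_index().resample('h').size()
print(sparse.head())
print(dense.head())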

Get max data for every day in BigQuery

I have a table with daily data by hour. I want to get a table with only one row per day. That row should have the max value for the column AforoTotal.
This is a part of the table, containing the records of three days.
FechaHora             Fecha       Hora      AforoTotal
2022-01-13T16:00:00Z  2022-01-13  16:00:00  4532
2022-01-13T15:00:00Z  2022-01-13  15:00:00  4419
2022-01-13T14:00:00Z  2022-01-13  14:00:00  4181
2022-01-13T13:00:00Z  2022-01-13  13:00:00  3914
2022-01-13T12:00:00Z  2022-01-13  12:00:00  3694
2022-01-13T11:00:00Z  2022-01-13  11:00:00  3268
2022-01-13T10:00:00Z  2022-01-13  10:00:00  2869
2022-01-13T09:00:00Z  2022-01-13  09:00:00  2065
2022-01-13T08:00:00Z  2022-01-13  08:00:00  1308
2022-01-13T07:00:00Z  2022-01-13  07:00:00  730
2022-01-13T06:00:00Z  2022-01-13  06:00:00  251
2022-01-13T05:00:00Z  2022-01-13  05:00:00  95
2022-01-13T04:00:00Z  2022-01-13  04:00:00  44
2022-01-13T03:00:00Z  2022-01-13  03:00:00  35
2022-01-13T02:00:00Z  2022-01-13  02:00:00  28
2022-01-13T01:00:00Z  2022-01-13  01:00:00  6
2022-01-13T00:00:00Z  2022-01-13  00:00:00  -18
2022-01-12T23:00:00Z  2022-01-12  23:00:00  1800
2022-01-12T22:00:00Z  2022-01-12  22:00:00  2042
2022-01-12T21:00:00Z  2022-01-12  21:00:00  2358
2022-01-12T20:00:00Z  2022-01-12  20:00:00  2827
2022-01-12T19:00:00Z  2022-01-12  19:00:00  3681
2022-01-12T18:00:00Z  2022-01-12  18:00:00  4306
2022-01-12T17:00:00Z  2022-01-12  17:00:00  4377
2022-01-12T16:00:00Z  2022-01-12  16:00:00  4428
2022-01-12T15:00:00Z  2022-01-12  15:00:00  4424
2022-01-12T14:00:00Z  2022-01-12  14:00:00  4010
2022-01-12T13:00:00Z  2022-01-12  13:00:00  3826
2022-01-12T12:00:00Z  2022-01-12  12:00:00  3582
2022-01-12T11:00:00Z  2022-01-12  11:00:00  3323
2022-01-12T10:00:00Z  2022-01-12  10:00:00  2805
2022-01-12T09:00:00Z  2022-01-12  09:00:00  2159
2022-01-12T08:00:00Z  2022-01-12  08:00:00  1378
2022-01-12T07:00:00Z  2022-01-12  07:00:00  790
2022-01-12T06:00:00Z  2022-01-12  06:00:00  317
2022-01-12T05:00:00Z  2022-01-12  05:00:00  160
2022-01-12T04:00:00Z  2022-01-12  04:00:00  106
2022-01-12T03:00:00Z  2022-01-12  03:00:00  95
2022-01-12T02:00:00Z  2022-01-12  02:00:00  86
2022-01-12T01:00:00Z  2022-01-12  01:00:00  39
2022-01-12T00:00:00Z  2022-01-12  00:00:00  0
2022-01-11T23:00:00Z  2022-01-11  23:00:00  2032
2022-01-11T22:00:00Z  2022-01-11  22:00:00  2109
2022-01-11T21:00:00Z  2022-01-11  21:00:00  2362
2022-01-11T20:00:00Z  2022-01-11  20:00:00  2866
2022-01-11T19:00:00Z  2022-01-11  19:00:00  3948
2022-01-11T18:00:00Z  2022-01-11  18:00:00  4532
2022-01-11T17:00:00Z  2022-01-11  17:00:00  4590
2022-01-11T16:00:00Z  2022-01-11  16:00:00  4821
2022-01-11T15:00:00Z  2022-01-11  15:00:00  4770
2022-01-11T14:00:00Z  2022-01-11  14:00:00  4405
2022-01-11T13:00:00Z  2022-01-11  13:00:00  4040
2022-01-11T12:00:00Z  2022-01-11  12:00:00  3847
2022-01-11T11:00:00Z  2022-01-11  11:00:00  3414
2022-01-11T10:00:00Z  2022-01-11  10:00:00  2940
2022-01-11T09:00:00Z  2022-01-11  09:00:00  2105
2022-01-11T08:00:00Z  2022-01-11  08:00:00  1353
2022-01-11T07:00:00Z  2022-01-11  07:00:00  739
2022-01-11T06:00:00Z  2022-01-11  06:00:00  248
2022-01-11T05:00:00Z  2022-01-11  05:00:00  91
2022-01-11T04:00:00Z  2022-01-11  04:00:00  63
2022-01-11T03:00:00Z  2022-01-11  03:00:00  46
2022-01-11T02:00:00Z  2022-01-11  02:00:00  42
2022-01-11T01:00:00Z  2022-01-11  01:00:00  18
2022-01-11T00:00:00Z  2022-01-11  00:00:00  5
My expected result is:
FechaHora             Fecha       Hora      AforoTotal
2022-01-13T16:00:00Z  2022-01-13  16:00:00  4532
2022-01-12T16:00:00Z  2022-01-12  16:00:00  4428
2022-01-11T17:00:00Z  2022-01-11  17:00:00  4590
Consider the approach below:
select as value
array_agg(t order by AforoTotal desc limit 1)[offset(0)]
from your_table t
group by Fecha
If applied to the sample data in your question, the output is exactly the three expected rows.
Another way, which is a little bit more costly: it works when the (Fecha, MAX(AforoTotal)) combination is unique, and in the given example it is.
SELECT *
FROM your_table
WHERE Fecha || AforoTotal IN
  (SELECT Fecha || MAX(AforoTotal) FROM your_table GROUP BY Fecha);
Output (screenshot): https://i.stack.imgur.com/IFzWA.jpg
Thanks for your approach. This can be saved as a view in BigQuery and I can use it in Data Studio. I have not tested what happens when the combination is not unique; I will see how it behaves.
I think you can do something like this, though I haven't tested it (the DISTINCT matters: window functions alone return one row per input row):
SELECT DISTINCT
  LAST_VALUE(FechaHora) OVER w AS FechaHora,
  Fecha,
  LAST_VALUE(Hora) OVER w AS Hora,
  LAST_VALUE(AforoTotal) OVER w AS AforoTotal
FROM your_table
WINDOW w AS (
  PARTITION BY Fecha
  ORDER BY AforoTotal ASC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
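As an aside for pandas users, the same keep-the-max-row-per-day idea is a one-liner with groupby and idxmax. A sketch over a hypothetical miniature of the table:
import pandas as pd

df = pd.DataFrame({
    'FechaHora': pd.to_datetime(['2022-01-13T16:00:00Z', '2022-01-13T15:00:00Z',
                                 '2022-01-12T16:00:00Z', '2022-01-12T15:00:00Z']),
    'Fecha': ['2022-01-13', '2022-01-13', '2022-01-12', '2022-01-12'],
    'Hora': ['16:00:00', '15:00:00', '16:00:00', '15:00:00'],
    'AforoTotal': [4532, 4419, 4428, 4424],
})
# one row per Fecha: the row whose AforoTotal is the per-day maximum
top = df.loc[df.groupby('Fecha')['AforoTotal'].idxmax()]
print(top)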

Add 10 to 40 minutes randomly to a datetime column in pandas

I have a data frame as shown below
start
2010-01-06 09:00:00
2018-01-07 08:00:00
2012-01-08 11:00:00
2016-01-07 08:00:00
2010-02-06 14:00:00
2018-01-07 16:00:00
To the above df, I would like to add a column called 'finish' by adding a random number of minutes between 10 and 40 (drawn with replacement) to the start column.
Expected Output:
start finish
2010-01-06 09:00:00 2010-01-06 09:20:00
2018-01-07 08:00:00 2018-01-07 08:12:00
2012-01-08 11:00:00 2012-01-08 11:38:00
2016-01-07 08:00:00 2016-01-07 08:15:00
2010-02-06 14:00:00 2010-02-06 14:24:00
2018-01-07 16:00:00 2018-01-07 16:36:00
Create timedeltas with to_timedelta and numpy.random.randint; note that randint's upper bound is exclusive, so 41 is needed for 40 itself to be a possible draw:
import numpy as np
import pandas as pd

arr = np.random.randint(10, 41, size=len(df))  # integers 10..40 inclusive
df['finish'] = df['start'] + pd.to_timedelta(arr, unit='Min')
print(df)
start finish
0 2010-01-06 09:00:00 2010-01-06 09:25:00
1 2018-01-07 08:00:00 2018-01-07 08:30:00
2 2012-01-08 11:00:00 2012-01-08 11:29:00
3 2016-01-07 08:00:00 2016-01-07 08:12:00
4 2010-02-06 14:00:00 2010-02-06 14:31:00
5 2018-01-07 16:00:00 2018-01-07 16:39:00
You can also achieve it using pandas.Series.apply() in combination with pandas.to_timedelta() and random.randint(), whose bounds are both inclusive:
from random import randint
df['finish'] = df.start.apply(lambda dt: dt + pd.to_timedelta(randint(10, 40), unit='m'))
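If reproducibility matters, a seeded NumPy generator gives the same vectorized behaviour with an inclusive upper bound. A small sketch (the two sample rows are stand-ins for your df):
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': pd.to_datetime(['2010-01-06 09:00:00',
                                            '2018-01-07 08:00:00'])})
rng = np.random.default_rng(42)  # seeded for reproducible draws
mins = rng.integers(10, 40, size=len(df), endpoint=True)  # 10..40 inclusive
df['finish'] = df['start'] + pd.to_timedelta(mins, unit='min')
print(df)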

Remove Duplicate Time Zones From SELECT query

I need help removing duplicate time zone records from my select query, without deleting them from the table. The current result is as follows:
Employee ID # Presence Presence Start Time Presence End Time GMT Presence Start Time
691 Out Of Office 2020-02-01 04:30:00 2020-02-01 14:30:00 2020-02-01 12:30:00
691 Out Of Office 2020-02-01 05:30:00 2020-02-01 15:30:00 2020-02-01 12:30:00
691 Out Of Office 2020-02-01 07:30:00 2020-02-01 17:30:00 2020-02-01 12:30:00
691 Out Of Office 2020-02-01 13:30:00 2020-02-01 23:30:00 2020-02-01 12:30:00
691 Out Of Office 2020-02-01 20:30:00 2020-02-02 06:30:00 2020-02-01 12:30:00
435 Out Of Office 2020-02-01 00:15:00 2020-02-01 09:00:00 2020-01-31 16:15:00
5681 Out Of Office 2020-02-02 07:00:00 2020-02-02 15:45:00 2020-02-02 15:00:00
5681 Out Of Office 2020-02-02 08:00:00 2020-02-02 16:45:00 2020-02-02 15:00:00
5681 Out Of Office 2020-02-02 10:00:00 2020-02-02 18:45:00 2020-02-02 15:00:00
5681 Out Of Office 2020-02-02 16:00:00 2020-02-03 00:45:00 2020-02-02 15:00:00
5681 Out Of Office 2020-02-02 23:00:00 2020-02-03 07:45:00 2020-02-02 15:00:00
1927 Out Of Office 2020-02-02 07:00:00 2020-02-02 18:15:00 2020-02-02 15:00:00
1927 Out Of Office 2020-02-02 08:00:00 2020-02-02 19:15:00 2020-02-02 15:00:00
1927 Out Of Office 2020-02-02 10:00:00 2020-02-02 21:15:00 2020-02-02 15:00:00
1927 Out Of Office 2020-02-02 16:00:00 2020-02-03 03:15:00 2020-02-02 15:00:00
1927 Out Of Office 2020-02-02 23:00:00 2020-02-03 10:15:00 2020-02-02 15:00:00
The table returns duplicate GMT start times for the same employee; the database appears to be duplicating the results across different time zones.
I just want to remove the duplicate GMT Presence Start Times.
Employee ID # 691 should have 1 row, and the same goes for 5681 and 1927. Can someone please help?
You can use a GROUP BY clause to get unique values of GMT Presence Start Time for each employee's Presence in the log. You need an aggregation function on the Presence Start Time and Presence End Time columns; I've chosen MIN in my example, but you might want something else.
SELECT [Employee ID #],
       [Presence],
       MIN([Presence Start Time]),
       MIN([Presence End Time]),
       [GMT Presence Start Time]
FROM data
GROUP BY [Employee ID #], [Presence], [GMT Presence Start Time]
Output (for your sample data):
Employee ID # Presence Presence Start Time Presence End Time GMT Presence Start Time
435 Out Of Office 2020-02-01 00:15:00 2020-02-01 09:00:00 2020-01-31 16:15:00
691 Out Of Office 2020-02-01 04:30:00 2020-02-01 14:30:00 2020-02-01 12:30:00
1927 Out Of Office 2020-02-02 07:00:00 2020-02-02 18:15:00 2020-02-02 15:00:00
5681 Out Of Office 2020-02-02 07:00:00 2020-02-02 15:45:00 2020-02-02 15:00:00
Alternatively you can use window functions on the Presence Start and End times:
SELECT DISTINCT [Employee ID #],
[Presence],
FIRST_VALUE([Presence Start Time]) OVER (PARTITION BY [Employee ID #], [Presence], [GMT Presence Start Time] ORDER BY [Presence Start Time]) AS [Presence Start Time],
FIRST_VALUE([Presence End Time]) OVER (PARTITION BY [Employee ID #], [Presence], [GMT Presence Start Time] ORDER BY [Presence End Time]) AS [Presence End Time],
[GMT Presence Start Time]
FROM data
Output:
Employee ID # Presence Presence Start Time Presence End Time GMT Presence Start Time
435 Out Of Office 2020-02-01 00:15:00 2020-02-01 09:00:00 2020-01-31 16:15:00
691 Out Of Office 2020-02-01 04:30:00 2020-02-01 14:30:00 2020-02-01 12:30:00
1927 Out Of Office 2020-02-02 07:00:00 2020-02-02 18:15:00 2020-02-02 15:00:00
5681 Out Of Office 2020-02-02 07:00:00 2020-02-02 15:45:00 2020-02-02 15:00:00
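The same de-duplication can be sketched in pandas, keeping the earliest row per (employee, presence, GMT start time); the short column names here are hypothetical shorthand for the ones in the question:
import pandas as pd

df = pd.DataFrame({
    'employee_id': [691, 691, 5681, 5681],
    'presence': ['Out Of Office'] * 4,
    'start': pd.to_datetime(['2020-02-01 04:30:00', '2020-02-01 05:30:00',
                             '2020-02-02 07:00:00', '2020-02-02 08:00:00']),
    'gmt_start': pd.to_datetime(['2020-02-01 12:30:00', '2020-02-01 12:30:00',
                                 '2020-02-02 15:00:00', '2020-02-02 15:00:00']),
})
# like MIN(...) with GROUP BY: keep the earliest start per duplicate group
dedup = (df.sort_values('start')
           .drop_duplicates(['employee_id', 'presence', 'gmt_start']))
print(dedup)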

Doing a group by on the given data set for stored procedure

I have a table with sample data as below:
systemuid filename mindatetime maxdatetime
10006 monitor_7.dat 2019-06-05 03:06:18.001 AM 2019-06-06 03:06:11.0 AM
72111 monitor_4.dat 2019-04-28 09:00:00 AM 2019-04-29 11:00:00 AM
10006 monitor_5.dat 2019-04-28 07:00:00 AM 2019-04-28 10:00:00 AM
90204 monitor_7.dat 2019-05-24 03:06:11.001 AM 2019-06-05 03:06:18.0 AM
90204 monitor_4.dat 2019-04-28 09:30:00 AM 2019-04-29 11:00:00 PM
72111 monitor_7.dat 2019-04-21 03:06:26.0 AM 2019-05-21 03:06:10.0 AM
10006 monitor_5.dat 2019-04-28 02:00:00 PM 2019-04-28 06:00:00 PM
72111 monitor_7.dat 2019-05-12 07:00:10.001 AM 2019-05-13 10:00:10.000 AM
90204 monitor_5.dat 2019-04-28 09:00:00 AM 2019-04-28 03:00:00 PM
10006 monitor_7.dat 2019-05-15 09:30:10.001 AM 2019-05-18 11:30:10.000 AM
72111 monitor_4.dat 2019-04-28 07:00:00 AM 2019-04-29 11:00:00 AM
10006 monitor_7.dat 2019-05-21 03:06:10.001 AM 2019-05-24 03:06:11.0 AM
I want to organize the data by grouping on systemuid and filename and then ordering by mindatetime, maxdatetime. Each systemuid has multiple filenames, and each filename has multiple timestamps.
systemuid filename mindatetime maxdatetime
10006 monitor_5.dat 2019-04-28 07:00:00 AM 2019-04-28 10:00:00 AM
10006 monitor_5.dat 2019-04-28 02:00:00 PM 2019-04-28 06:00:00 PM
10006 monitor_7.dat 2019-05-15 09:30:10.001 AM 2019-05-18 11:30:10.000 AM
10006 monitor_7.dat 2019-05-21 03:06:10.001 AM 2019-05-24 03:06:11.0 AM
10006 monitor_7.dat 2019-06-05 03:06:18.001 AM 2019-06-06 03:06:11.0 AM
72111 monitor_4.dat 2019-04-28 07:00:00 AM 2019-04-29 11:00:00 AM
72111 monitor_4.dat 2019-04-28 09:00:00 AM 2019-04-29 11:00:00 AM
72111 monitor_7.dat 2019-04-21 03:06:26.0 AM 2019-05-21 03:06:10.0 AM
72111 monitor_7.dat 2019-05-12 07:00:10.001 AM 2019-05-13 10:00:10.000 AM
90204 monitor_4.dat 2019-04-28 09:30:00 AM 2019-04-29 11:00:00 PM
90204 monitor_5.dat 2019-04-28 09:00:00 AM 2019-04-28 03:00:00 PM
90204 monitor_7.dat 2019-05-24 03:06:11.001 AM 2019-06-05 03:06:18.0 AM
I need this as a cursor for my stored procedure, so I need the data in this format to run the per-record logic. The table is pretty huge, with millions of records.
Just use ORDER BY; no GROUP BY is needed, because ordering by systemuid and filename already keeps each group's rows together:
SELECT systemuid, filename, mindatetime, maxdatetime
FROM your_table
ORDER BY systemuid, filename, mindatetime, maxdatetime;
If you are concerned about performance on large datasets, be sure you have an index on (systemuid, filename, mindatetime, maxdatetime).
Regardless of whether or not you have the index, it is probably faster to do the ordering in the database rather than in the application.
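If the records end up in pandas rather than a database cursor, the equivalent ordering is a single sort_values call. A sketch with hypothetical sample rows taken from the question:
import pandas as pd

df = pd.DataFrame({
    'systemuid': [10006, 72111, 10006],
    'filename': ['monitor_7.dat', 'monitor_4.dat', 'monitor_5.dat'],
    'mindatetime': pd.to_datetime(['2019-06-05 03:06:18', '2019-04-28 09:00:00',
                                   '2019-04-28 07:00:00']),
    'maxdatetime': pd.to_datetime(['2019-06-06 03:06:11', '2019-04-29 11:00:00',
                                   '2019-04-28 10:00:00']),
})
# same ordering as the SQL: the grouped blocks fall out of the multi-key sort
ordered = df.sort_values(['systemuid', 'filename', 'mindatetime', 'maxdatetime'])
print(ordered)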