I have a data frame as shown below
start
2010-01-06 09:00:00
2018-01-07 08:00:00
2012-01-08 11:00:00
2016-01-07 08:00:00
2010-02-06 14:00:00
2018-01-07 16:00:00
To the above df, I would like to add a column called 'finish' by adding a random number of minutes between 10 and 40 (drawn with replacement) to the start column.
Expected Output:
start finish
2010-01-06 09:00:00 2010-01-06 09:20:00
2018-01-07 08:00:00 2018-01-07 08:12:00
2012-01-08 11:00:00 2012-01-08 11:38:00
2016-01-07 08:00:00 2016-01-07 08:15:00
2010-02-06 14:00:00 2010-02-06 14:24:00
2018-01-07 16:00:00 2018-01-07 16:36:00
Create the timedeltas with to_timedelta and draw the integers with numpy.random.randint (note that randint's upper bound is exclusive, so pass 41 if 40 itself should be possible):
import numpy as np
import pandas as pd

arr = np.random.randint(10, 40, size=len(df))  # upper bound exclusive: draws 10..39
df['finish'] = df['start'] + pd.to_timedelta(arr, unit='min')
print (df)
start finish
0 2010-01-06 09:00:00 2010-01-06 09:25:00
1 2018-01-07 08:00:00 2018-01-07 08:30:00
2 2012-01-08 11:00:00 2012-01-08 11:29:00
3 2016-01-07 08:00:00 2016-01-07 08:12:00
4 2010-02-06 14:00:00 2010-02-06 14:31:00
5 2018-01-07 16:00:00 2018-01-07 16:39:00
You can also achieve it with pandas.Series.apply() combined with pandas.to_timedelta() and random.randint() (the stdlib randint is inclusive on both ends, though apply is slower than the vectorized approach above on large frames):
from random import randint
df['finish'] = df.start.apply(lambda dt: dt + pd.to_timedelta(randint(10, 40), unit='m'))
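For reference, here is a self-contained sketch of the vectorized approach, with a fixed seed so the run is reproducible; the sample timestamps are the ones from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': pd.to_datetime([
    '2010-01-06 09:00:00', '2018-01-07 08:00:00', '2012-01-08 11:00:00',
    '2016-01-07 08:00:00', '2010-02-06 14:00:00', '2018-01-07 16:00:00'])})

rng = np.random.default_rng(0)            # fixed seed for reproducibility
arr = rng.integers(10, 41, size=len(df))  # upper bound exclusive: draws 10..40
df['finish'] = df['start'] + pd.to_timedelta(arr, unit='min')
print (df)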
I have a table with hourly data. I want to get a table with only one row per day: the row holding the maximum value of the AforoTotal column.
This is part of the table, containing the records of three days.
FechaHora             Fecha       Hora      AforoTotal
2022-01-13T16:00:00Z  2022-01-13  16:00:00  4532
2022-01-13T15:00:00Z  2022-01-13  15:00:00  4419
2022-01-13T14:00:00Z  2022-01-13  14:00:00  4181
2022-01-13T13:00:00Z  2022-01-13  13:00:00  3914
2022-01-13T12:00:00Z  2022-01-13  12:00:00  3694
2022-01-13T11:00:00Z  2022-01-13  11:00:00  3268
2022-01-13T10:00:00Z  2022-01-13  10:00:00  2869
2022-01-13T09:00:00Z  2022-01-13  09:00:00  2065
2022-01-13T08:00:00Z  2022-01-13  08:00:00  1308
2022-01-13T07:00:00Z  2022-01-13  07:00:00  730
2022-01-13T06:00:00Z  2022-01-13  06:00:00  251
2022-01-13T05:00:00Z  2022-01-13  05:00:00  95
2022-01-13T04:00:00Z  2022-01-13  04:00:00  44
2022-01-13T03:00:00Z  2022-01-13  03:00:00  35
2022-01-13T02:00:00Z  2022-01-13  02:00:00  28
2022-01-13T01:00:00Z  2022-01-13  01:00:00  6
2022-01-13T00:00:00Z  2022-01-13  00:00:00  -18
2022-01-12T23:00:00Z  2022-01-12  23:00:00  1800
2022-01-12T22:00:00Z  2022-01-12  22:00:00  2042
2022-01-12T21:00:00Z  2022-01-12  21:00:00  2358
2022-01-12T20:00:00Z  2022-01-12  20:00:00  2827
2022-01-12T19:00:00Z  2022-01-12  19:00:00  3681
2022-01-12T18:00:00Z  2022-01-12  18:00:00  4306
2022-01-12T17:00:00Z  2022-01-12  17:00:00  4377
2022-01-12T16:00:00Z  2022-01-12  16:00:00  4428
2022-01-12T15:00:00Z  2022-01-12  15:00:00  4424
2022-01-12T14:00:00Z  2022-01-12  14:00:00  4010
2022-01-12T13:00:00Z  2022-01-12  13:00:00  3826
2022-01-12T12:00:00Z  2022-01-12  12:00:00  3582
2022-01-12T11:00:00Z  2022-01-12  11:00:00  3323
2022-01-12T10:00:00Z  2022-01-12  10:00:00  2805
2022-01-12T09:00:00Z  2022-01-12  09:00:00  2159
2022-01-12T08:00:00Z  2022-01-12  08:00:00  1378
2022-01-12T07:00:00Z  2022-01-12  07:00:00  790
2022-01-12T06:00:00Z  2022-01-12  06:00:00  317
2022-01-12T05:00:00Z  2022-01-12  05:00:00  160
2022-01-12T04:00:00Z  2022-01-12  04:00:00  106
2022-01-12T03:00:00Z  2022-01-12  03:00:00  95
2022-01-12T02:00:00Z  2022-01-12  02:00:00  86
2022-01-12T01:00:00Z  2022-01-12  01:00:00  39
2022-01-12T00:00:00Z  2022-01-12  00:00:00  0
2022-01-11T23:00:00Z  2022-01-11  23:00:00  2032
2022-01-11T22:00:00Z  2022-01-11  22:00:00  2109
2022-01-11T21:00:00Z  2022-01-11  21:00:00  2362
2022-01-11T20:00:00Z  2022-01-11  20:00:00  2866
2022-01-11T19:00:00Z  2022-01-11  19:00:00  3948
2022-01-11T18:00:00Z  2022-01-11  18:00:00  4532
2022-01-11T17:00:00Z  2022-01-11  17:00:00  4590
2022-01-11T16:00:00Z  2022-01-11  16:00:00  4821
2022-01-11T15:00:00Z  2022-01-11  15:00:00  4770
2022-01-11T14:00:00Z  2022-01-11  14:00:00  4405
2022-01-11T13:00:00Z  2022-01-11  13:00:00  4040
2022-01-11T12:00:00Z  2022-01-11  12:00:00  3847
2022-01-11T11:00:00Z  2022-01-11  11:00:00  3414
2022-01-11T10:00:00Z  2022-01-11  10:00:00  2940
2022-01-11T09:00:00Z  2022-01-11  09:00:00  2105
2022-01-11T08:00:00Z  2022-01-11  08:00:00  1353
2022-01-11T07:00:00Z  2022-01-11  07:00:00  739
2022-01-11T06:00:00Z  2022-01-11  06:00:00  248
2022-01-11T05:00:00Z  2022-01-11  05:00:00  91
2022-01-11T04:00:00Z  2022-01-11  04:00:00  63
2022-01-11T03:00:00Z  2022-01-11  03:00:00  46
2022-01-11T02:00:00Z  2022-01-11  02:00:00  42
2022-01-11T01:00:00Z  2022-01-11  01:00:00  18
2022-01-11T00:00:00Z  2022-01-11  00:00:00  5
My expected result is:
FechaHora             Fecha       Hora      AforoTotal
2022-01-13T16:00:00Z  2022-01-13  16:00:00  4532
2022-01-12T16:00:00Z  2022-01-12  16:00:00  4428
2022-01-11T17:00:00Z  2022-01-11  17:00:00  4590
Consider the approach below:
select as value
array_agg(t order by AforoTotal desc limit 1)[offset(0)]
from your_table t
group by Fecha
Applied to the sample data in your question, it returns the row with the highest AforoTotal for each Fecha.
Another way, which is a little more costly:
It works when the (Fecha, MAX(AforoTotal)) combination is unique; in the given example it is.
SELECT * FROM your_table
WHERE Fecha || AforoTotal IN
  (SELECT Fecha || MAX(AforoTotal) FROM your_table GROUP BY Fecha);
-- if Fecha/AforoTotal are not string columns, they may need CAST(... AS STRING) for ||
Output: https://i.stack.imgur.com/IFzWA.jpg
Thanks for your approach. This can be saved as a view in BigQuery and used in Data Studio. I have not tested what happens when the combination is not unique; I will see how it behaves.
I think you can do something like this, though I haven't tested it:
SELECT
  LAST_VALUE(FechaHora) OVER w AS FechaHora,
  Fecha,
  LAST_VALUE(Hora) OVER w AS Hora,
  LAST_VALUE(AforoTotal) OVER w AS AforoTotal
FROM your_table
WINDOW w AS (PARTITION BY Fecha ORDER BY AforoTotal ASC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
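Since the other questions in this thread use pandas, here is, for comparison, a minimal pandas sketch of the same per-day selection, assuming the table has been loaded into a DataFrame with the same column names; groupby plus idxmax picks the row label of the maximum AforoTotal per Fecha:
import pandas as pd

# hypothetical frame standing in for your_table (only a few rows shown)
df = pd.DataFrame({
    'FechaHora':  ['2022-01-13T16:00:00Z', '2022-01-13T15:00:00Z', '2022-01-12T16:00:00Z'],
    'Fecha':      ['2022-01-13', '2022-01-13', '2022-01-12'],
    'Hora':       ['16:00:00', '15:00:00', '16:00:00'],
    'AforoTotal': [4532, 4419, 4428],
})

out = df.loc[df.groupby('Fecha')['AforoTotal'].idxmax()]
print (out)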
I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add a period column based only on datetime: the period number should increase whenever the gap to the previous row exceeds 2 hours.
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing, but based on both ID and datetime:
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get the consecutive differences with Series.diff, convert them to hours via Series.dt.total_seconds, compare against 2, and take the cumulative sum (adding 1 so periods start at 1):
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
The same idea, applied per group:
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3
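For completeness, a sketch of how the sample frame can be constructed so that both snippets above run as-is; note the datetime strings parse month-first by default, which matches the printed output:
import pandas as pd

df = pd.DataFrame({
    'ID': [11, 11, 11, 12, 11],
    'datetime': pd.to_datetime([
        '01-09-2021 10:00:00', '01-09-2021 10:15:15', '01-09-2021 15:00:00',
        '01-09-2021 15:10:00', '01-09-2021 18:00:00']),  # parsed as 2021-01-09
})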
I have a data frame as shown below
ID ideal_appt_time service_time
1 2020-01-06 09:00:00 22
2 2020-01-06 09:30:00 15
1 2020-01-08 14:00:00 42
2 2020-01-12 01:30:00 5
I would like to add service_time (in minutes) to ideal_appt_time and create a new column called finish.
Expected Output:
ID ideal_appt_time service_time finish
1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
2 2020-01-12 01:30:00 35 2020-01-12 02:05:00
Use to_timedelta to convert the column to minute-based timedeltas, then add them to the datetimes:
df['ideal_appt_time'] = pd.to_datetime(df['ideal_appt_time'])
df['finish'] = df['ideal_appt_time'] + pd.to_timedelta(df['service_time'], unit='Min')
print (df)
ID ideal_appt_time service_time finish
0 1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2 2020-01-12 01:30:00 5 2020-01-12 01:35:00
Data
df=pd.DataFrame({'ideal_appt_time':['2020-01-06 09:00:00','2020-01-06 09:30:00','2020-01-08 14:00:00','2020-01-12 01:30:00'],'service_time':[22,15,42,35]})
Another way:
df['finish'] = pd.to_datetime(df['ideal_appt_time']).add(df['service_time'].astype('timedelta64[m]'))
Note that casting integers with astype('timedelta64[m]') is no longer supported in recent pandas versions; pd.to_timedelta(df['service_time'], unit='min') is the portable spelling.
df
ideal_appt_time service_time finish
0 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2020-01-12 01:30:00 35 2020-01-12 02:05:00
I would like to spread the values of the 15-minute intervals evenly over 5-minute intervals, but cannot get it to work. The data is:
Datetime a
2018-01-01 00:00:00 6
2018-01-01 00:15:00 3
2018-01-01 00:30:00 9
Desired output would be:
Datetime a
2018-01-01 00:00:00 2
2018-01-01 00:05:00 2
2018-01-01 00:10:00 2
2018-01-01 00:15:00 1
2018-01-01 00:20:00 1
2018-01-01 00:25:00 1
2018-01-01 00:30:00 3
2018-01-01 00:35:00 3
2018-01-01 00:40:00 3
Perhaps unnecessarily explicit: the value 6 at 00:00:00 in the data is spread over the intervals 00:00:00 through 00:10:00.
Slightly different approach:
# convert to datetime
df.Datetime = pd.to_datetime(df.Datetime)
# set Datetime as index
df.set_index('Datetime', inplace=True)
# add one extra row so the last value also covers three 5-minute slots
df.loc[df.index.max() + pd.to_timedelta('10min')] = 0
# conform to a 5-minute frequency, filling the new slots with 0
s = df.asfreq('5min', fill_value=0)
# group each nonzero value with its trailing zeros and replace by the group mean
(s.groupby(s['a'].ne(0).cumsum())
  .transform('mean')
  .reset_index())
Output:
Datetime a
0 2018-01-01 00:00:00 2
1 2018-01-01 00:05:00 2
2 2018-01-01 00:10:00 2
3 2018-01-01 00:15:00 1
4 2018-01-01 00:20:00 1
5 2018-01-01 00:25:00 1
6 2018-01-01 00:30:00 3
7 2018-01-01 00:35:00 3
8 2018-01-01 00:40:00 3
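An alternative sketch: because every 15-minute value splits into exactly three 5-minute buckets, you can also build the full 5-minute index up front, forward-fill, and divide by 3. This assumes the regular 15-minute spacing shown in the question, and the result holds floats rather than ints:
import pandas as pd

df = pd.DataFrame({'a': [6, 3, 9]},
                  index=pd.to_datetime(['2018-01-01 00:00:00',
                                        '2018-01-01 00:15:00',
                                        '2018-01-01 00:30:00']))
df.index.name = 'Datetime'

# full 5-minute index, extended 10 minutes past the last stamp
target = pd.date_range(df.index.min(),
                       df.index.max() + pd.Timedelta('10min'), freq='5min')
# forward-fill each 15-minute value into its three buckets, then split it evenly
out = df.reindex(target, method='ffill').div(3)
print (out)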
I have the following dataframe df.
id start finish location
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H
I want to find the id_start_date and id_end_date for every id.
In the above example, every row has start and finish dates. I want two new columns, id_start_date and id_end_date. In the id_start_date column, I want the earliest date in the start column for each id. This is easy: I can sort by id and start and pick the first start date per id, or group by id and aggregate with the minimum of the start column. For id_end_date I can do the same, grouping by id and aggregating with the maximum of the finish column.
df1 = df.sort_values(['id','start'],ascending=True)
gp = df1.groupby('id')
gp_out = gp.agg({'start': {'mindate': np.min}, 'finish': {'maxdate': np.max}})
When I print gp_out, it shows the correct dates, but how would I write them back to the original dataframe df? I expect the following:
id start finish location id_start_date id_end_date
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A 2015-12-14 16:44:00 2015-12-16 20:00:00
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B 2015-12-14 16:44:00 2015-12-16 20:00:00
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C 2015-12-14 16:44:00 2015-12-16 20:00:00
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B 2015-12-10 13:15:00 2015-12-12 19:45:00
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A 2015-12-10 13:15:00 2015-12-12 19:45:00
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A 2015-12-15 07:45:00 2015-12-20 09:00:00
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D 2015-12-15 07:45:00 2015-12-20 09:00:00
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C 2015-12-15 07:45:00 2015-12-20 09:00:00
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H 2015-12-15 07:45:00 2015-12-20 09:00:00
How can I get the last two columns into the original dataframe df?
Using transform:
g=df.groupby('id')
df['id_start_date']=g['start'].transform('min')
df['id_end_date']=g['finish'].transform('max')
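An equivalent alternative sketch: aggregate once with named aggregation and join the result back on id, which also avoids the nested-dict agg spelling from the question (removed in modern pandas). The sample rows below are a hypothetical subset with the same shape as the question's frame:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2],
    'start':  pd.to_datetime(['2015-12-14 16:44:00', '2015-12-15 18:00:00',
                              '2015-12-10 13:15:00']),
    'finish': pd.to_datetime(['2015-12-15 18:00:00', '2015-12-16 20:00:00',
                              '2015-12-12 13:45:00']),
})

# one row per id, then broadcast back onto the original rows via join
agg = df.groupby('id').agg(id_start_date=('start', 'min'),
                           id_end_date=('finish', 'max'))
df = df.join(agg, on='id')
print (df)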