Find the earliest and latest dates between two columns - pandas

I have the following dataframe df.
id start finish location
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H
I want to find the id_start_date and id_end_date for every id.
In the example above, every row has a start and a finish date. I want two new columns, id_start_date and id_end_date. In id_start_date, I want the earliest date in the start column for each id. This part is easy: I can sort the data by id and start and pick the first start date within every id, or I can group by id and aggregate with the minimum of the start column. For id_end_date, I can do the same: group by id and aggregate with the maximum of the finish column.
df1 = df.sort_values(['id', 'start'], ascending=True)
gp = df1.groupby('id')
# named aggregation; the older nested-dict renaming ({'start': {'mindate': np.min}}) was removed in pandas 1.0
gp_out = gp.agg(mindate=('start', 'min'), maxdate=('finish', 'max'))
When I print gp_out, it shows the correct dates, but how would I write them back to the original dataframe df? I expect the following:
id start finish location id_start_date id_end_date
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A 2015-12-14 16:44:00 2015-12-16 20:00:00
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B 2015-12-14 16:44:00 2015-12-16 20:00:00
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C 2015-12-14 16:44:00 2015-12-16 20:00:00
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B 2015-12-10 13:15:00 2015-12-12 19:45:00
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A 2015-12-10 13:15:00 2015-12-12 19:45:00
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A 2015-12-15 07:45:00 2015-12-20 09:00:00
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D 2015-12-15 07:45:00 2015-12-20 09:00:00
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C 2015-12-15 07:45:00 2015-12-20 09:00:00
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H 2015-12-15 07:45:00 2015-12-20 09:00:00
How can I get the last two columns into the original dataframe df?

Using transform, which broadcasts each group's aggregate back onto every row of that group:
g = df.groupby('id')
df['id_start_date'] = g['start'].transform('min')
df['id_end_date'] = g['finish'].transform('max')
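Equivalently, you can take the groupby/agg route from the question and merge the result back onto df; a minimal sketch, assuming start and finish are already datetime columns:
agg = (df.groupby('id')
         .agg(id_start_date=('start', 'min'), id_end_date=('finish', 'max'))
         .reset_index())
df = df.merge(agg, on='id')  # attaches the per-id min/max to every row of that id
transform is the more direct tool here, but the merge form is handy when you need several per-id aggregates at once.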

Related

Get max data for every day in BigQuery

I have a table with hourly data. I want to get a table with only one row per day: the row holding the maximum value of the column AforoTotal.
This is a part of the table, containing the records of three days.
FechaHora             Fecha       Hora      AforoTotal
2022-01-13T16:00:00Z  2022-01-13  16:00:00  4532
2022-01-13T15:00:00Z  2022-01-13  15:00:00  4419
2022-01-13T14:00:00Z  2022-01-13  14:00:00  4181
2022-01-13T13:00:00Z  2022-01-13  13:00:00  3914
2022-01-13T12:00:00Z  2022-01-13  12:00:00  3694
2022-01-13T11:00:00Z  2022-01-13  11:00:00  3268
2022-01-13T10:00:00Z  2022-01-13  10:00:00  2869
2022-01-13T09:00:00Z  2022-01-13  09:00:00  2065
2022-01-13T08:00:00Z  2022-01-13  08:00:00  1308
2022-01-13T07:00:00Z  2022-01-13  07:00:00  730
2022-01-13T06:00:00Z  2022-01-13  06:00:00  251
2022-01-13T05:00:00Z  2022-01-13  05:00:00  95
2022-01-13T04:00:00Z  2022-01-13  04:00:00  44
2022-01-13T03:00:00Z  2022-01-13  03:00:00  35
2022-01-13T02:00:00Z  2022-01-13  02:00:00  28
2022-01-13T01:00:00Z  2022-01-13  01:00:00  6
2022-01-13T00:00:00Z  2022-01-13  00:00:00  -18
2022-01-12T23:00:00Z  2022-01-12  23:00:00  1800
2022-01-12T22:00:00Z  2022-01-12  22:00:00  2042
2022-01-12T21:00:00Z  2022-01-12  21:00:00  2358
2022-01-12T20:00:00Z  2022-01-12  20:00:00  2827
2022-01-12T19:00:00Z  2022-01-12  19:00:00  3681
2022-01-12T18:00:00Z  2022-01-12  18:00:00  4306
2022-01-12T17:00:00Z  2022-01-12  17:00:00  4377
2022-01-12T16:00:00Z  2022-01-12  16:00:00  4428
2022-01-12T15:00:00Z  2022-01-12  15:00:00  4424
2022-01-12T14:00:00Z  2022-01-12  14:00:00  4010
2022-01-12T13:00:00Z  2022-01-12  13:00:00  3826
2022-01-12T12:00:00Z  2022-01-12  12:00:00  3582
2022-01-12T11:00:00Z  2022-01-12  11:00:00  3323
2022-01-12T10:00:00Z  2022-01-12  10:00:00  2805
2022-01-12T09:00:00Z  2022-01-12  09:00:00  2159
2022-01-12T08:00:00Z  2022-01-12  08:00:00  1378
2022-01-12T07:00:00Z  2022-01-12  07:00:00  790
2022-01-12T06:00:00Z  2022-01-12  06:00:00  317
2022-01-12T05:00:00Z  2022-01-12  05:00:00  160
2022-01-12T04:00:00Z  2022-01-12  04:00:00  106
2022-01-12T03:00:00Z  2022-01-12  03:00:00  95
2022-01-12T02:00:00Z  2022-01-12  02:00:00  86
2022-01-12T01:00:00Z  2022-01-12  01:00:00  39
2022-01-12T00:00:00Z  2022-01-12  00:00:00  0
2022-01-11T23:00:00Z  2022-01-11  23:00:00  2032
2022-01-11T22:00:00Z  2022-01-11  22:00:00  2109
2022-01-11T21:00:00Z  2022-01-11  21:00:00  2362
2022-01-11T20:00:00Z  2022-01-11  20:00:00  2866
2022-01-11T19:00:00Z  2022-01-11  19:00:00  3948
2022-01-11T18:00:00Z  2022-01-11  18:00:00  4532
2022-01-11T17:00:00Z  2022-01-11  17:00:00  4590
2022-01-11T16:00:00Z  2022-01-11  16:00:00  4821
2022-01-11T15:00:00Z  2022-01-11  15:00:00  4770
2022-01-11T14:00:00Z  2022-01-11  14:00:00  4405
2022-01-11T13:00:00Z  2022-01-11  13:00:00  4040
2022-01-11T12:00:00Z  2022-01-11  12:00:00  3847
2022-01-11T11:00:00Z  2022-01-11  11:00:00  3414
2022-01-11T10:00:00Z  2022-01-11  10:00:00  2940
2022-01-11T09:00:00Z  2022-01-11  09:00:00  2105
2022-01-11T08:00:00Z  2022-01-11  08:00:00  1353
2022-01-11T07:00:00Z  2022-01-11  07:00:00  739
2022-01-11T06:00:00Z  2022-01-11  06:00:00  248
2022-01-11T05:00:00Z  2022-01-11  05:00:00  91
2022-01-11T04:00:00Z  2022-01-11  04:00:00  63
2022-01-11T03:00:00Z  2022-01-11  03:00:00  46
2022-01-11T02:00:00Z  2022-01-11  02:00:00  42
2022-01-11T01:00:00Z  2022-01-11  01:00:00  18
2022-01-11T00:00:00Z  2022-01-11  00:00:00  5
My expected result is:
FechaHora             Fecha       Hora      AforoTotal
2022-01-13T16:00:00Z  2022-01-13  16:00:00  4532
2022-01-12T16:00:00Z  2022-01-12  16:00:00  4428
2022-01-11T16:00:00Z  2022-01-11  16:00:00  4821
Consider the approach below:
select as value
array_agg(t order by AforoTotal desc limit 1)[offset(0)]
from your_table t
group by Fecha
Applied to the sample data in your question, this returns one row per Fecha: array_agg(t order by AforoTotal desc limit 1) keeps the whole top row for each date, [offset(0)] unwraps it, and select as value yields it back as a full-width row.
Another way, which is a little more costly. It works when the (Fecha, MAX(AforoTotal)) combination is unique; in the given example it is.
-- the CASTs keep the || concatenation valid when Fecha is a DATE and AforoTotal an INT64
SELECT * FROM your_table
WHERE CAST(Fecha AS STRING) || CAST(AforoTotal AS STRING)
IN
(SELECT CAST(Fecha AS STRING) || CAST(MAX(AforoTotal) AS STRING) FROM your_table GROUP BY Fecha);
Output: https://i.stack.imgur.com/IFzWA.jpg
Thanks for your approach. This can be saved as a view in BigQuery and used in Data Studio. I have not tested what happens when the combination is not unique; I will see how it behaves.
I think you can do something like this, though I haven't tested it:
SELECT DISTINCT
  LAST_VALUE(FechaHora) OVER w AS FechaHora,
  Fecha,
  LAST_VALUE(Hora) OVER w AS Hora,
  LAST_VALUE(AforoTotal) OVER w AS AforoTotal
FROM your_table
-- DISTINCT collapses the per-row window results down to one row per Fecha
WINDOW w AS (PARTITION BY Fecha ORDER BY AforoTotal ASC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Pandas: create a period based on date column

I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add a period column based on datetime alone: the period number should increase whenever the gap from the previous row exceeds 2 hours.
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing, but based on ID as well as datetime:
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get the difference with Series.diff, convert it to hours via Series.dt.total_seconds, compare against 2 and add a cumulative sum:
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
Similar idea per groups:
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3
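Both snippets assume datetime already has a datetime64 dtype. If the column is read in as strings like 01-09-2021 10:00:00 (which the printed output shows pandas parsed month-first), a conversion with an explicit format avoids the day/month ambiguity; the format string here is an assumption based on that output:
# parse the strings once before computing diffs; '%m-%d-%Y' matches the month-first output above
df['datetime'] = pd.to_datetime(df['datetime'], format='%m-%d-%Y %H:%M:%S')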

Add a column value with the other date time column at minutes level in pandas

I have a data frame as shown below
ID ideal_appt_time service_time
1 2020-01-06 09:00:00 22
2 2020-01-06 09:30:00 15
1 2020-01-08 14:00:00 42
2 2020-01-12 01:30:00 35
I would like to add service_time (in minutes) to ideal_appt_time and create a new column called finish.
Expected Output:
ID ideal_appt_time service_time finish
1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
2 2020-01-12 01:30:00 35 2020-01-12 02:05:00
Use to_timedelta to convert the column to minute-based timedeltas and add them to the datetimes:
df['ideal_appt_time'] = pd.to_datetime(df['ideal_appt_time'])
df['finish'] = df['ideal_appt_time'] + pd.to_timedelta(df['service_time'], unit='Min')
print (df)
ID ideal_appt_time service_time finish
0 1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2 2020-01-12 01:30:00 35 2020-01-12 02:05:00
Data
df=pd.DataFrame({'ideal_appt_time':['2020-01-06 09:00:00','2020-01-06 09:30:00','2020-01-08 14:00:00','2020-01-12 01:30:00'],'service_time':[22,15,42,35]})
Another way:
# note: astype('timedelta64[m]') on an integer column is deprecated in recent pandas
df['finish'] = pd.to_datetime(df['ideal_appt_time']).add(df['service_time'].astype('timedelta64[m]'))
df
ideal_appt_time service_time finish
0 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2020-01-12 01:30:00 35 2020-01-12 02:05:00
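If that astype spelling warns or raises on your pandas version (behavior varies across releases, so treat this as a hedged fallback), the to_timedelta route from the first answer is the equivalent safe form:
df['finish'] = pd.to_datetime(df['ideal_appt_time']) + pd.to_timedelta(df['service_time'], unit='m')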

Add 10 to 40 minutes randomly to a datetime column in pandas

I have a data frame as shown below
start
2010-01-06 09:00:00
2018-01-07 08:00:00
2012-01-08 11:00:00
2016-01-07 08:00:00
2010-02-06 14:00:00
2018-01-07 16:00:00
To the above df, I would like to add a column called 'finish' by adding between 10 and 40 minutes, drawn randomly with replacement, to the start column.
Expected Output:
start finish
2010-01-06 09:00:00 2010-01-06 09:20:00
2018-01-07 08:00:00 2018-01-07 08:12:00
2012-01-08 11:00:00 2012-01-08 11:38:00
2016-01-07 08:00:00 2016-01-07 08:15:00
2010-02-06 14:00:00 2010-02-06 14:24:00
2018-01-07 16:00:00 2018-01-07 16:36:00
Create timedeltas with to_timedelta and numpy.random.randint for random integers between 10 and 40 (randint's upper bound is exclusive, so pass 41 to include 40):
import numpy as np

arr = np.random.randint(10, 41, size=len(df))
df['finish'] = df['start'] + pd.to_timedelta(arr, unit='Min')
print (df)
start finish
0 2010-01-06 09:00:00 2010-01-06 09:25:00
1 2018-01-07 08:00:00 2018-01-07 08:30:00
2 2012-01-08 11:00:00 2012-01-08 11:29:00
3 2016-01-07 08:00:00 2016-01-07 08:12:00
4 2010-02-06 14:00:00 2010-02-06 14:31:00
5 2018-01-07 16:00:00 2018-01-07 16:39:00
You can achieve it by using pandas.Series.apply() in combination with pandas.to_timedelta() and random.randint().
from random import randint
df['finish'] = df.start.apply(lambda dt: dt + pd.to_timedelta(randint(10, 40), unit='m'))
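If you need the draws to be reproducible (for tests, say), a seeded sketch using NumPy's Generator API is a small change; the seed value 42 is arbitrary:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)               # fixed seed makes the offsets repeatable
offsets = rng.integers(10, 41, size=len(df))  # integers in [10, 40]
df['finish'] = df['start'] + pd.to_timedelta(offsets, unit='m')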

count jumps from one location to another based on conditions

I have the following dataframe.
id start finish location
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H
9 4 2015-12-09 10:45:00 2015-12-13 12:20:00 E
10 4 2015-12-13 12:20:00 2015-12-13 18:20:00 A
11 4 2015-12-13 18:20:00 2015-12-13 23:40:00 A
12 4 2015-12-13 23:40:00 2015-12-16 08:00:00 B
13 5 2015-12-07 08:00:00 2015-12-13 12:25:00 H
I want to count jumps from one location to another within every id. For these jump counts, I first want to compare the finish date and time of a row with the start date and time of the next row of the same id. If they match, I want a count of 1, otherwise 0. What I want to obtain is the following:
id start count
0 1 2015-12-14 16:44:00 1
1 1 2015-12-15 18:00:00 1
2 1 2015-12-16 13:00:00 0
3 2 2015-12-10 13:15:00 1
4 2 2015-12-12 13:45:00 0
5 3 2015-12-15 07:45:00 1
6 3 2015-12-15 18:45:00 1
7 3 2015-12-18 07:15:00 1
8 3 2015-12-19 10:45:00 0
9 4 2015-12-09 10:45:00 1
10 4 2015-12-13 12:20:00 1
11 4 2015-12-13 18:20:00 1
12 4 2015-12-13 23:40:00 0
13 5 2015-12-07 08:00:00 0
Once I have that, I would like to sum the counts based on date to get something like the following:
date count_sum
2015-12-07 0
2015-12-09 1
2015-12-10 1
2015-12-12 0
2015-12-13 2
2015-12-14 1
2015-12-15 3
2015-12-16 0
2015-12-18 1
2015-12-19 0
For me, the last part is easy: group by date and use .sum() to add up all the counts on each date. But how to get the first part, where the actual jumps are counted, is not clear. Any help will be appreciated.
Your data already appears to be sorted by 'start', so you can just group by id and check whether the finish time equals the start time of the next row using pandas.Series.shift().
I'd advise against calling a column 'count', as count is a built-in DataFrame method in pandas, so you won't be able to use the df.col_name attribute notation for it.
# if 'start' and 'finish' are still strings, parse them first:
# df['start'] = pd.to_datetime(df.start)
# df['finish'] = pd.to_datetime(df.finish)
df['count'] = (df.groupby('id').apply(lambda x: x.finish == x.start.shift(-1))
               .astype('int').reset_index(level=0, drop=True))
Output:
id start finish location count
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A 1
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B 1
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C 0
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B 1
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A 0
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A 1
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D 1
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C 1
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H 0
9 4 2015-12-09 10:45:00 2015-12-13 12:20:00 E 1
10 4 2015-12-13 12:20:00 2015-12-13 18:20:00 A 1
11 4 2015-12-13 18:20:00 2015-12-13 23:40:00 A 1
12 4 2015-12-13 23:40:00 2015-12-16 08:00:00 B 0
13 5 2015-12-07 08:00:00 2015-12-13 12:25:00 H 0
And just for completeness:
df.groupby(df.start.dt.date)['count'].sum()
start
2015-12-07 0
2015-12-09 1
2015-12-10 1
2015-12-12 0
2015-12-13 2
2015-12-14 1
2015-12-15 3
2015-12-16 0
2015-12-18 1
2015-12-19 0
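A vectorized alternative to the groupby-apply above (a sketch, not from the original answer, assuming the frame is sorted by id and start as shown): shift 'start' up by one within each id and compare it to 'finish'; the NaT at the end of each group compares as False, giving the trailing 0:
next_start = df.groupby('id')['start'].shift(-1)   # start of the following row within the same id
df['count'] = df['finish'].eq(next_start).astype(int)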