Pandas group by datetime within column level

I have a dataframe created by:
import numpy as np
import pandas as pd

df = pd.DataFrame({})
df['Date'] = pd.to_datetime(np.arange(0, 12), unit='h', origin='2018-08-01 06:00:00')
df['ship'] = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]  # ship ID number
dt_trip = 4  # maximum trip duration (hours) for rows to belong to the same trip
Date ship
0 2018-08-01 06:00:00 1
1 2018-08-01 07:00:00 1
2 2018-08-01 08:00:00 2
3 2018-08-01 09:00:00 2
4 2018-08-01 10:00:00 2
5 2018-08-01 11:00:00 3
6 2018-08-01 12:00:00 3
7 2018-08-01 13:00:00 3
8 2018-08-01 14:00:00 3
9 2018-08-01 15:00:00 3
10 2018-08-01 16:00:00 3
11 2018-08-01 17:00:00 3
I am trying to get a new column that shows the trips of each ship. Each trip is defined as an interval of 4 hours measured from the start of the trip. Whenever the ship number changes on the next row, a new trip should start automatically (irrespective of the previous datetime). From a previous post I got a solution for the trips.
origin = df["Date"][0].hour  # hour of the very first timestamp in the data
df["Trip"] = df.apply(lambda x: ((x["Date"].hour - origin) // dt_trip) + 1, axis=1)
df["Trip"] = df.groupby(['Trip', 'ship']).ngroup() + 1  # trip numbering starts at 1
This solution starts a new trip whenever the ship column changes, which is what I want. The only change I still need is for the origin to reset to the datetime at which each new trip starts, instead of always using the first datetime in the data. For example, index 4 should have Trip = 2, because the ship is the same and the time difference from the start of that trip (index 2) is under 4 hours.
Desired solution looks like:
Date ship Trip Trip_desired
0 2018-08-01 06:00:00 1 1 1
1 2018-08-01 07:00:00 1 1 1
2 2018-08-01 08:00:00 2 2 2
3 2018-08-01 09:00:00 2 2 2
4 2018-08-01 10:00:00 2 3 2
5 2018-08-01 11:00:00 3 4 3
6 2018-08-01 12:00:00 3 4 3
7 2018-08-01 13:00:00 3 4 3
8 2018-08-01 14:00:00 3 5 3
9 2018-08-01 15:00:00 3 5 4
10 2018-08-01 16:00:00 3 5 4
11 2018-08-01 17:00:00 3 5 4

I would do:
# time elapsed since each ship's first timestamp
total_time = df['Date'] - df.groupby('ship')['Date'].transform('min')
# bin the elapsed time into 4-hour windows
trips = total_time.dt.total_seconds().fillna(0) // (dt_trip * 3600)
# number each (ship, window) combination consecutively
df['trip'] = df.groupby(['ship', trips]).ngroup() + 1
Output:
Date ship trip
0 2018-08-01 06:00:00 1 1
1 2018-08-01 07:00:00 1 1
2 2018-08-01 08:00:00 2 2
3 2018-08-01 09:00:00 2 2
4 2018-08-01 10:00:00 2 2
5 2018-08-01 11:00:00 3 3
6 2018-08-01 12:00:00 3 3
7 2018-08-01 13:00:00 3 3
8 2018-08-01 14:00:00 3 3
9 2018-08-01 15:00:00 3 4
10 2018-08-01 16:00:00 3 4
11 2018-08-01 17:00:00 3 4
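For this sample, those fixed 4-hour bins happen to reproduce the desired column, but they are measured from each ship's first timestamp rather than from the start of the current trip. With irregular timestamps the question's rule ("reset the origin whenever a new trip starts") needs a sequential pass per ship instead. A minimal sketch of that variant (label_trips is a made-up helper, not a pandas function):

def label_trips(dates, max_hours):
    # walk the timestamps in order; open a new trip once the current
    # timestamp is max_hours or more past the current trip's origin
    trip = np.empty(len(dates), dtype=int)
    origin = dates.iloc[0]
    n = 1
    for i, d in enumerate(dates):
        if d - origin >= pd.Timedelta(hours=max_hours):
            origin = d
            n += 1
        trip[i] = n
    return pd.Series(trip, index=dates.index)

per_ship = df.groupby('ship')['Date'].transform(lambda s: label_trips(s, dt_trip))
df['Trip_desired'] = df.groupby(['ship', per_ship]).ngroup() + 1

On the sample data this reproduces the Trip_desired column above; it only diverges from the fixed-bin approach when timestamps are irregular.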

Related

Pandas: create a period based on date column

I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add a period column based only on datetime: a new period starts whenever the gap to the previous row exceeds 2 hours.
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing, but based on both ID and datetime:
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get the difference with Series.diff, convert it to hours via Series.dt.total_seconds, compare against 2, and add a cumulative sum:
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
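To see why this works, here are the intermediate values for this sample (gaps and starts are names used only for illustration):

gaps = df['datetime'].diff().dt.total_seconds().div(3600)
# gaps: NaN, 0.25, 4.75, 0.17, 2.83 (hours between consecutive rows, rounded)
starts = gaps.gt(2)  # True where the gap exceeds 2 hours; NaN compares False
# starts: False, False, True, False, True
df['period'] = starts.cumsum().add(1)  # running count of period starts, 1-based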
The same idea, applied per group:
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3

Creating nested dataframes with multiple dataframes

I have multiple dataframes, the following are only 2 of them:
print(df1)
Date A B C
2019-10-01 00:00:00 2 3 1
2019-10-01 01:00:00 5 1 6
2019-10-01 02:00:00 8 2 4
2019-10-01 03:00:00 3 6 5
print(df2)
Date A B C
2019-10-01 00:00:00 9 4 2
2019-10-01 01:00:00 3 2 4
2019-10-01 02:00:00 6 5 2
2019-10-01 03:00:00 3 6 5
All of them have the same index and columns. I want to create a dataframe like this:
Date df1 df2
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
I have to apply this process to 30 dataframes (their indexes and columns are all the same), so I want to write a for loop to build this dataframe. How can I do that?
Reshape each DataFrame in the list with DataFrame.set_index followed by DataFrame.unstack, concat them along the columns, and finally rename the columns with a lambda function:
dfs = [df1,df2]
df = (pd.concat([x.set_index('Date').unstack() for x in dfs], axis=1)
        .rename(columns=lambda x: f'df{x+1}'))
print (df)
df1 df2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
If you want custom column names in the final DataFrame, create a list of names with the same length as dfs and pass it via the keys parameter:
dfs = [df1,df2]
names = ['col1','col2']
df = pd.concat([x.set_index('Date').unstack() for x in dfs], keys=names, axis=1)
print (df)
col1 col2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
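With 30 DataFrames you would not type the list by hand. Assuming the frames are already loaded (here only df1 and df2 exist; extending the list is up to your loading code), a sketch:

dfs = [df1, df2]  # ...extend with the remaining frames
names = [f'df{i}' for i in range(1, len(dfs) + 1)]
df = pd.concat([x.set_index('Date').unstack() for x in dfs],
               keys=names, axis=1)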

Splitting value dataframe over multiple timeslots

I would like to spread the values of the 15-minute intervals evenly over 5-minute intervals, but I cannot get it to work. The data is:
Datetime a
2018-01-01 00:00:00 6
2018-01-01 00:15:00 3
2018-01-01 00:30:00 9
Desired output would be:
Datetime a
2018-01-01 00:00:00 2
2018-01-01 00:05:00 2
2018-01-01 00:10:00 2
2018-01-01 00:15:00 1
2018-01-01 00:20:00 1
2018-01-01 00:25:00 1
2018-01-01 00:30:00 3
2018-01-01 00:35:00 3
2018-01-01 00:40:00 3
Perhaps unnecessary to point out: the value 6 at 00:00:00 is spread over the intervals 00:00:00 through 00:10:00.
Slightly different approach:
# convert to datetime
df.Datetime = pd.to_datetime(df.Datetime)
# set Datetime as index
df.set_index('Datetime', inplace=True)
# add one extra row so the last value also covers three slots
df.loc[df.index.max() + pd.to_timedelta('10min')] = 0
# upsample to a 5-minute frequency, filling the new slots with 0
s = df.asfreq('5T', fill_value=0)
# replace each run (an original value plus its trailing zeros) by its mean
(s.groupby(s['a'].ne(0).cumsum())
  .transform('mean')
  .reset_index())
Output:
Datetime a
0 2018-01-01 00:00:00 2
1 2018-01-01 00:05:00 2
2 2018-01-01 00:10:00 2
3 2018-01-01 00:15:00 1
4 2018-01-01 00:20:00 1
5 2018-01-01 00:25:00 1
6 2018-01-01 00:30:00 3
7 2018-01-01 00:35:00 3
8 2018-01-01 00:40:00 3
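An alternative sketch of the same idea: each 15-minute value covers exactly three 5-minute slots, so you can upsample, forward-fill, and divide by 3 (this assumes df is already indexed by Datetime, as above):

import numpy as np

# extend the index so the last 15-minute value also gets three slots
df.loc[df.index.max() + pd.Timedelta('10min')] = np.nan
out = (df['a'].resample('5T').asfreq()  # upsample; the new slots are NaN
              .ffill()                  # carry each 15-minute value forward
              .div(3)                   # split it evenly over its three slots
              .reset_index())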

Groupby by different columns

I have a dataframe nf as follows:
StationID DateTime Channel Count
0 1 2017-10-01 00:00:00 1 1
1 1 2017-10-01 00:00:00 1 201
2 1 2017-10-01 00:00:00 1 8
3 1 2017-10-01 00:00:00 1 2
4 1 2017-10-01 00:00:00 1 0
5 1 2017-10-01 00:00:00 1 0
6 1 2017-10-01 00:00:00 1 0
7 1 2017-10-01 00:00:00 1 0
.......... and so on
I want to sum the Count values per hour, for each Channel and StationID.
Required output:
Station ID DateTime Channel Count
1 2017-10-01 00:00:00 1 232
1 2017-10-01 00:01:00 1 23
2 2017-10-01 00:00:00 1 244...
...... and so on
I think you need groupby with an aggregate sum; to bucket the datetimes by hour, add Series.dt.floor, which sets minutes and seconds to 0:
print (df)
StationID DateTime Channel Count
0 1 2017-12-01 00:00:00 1 1
1 1 2017-12-01 00:00:00 1 201
2 1 2017-12-01 00:10:00 1 8
3 1 2017-12-01 10:00:00 1 2
4 1 2017-10-01 10:50:00 1 0
5 1 2017-10-01 10:20:00 1 5
6 1 2017-10-01 08:10:00 1 4
7 1 2017-10-01 08:00:00 1 1
df['DateTime'] = pd.to_datetime(df['DateTime'])
df1 = (df.groupby(['StationID', df['DateTime'].dt.floor('H'), 'Channel'])['Count']
.sum()
.reset_index()
)
print (df1)
StationID DateTime Channel Count
0 1 2017-10-01 08:00:00 1 5
1 1 2017-10-01 10:00:00 1 5
2 1 2017-12-01 00:00:00 1 210
3 1 2017-12-01 10:00:00 1 2
print (df['DateTime'].dt.floor('H'))
0 2017-12-01 00:00:00
1 2017-12-01 00:00:00
2 2017-12-01 00:00:00
3 2017-12-01 10:00:00
4 2017-10-01 10:00:00
5 2017-10-01 10:00:00
6 2017-10-01 08:00:00
7 2017-10-01 08:00:00
Name: DateTime, dtype: datetime64[ns]
But if the dates are not important, only the hours, use Series.dt.hour:
df2 = (df.groupby(['StationID', df['DateTime'].dt.hour, 'Channel'])['Count']
.sum()
.reset_index()
)
print (df2)
StationID DateTime Channel Count
0 1 0 1 210
1 1 8 1 5
2 1 10 1 7
Or you can use pd.Grouper (the keys have to be passed as one list):
df.groupby([pd.Grouper(key='DateTime', freq='H'), 'Channel', 'StationID'])['Count'].sum()
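As with the floor approach, append reset_index to get flat columns back; for the non-empty bins the sums should match df1 above:

df3 = (df.groupby([pd.Grouper(key='DateTime', freq='H'), 'StationID', 'Channel'])['Count']
         .sum()
         .reset_index())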

Split DateTimeIndex data based on hour/minute/second

I have time-series data that I would like to split based on hour, or minute, or second. This is generally user-defined. I would like to know how it can be done.
For example, consider the following:
test = pd.DataFrame({'TIME': pd.date_range(start='2016-09-30',
                                           freq='600s', periods=20)})
test['X'] = np.arange(20)
The output is:
TIME X
0 2016-09-30 00:00:00 0
1 2016-09-30 00:10:00 1
2 2016-09-30 00:20:00 2
3 2016-09-30 00:30:00 3
4 2016-09-30 00:40:00 4
5 2016-09-30 00:50:00 5
6 2016-09-30 01:00:00 6
7 2016-09-30 01:10:00 7
8 2016-09-30 01:20:00 8
9 2016-09-30 01:30:00 9
10 2016-09-30 01:40:00 10
11 2016-09-30 01:50:00 11
12 2016-09-30 02:00:00 12
13 2016-09-30 02:10:00 13
14 2016-09-30 02:20:00 14
15 2016-09-30 02:30:00 15
16 2016-09-30 02:40:00 16
17 2016-09-30 02:50:00 17
18 2016-09-30 03:00:00 18
19 2016-09-30 03:10:00 19
Suppose I want to split it by hour. I would like the following as one chunk which I can then save to a file.
TIME X
0 2016-09-30 00:00:00 0
1 2016-09-30 00:10:00 1
2 2016-09-30 00:20:00 2
3 2016-09-30 00:30:00 3
4 2016-09-30 00:40:00 4
5 2016-09-30 00:50:00 5
The second chunk would be:
TIME X
0 2016-09-30 01:00:00 6
1 2016-09-30 01:10:00 7
2 2016-09-30 01:20:00 8
3 2016-09-30 01:30:00 9
4 2016-09-30 01:40:00 10
5 2016-09-30 01:50:00 11
and so on...
Note that I can do it purely with logical conditions, such as
df[(df['TIME'] >= '2016-09-30 00:00:00') &
(df['TIME'] <= '2016-09-30 00:50:00')]
and repeat....
but what if my sampling changes? Is there a way to create a mask or something that takes less code and is efficient? I have 10 GB of data.
Option 1
You can group by Series that are not columns of the object you're grouping:
test.groupby([test.TIME.dt.date,
              test.TIME.dt.hour,
              test.TIME.dt.minute,
              test.TIME.dt.second])
Option 2
use pd.TimeGrouper
test.set_index('TIME').groupby(pd.TimeGrouper('S')) # Group by seconds
test.set_index('TIME').groupby(pd.TimeGrouper('T')) # Group by minutes
test.set_index('TIME').groupby(pd.TimeGrouper('H')) # Group by hours
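Note that pd.TimeGrouper has since been deprecated (and removed in pandas 1.0); the current spelling uses pd.Grouper:

test.set_index('TIME').groupby(pd.Grouper(freq='H'))  # group by hours
test.groupby(pd.Grouper(key='TIME', freq='H'))        # same, without setting an index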
You need to use groupby for this, and the grouping should be based on date and hour:
test['DATE'] = test['TIME'].dt.date
test['HOUR'] = test['TIME'].dt.hour
grp = test.groupby(['DATE', 'HOUR'])
You can then loop over the groups and do the operation you want.
Example:
for key, df in grp:
    print(key, df)
((datetime.date(2016, 9, 30), 0), TIME X DATE HOUR
0 2016-09-30 00:00:00 0 2016-09-30 0
1 2016-09-30 00:10:00 1 2016-09-30 0
2 2016-09-30 00:20:00 2 2016-09-30 0
3 2016-09-30 00:30:00 3 2016-09-30 0
4 2016-09-30 00:40:00 4 2016-09-30 0
5 2016-09-30 00:50:00 5 2016-09-30 0)
((datetime.date(2016, 9, 30), 1), TIME X DATE HOUR
6 2016-09-30 01:00:00 6 2016-09-30 1
7 2016-09-30 01:10:00 7 2016-09-30 1
8 2016-09-30 01:20:00 8 2016-09-30 1
9 2016-09-30 01:30:00 9 2016-09-30 1
10 2016-09-30 01:40:00 10 2016-09-30 1
11 2016-09-30 01:50:00 11 2016-09-30 1)
((datetime.date(2016, 9, 30), 2), TIME X DATE HOUR
12 2016-09-30 02:00:00 12 2016-09-30 2
13 2016-09-30 02:10:00 13 2016-09-30 2
14 2016-09-30 02:20:00 14 2016-09-30 2
15 2016-09-30 02:30:00 15 2016-09-30 2
16 2016-09-30 02:40:00 16 2016-09-30 2
17 2016-09-30 02:50:00 17 2016-09-30 2)
((datetime.date(2016, 9, 30), 3), TIME X DATE HOUR
18 2016-09-30 03:00:00 18 2016-09-30 3
19 2016-09-30 03:10:00 19 2016-09-30 3)
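Since the goal is to save each chunk to its own file, a minimal sketch (the CSV naming pattern here is made up):

for (date, hour), chunk in grp:
    # e.g. chunk_2016-09-30_00.csv
    out = chunk.drop(columns=['DATE', 'HOUR'])
    out.to_csv(f'chunk_{date}_{hour:02d}.csv', index=False)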