Creating nested dataframes with multiple dataframes - pandas

I have multiple dataframes, the following are only 2 of them:
print(df1)
Date A B C
2019-10-01 00:00:00 2 3 1
2019-10-01 01:00:00 5 1 6
2019-10-01 02:00:00 8 2 4
2019-10-01 03:00:00 3 6 5
print(df2)
Date A B C
2019-10-01 00:00:00 9 4 2
2019-10-01 01:00:00 3 2 4
2019-10-01 02:00:00 6 5 2
2019-10-01 03:00:00 3 6 5
All of them have same index and columns. I want to create dataframe like this:
Date df1 df2
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
I have to apply this process to 30 dataframes(their index and columns are same), so I want to write a for loop in order to achieve this dataframe. How can I do that?

Reshape each DataFrame of list of DataFrames by DataFrame.set_index with DataFrame.unstack and then concat, last change columns names with lambda function:
dfs = [df1,df2]
df = (pd.concat([x.set_index('Date').unstack() for x in dfs], axis=1)
.rename(columns=lambda x: f'df{x+1}'))
print (df)
df1 df2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
If want some custom columns names in final DataFrame create list with same size like length of dfs and add parameter keys:
dfs = [df1,df2]
names = ['col1','col2']
df = pd.concat([x.set_index('Date').unstack() for x in dfs], keys=names, axis=1)
print (df)
col1 col2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5

Related

Pandas: create a period based on date column

I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add period based just on datetime if it increases to 2 hours
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing but based on ID and datetime
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get difference by Series.diff, convert to hours Series.dt.total_seconds, comapre for 2 and add cumulative sum:
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
Similar idea per groups:
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3

7 days hourly mean with pandas

I need some help calculating a 7 days mean for every hour.
The timeseries has a hourly resolution and I need the 7 days mean for each hour e.g. for 13 o'clock
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but rolling includes last 7 days.
Thanks for any hints!
Add a new hour column, grouping by hour column, and then add
The average was calculated over 7 days. This is consistent with the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429

Pandas group by datetime within column level

I have a dataframe created by:
df = pd.DataFrame({})
df['Date'] = pd.to_datetime(np.arange(0,12), unit='h', origin='2018-08-01 06:00:00')
df['ship'] = [1,1,2,2,2,3,3,3,3,3,3,3] # ship ID number
dt_trip = 4 # maximum duration of each trip to be classified as the same trip
Date ship
0 2018-08-01 06:00:00 1
1 2018-08-01 07:00:00 1
2 2018-08-01 08:00:00 2
3 2018-08-01 09:00:00 2
4 2018-08-01 10:00:00 2
5 2018-08-01 11:00:00 3
6 2018-08-01 12:00:00 3
7 2018-08-01 13:00:00 3
8 2018-08-01 14:00:00 3
9 2018-08-01 15:00:00 3
10 2018-08-01 16:00:00 3
11 2018-08-01 17:00:00 3
I try to get a a new column which shows the trips of each ship. Each trip is defined by an interval of 4 hours with respect to the start of the trip. When a new ship number is on the next row, automatically a new trip should start (irrespective of the previous datetime). From a previous post I got a solution for the trips.
origin = df["Date"][0].hour
df["Trip"] = df.apply(lambda x: ((x["Date"].hour - origin) // dt_trip) + 1, axis=1)
df["Trip"] = df.groupby(['Trip','ship']).ngroup() +1 # trip starts at: 1
This solution takes a new trip when the ship-column changes its row. The only change I want to have is to change the origin to the datetime when a new trip starts. So index 4 should have Trip = 2, because the ship is the same and the time difference between the start of the trip (index=2). Now it looks at the first given datetime.
Desired solution looks like:
Date ship Trip Trip_desired
0 2018-08-01 06:00:00 1 1 1
1 2018-08-01 07:00:00 1 1 1
2 2018-08-01 08:00:00 2 2 2
3 2018-08-01 09:00:00 2 2 2
4 2018-08-01 10:00:00 2 3 2
5 2018-08-01 11:00:00 3 4 3
6 2018-08-01 12:00:00 3 4 3
7 2018-08-01 13:00:00 3 4 3
8 2018-08-01 14:00:00 3 5 3
9 2018-08-01 15:00:00 3 5 4
10 2018-08-01 16:00:00 3 5 4
11 2018-08-01 17:00:00 3 5 4
I would do:
total_time = df['Date'] - df.groupby('ship')['Date'].transform('min')
trips = total_time.dt.total_seconds().fillna(0)//(dt_trip*3600)
df['trip'] = df.groupby(['ship', trips]).ngroup()+1
Output:
Date ship trip
0 2018-08-01 06:00:00 1 1
1 2018-08-01 07:00:00 1 1
2 2018-08-01 08:00:00 2 2
3 2018-08-01 09:00:00 2 2
4 2018-08-01 10:00:00 2 2
5 2018-08-01 11:00:00 3 3
6 2018-08-01 12:00:00 3 3
7 2018-08-01 13:00:00 3 3
8 2018-08-01 14:00:00 3 3
9 2018-08-01 15:00:00 3 4
10 2018-08-01 16:00:00 3 4
11 2018-08-01 17:00:00 3 4

Splitting value dataframe over multiple timeslots

Would like to spread the values of the 15 minute intervals evenly over the 5 minute intervals. But cannot get it to work. Data is:
Datetime a
2018-01-01 00:00:00 6
2018-01-01 00:15:00 3
2018-01-01 00:30:00 9
Desired output would be:
Datetime a
2018-01-01 00:00:00 2
2018-01-11 00:05:00 2
2018-01-11 00:10:00 2
2018-01-11 00:15:00 1
2018-01-11 00:20:00 1
2018-01-11 00:25:00 1
2018-01-11 00:30:00 3
2018-01-11 00:35:00 3
2018-01-11 00:40:00 3
perhaps unnecessarily, but the value '6' of 00:00:00 in the data is spread over the intervals 00:00:00-00:10:00
Slightly different approach:
# convert to datetime
df.Datetime = pd.to_datetime(df.Datetime)
# set Datetime as index
df.set_index('Datetime', inplace=True)
# add one extra row
df.loc[df.index.max()+pd.to_timedelta('10min')] = 0
# set_index and resample
s = df.asfreq('5T', fill_value=0)
# transform the 0's to mean:
(s.groupby(s['a'].ne(0)
.cumsum())
.transform('mean')
.reset_index()
)
Output:
Datetime a
0 2018-01-01 00:00:00 2
1 2018-01-01 00:05:00 2
2 2018-01-01 00:10:00 2
3 2018-01-01 00:15:00 1
4 2018-01-01 00:20:00 1
5 2018-01-01 00:25:00 1
6 2018-01-01 00:30:00 3
7 2018-01-01 00:35:00 3
8 2018-01-01 00:40:00 3

Groupby by different columns

I have a dataframe nf as follows :
StationID DateTime Channel Count
0 1 2017-10-01 00:00:00 1 1
1 1 2017-10-01 00:00:00 1 201
2 1 2017-10-01 00:00:00 1 8
3 1 2017-10-01 00:00:00 1 2
4 1 2017-10-01 00:00:00 1 0
5 1 2017-10-01 00:00:00 1 0
6 1 2017-10-01 00:00:00 1 0
7 1 2017-10-01 00:00:00 1 0
.......... and so on
I want to groupby values by each hour and for each channel and StationID
Output Req
Station ID DateTime Channel Count
1 2017-10-01 00:00:00 1 232
1 2017-10-01 00:01:00 1 23
2 2017-10-01 00:00:00 1 244...
...... and so on
I think you need groupby with aggregate sum, for datetimes with floor by hours add floor - it set minutes and seconds to 0:
print (df)
StationID DateTime Channel Count
0 1 2017-12-01 00:00:00 1 1
1 1 2017-12-01 00:00:00 1 201
2 1 2017-12-01 00:10:00 1 8
3 1 2017-12-01 10:00:00 1 2
4 1 2017-10-01 10:50:00 1 0
5 1 2017-10-01 10:20:00 1 5
6 1 2017-10-01 08:10:00 1 4
7 1 2017-10-01 08:00:00 1 1
df['DateTime'] = pd.to_datetime(df['DateTime'])
df1 = (df.groupby(['StationID', df['DateTime'].dt.floor('H'), 'Channel'])['Count']
.sum()
.reset_index()
)
print (df1)
StationID DateTime Channel Count
0 1 2017-10-01 08:00:00 1 5
1 1 2017-10-01 10:00:00 1 5
2 1 2017-12-01 00:00:00 1 210
3 1 2017-12-01 10:00:00 1 2
print (df['DateTime'].dt.floor('H'))
0 2017-12-01 00:00:00
1 2017-12-01 00:00:00
2 2017-12-01 00:00:00
3 2017-12-01 10:00:00
4 2017-10-01 10:00:00
5 2017-10-01 10:00:00
6 2017-10-01 08:00:00
7 2017-10-01 08:00:00
Name: DateTime, dtype: datetime64[ns]
But if dates are not important, only hours use hour:
df2 = (df.groupby(['StationID', df['DateTime'].dt.hour, 'Channel'])['Count']
.sum()
.reset_index()
)
print (df2)
StationID DateTime Channel Count
0 1 0 1 210
1 1 8 1 5
2 1 10 1 7
Or you can use Grouper:
df.groupby(pd.Grouper(key='DateTime', freq='"H'), 'Channel', 'StationID')['Count'].sum()