Add header to .data file in Pandas
Given a file with the extension .data, I have read it with pd.read_fwf("./input.data", sep=",", header=None):
Out:
0
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3...
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5...
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6...
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5...
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4...
... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2...
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2...
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4...
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2...
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0...
How can I add the following column names to it? Thanks.
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
Update:
pd.read_fwf("./input.data", names = col_names)
Out:
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

[297 rows x 14 columns]
If you check read_fwf:
Read a table of fixed-width formatted lines into DataFrame.
So the file is not fixed-width; since there is a , separator, use read_csv instead:
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("input.data", names=col_names)
print (df)
age sex cp restbp chol fbs restecg thalach exang oldpeak \
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4
.. ... ... ... ... ... ... ... ... ... ...
292 57.0 0.0 4.0 140.0 241.0 0.0 0.0 123.0 1.0 0.2
293 45.0 1.0 1.0 110.0 264.0 0.0 0.0 132.0 0.0 1.2
294 68.0 1.0 4.0 144.0 193.0 1.0 0.0 141.0 0.0 3.4
295 57.0 1.0 4.0 130.0 131.0 0.0 0.0 115.0 1.0 1.2
296 57.0 0.0 2.0 130.0 236.0 0.0 2.0 174.0 0.0 0.0
slope ca thal num
0 3.0 0.0 6.0 0
1 2.0 3.0 3.0 1
2 2.0 2.0 7.0 1
3 3.0 0.0 3.0 0
4 1.0 0.0 3.0 0
.. ... ... ... ...
292 2.0 0.0 7.0 1
293 2.0 0.0 7.0 1
294 2.0 2.0 7.0 1
295 2.0 1.0 7.0 1
296 2.0 1.0 3.0 1
[297 rows x 14 columns]
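Alternatively, if the file has already been read without names, the labels can be attached after the fact. A minimal sketch, assuming the file really has 14 comma-separated fields per row:

import pandas as pd

col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# read first without names, then attach the labels
df = pd.read_csv("./input.data", header=None)
df.columns = col_names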
Just do a read_csv without a header and pass col_names:
df = pd.read_csv('input.data', header=None, names=col_names)
Output (head):
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
-- ----- ----- ---- -------- ------ ----- --------- --------- ------- --------- ------- ---- ------ -----
0 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3 3 1
2 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
3 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
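A note on the two variants above: when names= is passed to read_csv, the default header='infer' already behaves like header=None, so the explicit header=None is optional but makes the intent clearer. Once the names are attached, they can be persisted by writing the frame back out, since to_csv includes the header row by default. A small sketch (the output file name is just an example):

df.to_csv("input_with_header.csv", index=False)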
Related
How to count months with at least 1 non NaN value?
I have this df:

       CODE  YEAR  MONTH  DAY  TMAX  TMIN    PP
0       130  1991      1    1  32.6  23.4   0.0
1       130  1991      1    2  31.2  22.4   0.0
2       130  1991      1    3  32.0   NaN   0.0
3       130  1991      1    4  32.2  23.0   0.0
4       130  1991      1    5  30.5  22.0   0.0
...     ...   ...    ...  ...   ...   ...   ...
20118   130  2018      9   30  31.8  21.2   NaN
30028   132  1991      1    1  35.2   NaN   0.0
30029   132  1991      1    2  34.6   NaN   0.0
30030   132  1991      1    3  35.8   NaN   0.0
30031   132  1991      1    4  34.8   NaN   0.0
...     ...   ...    ...  ...   ...   ...   ...
45000   132  2019     10    5  35.5   NaN  21.1
46500   133  1991      1    1  35.5   NaN  21.1

I need to count the months that have at least 1 non-NaN value in the TMAX, TMIN and PP columns. If a month has all NaN values, that month doesn't count. I need to do this for each CODE. Expected result:

CODE  YEAR  MONTH  DAY  TMAX  TMIN    PP  JANUARY_TMAX  FEBRUARY_TMAX  MARCH_TMAX  APRIL_TMAX  etc
 130  1991      1    1  32.6  23.4     0            23             25          22          27    …
 130  1991      1    2  31.2  22.4     0           NaN            NaN         NaN         NaN  NaN
 130  1991      1    3  32.0   NaN     0           NaN            NaN         NaN         NaN  NaN
 130  1991      1    4  32.2  23.0     0           NaN            NaN         NaN         NaN  NaN
 130  1991      1    5  30.5  22.0     0           NaN            NaN         NaN         NaN  NaN
 ...   ...    ...  ...   ...   ...   ...           NaN            NaN         NaN         NaN  NaN
 130  2018      9   30  31.8  21.2   NaN           NaN            NaN         NaN         NaN  NaN
 132  1991      1    1  35.2   NaN     0            21             23          22          22    …
 132  1991      1    2  34.6   NaN     0           NaN            NaN         NaN         NaN  NaN
 132  1991      1    3  35.8   NaN     0           NaN            NaN         NaN         NaN  NaN
 132  1991      1    4  34.8   NaN     0           NaN            NaN         NaN         NaN  NaN
 ...   ...    ...  ...   ...   ...   ...           NaN            NaN         NaN         NaN  NaN
 132  2019      1    1  35.5   NaN  21.1           NaN            NaN         NaN         NaN  NaN
 ...   ...    ...  ...   ...   ...   ...           NaN            NaN         NaN         NaN  NaN
 133  1991      1    1  35.5   NaN  21.1            25             22          22          21    …
 ...   ...    ...  ...   ...   ...   ...           NaN            NaN         NaN         NaN  NaN

For example: in code 130, for the TMAX column, I have 23 Januarys that have at least 1 non-NaN value, 25 Februarys that have at least 1 non-NaN value, etc. Would you mind helping me? Thanks in advance.
This may not be super efficient, but here is how you can do it for one of the columns, TMAX in this case. Just repeat the process for the other columns (a looped version is sketched after this answer).

# Count occurrences of each month when TMAX is not null
tmax_cts_long = (df[df.TMAX.notnull()]
                 .drop_duplicates(subset=['CODE', 'YEAR', 'MONTH'])
                 .groupby(['CODE', 'MONTH'])
                 .size()
                 .reset_index(name='COUNT'))

# Transpose the long table of counts to wide format
tmax_cts_wide = tmax_cts_long.pivot(index='CODE', columns='MONTH', values='COUNT')

# Merge the table of counts with the original dataframe
final_df = df.merge(tmax_cts_wide, on='CODE', how='left')

# Replace values in the new columns in all rows after the first row with NaN
mask = final_df.index.isin(df.groupby(['CODE', 'MONTH']).head(1).index)
final_df.loc[~mask, [col for col in final_df.columns if isinstance(col, int)]] = None

# Rename the new columns to follow the desired naming format
mon_dict = {1: 'JANUARY', 2: 'FEBRUARY', ...}
tmax_mon_dict = {k: v + '_TMAX' for k, v in mon_dict.items()}
final_df.rename(columns=tmax_mon_dict, inplace=True)
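Since the same steps have to be repeated for TMIN and PP, they can be wrapped in a loop. A sketch under the same assumptions as the answer above (the elided mon_dict is spelled out in full here, and the NaN-masking step is omitted for brevity):

import pandas as pd

mon_dict = {1: 'JANUARY', 2: 'FEBRUARY', 3: 'MARCH', 4: 'APRIL',
            5: 'MAY', 6: 'JUNE', 7: 'JULY', 8: 'AUGUST', 9: 'SEPTEMBER',
            10: 'OCTOBER', 11: 'NOVEMBER', 12: 'DECEMBER'}

final_df = df.copy()
for col in ['TMAX', 'TMIN', 'PP']:
    # count (CODE, MONTH) pairs with at least one non-null value in col
    cts = (df[df[col].notnull()]
           .drop_duplicates(subset=['CODE', 'YEAR', 'MONTH'])
           .groupby(['CODE', 'MONTH'])
           .size()
           .reset_index(name='COUNT'))
    # one column per month, named e.g. JANUARY_TMAX
    wide = cts.pivot(index='CODE', columns='MONTH', values='COUNT')
    wide.columns = [mon_dict[m] + '_' + col for m in wide.columns]
    final_df = final_df.merge(wide.reset_index(), on='CODE', how='left')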
How to remove periods of time in a dataframe?
I have this df:

         CODE  YEAR  MONTH  DAY  TMAX  TMIN    PP  BAD PERIOD 1  BAD PERIOD 2
9984   000130  1991      1    1  32.6  23.4   0.0          1991          1998
9985   000130  1991      1    2  31.2  22.4   0.0           NaN           NaN
9986   000130  1991      1    3  32.0   NaN   0.0           NaN           NaN
9987   000130  1991      1    4  32.2  23.0   0.0           NaN           NaN
9988   000130  1991      1    5  30.5  22.0   0.0           NaN           NaN
...       ...   ...    ...  ...   ...   ...   ...           ...           ...
20118  000130  2018      9   30  31.8  21.2   NaN           NaN           NaN
30028  000132  1991      1    1  35.2   NaN   0.0          2005          2010
30029  000132  1991      1    2  34.6   NaN   0.0           NaN           NaN
30030  000132  1991      1    3  35.8   NaN   0.0           NaN           NaN
30031  000132  1991      1    4  34.8   NaN   0.0           NaN           NaN
...       ...   ...    ...  ...   ...   ...   ...           ...           ...
50027  000132  2019     10    5  36.5   NaN  13.1           NaN           NaN
50028  000133  1991      1    1  36.2   NaN   0.0          1991          2010
50029  000133  1991      1    2  36.6   NaN   0.0           NaN           NaN
50030  000133  1991      1    3  36.8   NaN   5.0           NaN           NaN
50031  000133  1991      1    4  36.8   NaN   0.0           NaN           NaN
...       ...   ...    ...  ...   ...   ...   ...           ...           ...
54456  000133  2019     10    5  36.5   NaN  12.1           NaN           NaN

I want to change the values of the columns TMAX, TMIN and PP to NaN, but only for the periods specified in BAD PERIOD 1 and BAD PERIOD 2, and only within their respective CODE. For example, if BAD PERIOD 1 is 1991 and BAD PERIOD 2 is 1998, I want all the values of TMAX, TMIN and PP that have code 000130 to become NaN from 1991 (bad period 1) to 1998 (bad period 2). I have 371 unique codes in the CODE column, so I might use df.groupby("CODE"). Expected result after the change:

         CODE  YEAR  MONTH  DAY  TMAX  TMIN    PP  BAD PERIOD 1  BAD PERIOD 2
9984   000130  1991      1    1   NaN   NaN   NaN          1991          1998
9985   000130  1991      1    2   NaN   NaN   NaN           NaN           NaN
9986   000130  1991      1    3   NaN   NaN   NaN           NaN           NaN
9987   000130  1991      1    4   NaN   NaN   NaN           NaN           NaN
9988   000130  1991      1    5   NaN   NaN   NaN           NaN           NaN
...       ...   ...    ...  ...   ...   ...   ...           ...           ...
20118  000130  2018      9   30  31.8  21.2   NaN           NaN           NaN
30028  000132  1991      1    1  35.2   NaN   0.0          2005          2010
30029  000132  1991      1    2  34.6   NaN   0.0           NaN           NaN
30030  000132  1991      1    3  35.8   NaN   0.0           NaN           NaN
30031  000132  1991      1    4  34.8   NaN   0.0           NaN           NaN
...       ...   ...    ...  ...   ...   ...   ...           ...           ...
50027  000132  2019     10    5  36.5   NaN  13.1           NaN           NaN
50028  000133  1991      1    1   NaN   NaN   NaN          1991          2010
50029  000133  1991      1    2   NaN   NaN   NaN           NaN           NaN
50030  000133  1991      1    3   NaN   NaN   NaN           NaN           NaN
50031  000133  1991      1    4   NaN   NaN   NaN           NaN           NaN
...       ...   ...    ...  ...   ...   ...   ...           ...           ...
54456  000133  2019     10    5  36.5   NaN  12.1           NaN           NaN
You can propagate the values in your bad columns with ffill, if the non-NaN values are always in the first row per group of CODE and your data is ordered by CODE. If not, use groupby.transform with 'first'. Then use mask to replace with NaN where YEAR is between your two bad columns, once they are filled with the wanted values.

df_ = df[['BAD_1', 'BAD_2']].ffill()
#or more flexible
df_ = df.groupby("CODE")[['BAD_1', 'BAD_2']].transform('first')

cols = ['TMAX', 'TMIN', 'PP']
df[cols] = df[cols].mask(df['YEAR'].ge(df_['BAD_1']) & df['YEAR'].le(df_['BAD_2']))
print(df)

      CODE  YEAR  MONTH  DAY  TMAX  TMIN    PP   BAD_1   BAD_2
9984   130  1991      1    1   NaN   NaN   NaN  1991.0  1998.0
9985   130  1991      1    2   NaN   NaN   NaN     NaN     NaN
9986   130  1991      1    3   NaN   NaN   NaN     NaN     NaN
9987   130  1991      1    4   NaN   NaN   NaN     NaN     NaN
9988   130  1991      1    5   NaN   NaN   NaN     NaN     NaN
20118  130  2018      9   30  31.8  21.2   NaN     NaN     NaN
30028  132  1991      1    1  35.2   NaN   0.0  2005.0  2010.0
30029  132  1991      1    2  34.6   NaN   0.0     NaN     NaN
30030  132  1991      1    3  35.8   NaN   0.0     NaN     NaN
30031  132  1991      1    4  34.8   NaN   0.0     NaN     NaN
50027  132  2019     10    5  36.5   NaN  13.1     NaN     NaN
50028  133  1991      1    1   NaN   NaN   NaN  1991.0  2010.0
50029  133  1991      1    2   NaN   NaN   NaN     NaN     NaN
50030  133  1991      1    3   NaN   NaN   NaN     NaN     NaN
50031  133  1991      1    4   NaN   NaN   NaN     NaN     NaN
54456  133  2019     10    5  36.5   NaN  12.1     NaN     NaN
In Python, how can I update multiple rows in a DataFrame with a Series?
I have a DataFrame as below.

             a    b    c    d
2010-07-23 NaN  NaN  NaN  NaN
2010-07-26 NaN  NaN  NaN  NaN
2010-07-27 NaN  NaN  NaN  NaN
2010-07-28 NaN  NaN  NaN  NaN
2010-07-29 NaN  NaN  NaN  NaN
2010-07-30 NaN  NaN  NaN  NaN
2010-08-02 NaN  NaN  NaN  NaN
2010-08-03 NaN  NaN  NaN  NaN
2010-08-04 NaN  NaN  NaN  NaN
2010-08-05 NaN  NaN  NaN  NaN

And I have a Series as below.

   2010-07-23
a           1
b           2
c           3
d           4

I want to update the DataFrame with the Series as below. How can I do that?

             a    b    c    d
2010-07-23 NaN  NaN  NaN  NaN
2010-07-26   1    2    3    4
2010-07-27   1    2    3    4
2010-07-28   1    2    3    4
2010-07-29 NaN  NaN  NaN  NaN
2010-07-30 NaN  NaN  NaN  NaN
2010-08-02 NaN  NaN  NaN  NaN
2010-08-03 NaN  NaN  NaN  NaN
2010-08-04 NaN  NaN  NaN  NaN
2010-08-05 NaN  NaN  NaN  NaN

Thank you very much for the help in advance.
If s is a one-column DataFrame instead of a Series, add DataFrame.squeeze, repeat it with concat along the length of the date range, and last pass it to DataFrame.update:

r = pd.date_range('2010-07-26','2010-07-28')
df.update(pd.concat([s.squeeze()] * len(r), axis=1, keys=r).T)
print (df)

              a    b    c    d
2010-07-23  NaN  NaN  NaN  NaN
2010-07-26  1.0  2.0  3.0  4.0
2010-07-27  1.0  2.0  3.0  4.0
2010-07-28  1.0  2.0  3.0  4.0
2010-07-29  NaN  NaN  NaN  NaN
2010-07-30  NaN  NaN  NaN  NaN
2010-08-02  NaN  NaN  NaN  NaN
2010-08-03  NaN  NaN  NaN  NaN
2010-08-04  NaN  NaN  NaN  NaN
2010-08-05  NaN  NaN  NaN  NaN

Or you can use np.broadcast_to to repeat the Series:

r = pd.date_range('2010-07-26','2010-07-28')
df1 = pd.DataFrame(np.broadcast_to(s.squeeze().values, (len(r), len(s))),
                   index=r,
                   columns=s.index)
print (df1)

            a  b  c  d
2010-07-26  1  2  3  4
2010-07-27  1  2  3  4
2010-07-28  1  2  3  4

df.update(df1)
print (df)

              a    b    c    d
2010-07-23  NaN  NaN  NaN  NaN
2010-07-26  1.0  2.0  3.0  4.0
2010-07-27  1.0  2.0  3.0  4.0
2010-07-28  1.0  2.0  3.0  4.0
2010-07-29  NaN  NaN  NaN  NaN
2010-07-30  NaN  NaN  NaN  NaN
2010-08-02  NaN  NaN  NaN  NaN
2010-08-03  NaN  NaN  NaN  NaN
2010-08-04  NaN  NaN  NaN  NaN
2010-08-05  NaN  NaN  NaN  NaN
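If the target rows already exist in df, plain label-based assignment is an alternative: .loc broadcasts a single row of values across all selected rows. A minimal sketch, assuming the order of s's index (a, b, c, d) matches df's columns:

r = pd.date_range('2010-07-26', '2010-07-28')
df.loc[r] = s.squeeze().values  # one row of 4 values, broadcast across the 3 dates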
How to split Pandas Series into a DataFrame with columns for each hour of day?
I have a Pandas Series of solar radiation values, with the index being timestamps at one minute resolution. E.g.:

index             solar_radiation
2019-01-01 08:01                0
2019-01-01 08:02               10
2019-01-01 08:03               15
...
2019-01-10 23:59                0

I would like to convert this to a table (DataFrame) where each hour is averaged into one column, e.g.:

index       00  01  02  03  04  05  06  ...  23
2019-01-01   0   0   0   0   0   3  10  ...   0
2019-01-02   0   0   0   0   0   4  12  ...   0
...
2019-01-10   0   0   0   0   0   6  24  ...   0

I have tried to look into groupby, but there I am only able to group the hours into one combined bin, not one for each day. Any hints or suggestions as to how I can achieve this with groupby, or should I just brute force it and iterate over each hour?
If I understand you correctly, you want to resample hourly. Then we can make a MultiIndex with date and hour, and unstack the hour index to columns:

df = df.resample('H').mean()
df.set_index([df.index.date, df.index.time], inplace=True)
df = df.unstack(level=[1])

Which gives us the following output:

print(df)

           solar_radiation                                              \
                  00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
2019-01-01             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-02             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-03             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-04             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-05             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-06             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-07             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-08             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-09             NaN      NaN      NaN      NaN      NaN      NaN
2019-01-10             NaN      NaN      NaN      NaN      NaN      NaN

                                                 ...                    \
           06:00:00 07:00:00  08:00:00 09:00:00 ... 14:00:00 15:00:00
2019-01-01      NaN      NaN  8.333333      NaN ...      NaN      NaN
2019-01-02      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-03      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-04      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-05      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-06      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-07      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-08      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-09      NaN      NaN       NaN      NaN ...      NaN      NaN
2019-01-10      NaN      NaN       NaN      NaN ...      NaN      NaN

                                                                        \
           16:00:00 17:00:00 18:00:00 19:00:00 20:00:00 21:00:00 22:00:00
2019-01-01      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-02      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-03      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-04      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-05      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-06      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-07      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-08      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2019-01-10      NaN      NaN      NaN      NaN      NaN      NaN      NaN

           23:00:00
2019-01-01      NaN
2019-01-02      NaN
2019-01-03      NaN
2019-01-04      NaN
2019-01-05      NaN
2019-01-06      NaN
2019-01-07      NaN
2019-01-08      NaN
2019-01-09      NaN
2019-01-10      0.0

[10 rows x 24 columns]

Note I got a lot of NaN since you provided only a couple of rows of data.
Solutions for a one-column DataFrame:

Aggregate mean by the DatetimeIndex, using DatetimeIndex.floor to remove the times and DatetimeIndex.hour, reshape with Series.unstack and add missing values with DataFrame.reindex:

#if necessary
#df.index = pd.to_datetime(df.index)

rng = pd.date_range(df.index.min().floor('D'), df.index.max().floor('D'))
df1 = (df.groupby([df.index.floor('D'), df.index.hour])['solar_radiation']
         .mean()
         .unstack(fill_value=0)
         .reindex(columns=range(0, 24), fill_value=0, index=rng))

Another solution with Grouper by hour, replacing missing values with 0 and reshaping with Series.unstack:

#if necessary
#df.index = pd.to_datetime(df.index)

df1 = df.groupby(pd.Grouper(freq='H'))[['solar_radiation']].mean().fillna(0)
df1 = df1.set_index([df1.index.date, df1.index.hour])['solar_radiation'].unstack(fill_value=0)
print (df1)

              0    1    2    3    4    5    6    7         8    9  ...   14  \
2019-01-01  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  8.333333  0.0 ...  0.0
2019-01-02  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-03  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-04  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-05  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-06  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-07  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-08  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-09  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0
2019-01-10  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0 ...  0.0

             15   16   17   18   19   20   21   22   23
2019-01-01  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-02  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-03  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-04  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-05  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-06  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-07  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-08  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-09  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2019-01-10  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

[10 rows x 24 columns]

Solutions for a Series with a DatetimeIndex:

rng = pd.date_range(df.index.min().floor('D'), df.index.max().floor('D'))
df1 = (df.groupby([df.index.floor('D'), df.index.hour])
         .mean()
         .unstack(fill_value=0)
         .reindex(columns=range(0, 24), fill_value=0, index=rng))

df1 = df.groupby(pd.Grouper(freq='H')).mean().to_frame('new').fillna(0)
df1 = df1.set_index([df1.index.date, df1.index.hour])['new'].unstack(fill_value=0)
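For completeness, pivot_table can condense either approach into a single call; a sketch under the same assumptions (a one-column DataFrame with a DatetimeIndex). Days or hours entirely absent from the data would still need the reindex shown above:

df1 = df.pivot_table(index=df.index.date, columns=df.index.hour,
                     values='solar_radiation', aggfunc='mean', fill_value=0)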
Is there a way to merge pandas dataframes on row and column index?
I want to merge two pandas DataFrames that share the same index as well as some columns. pd.merge creates duplicate columns, but I would like to merge on both axes at the same time. I tried pd.merge and pd.concat but did not get the right result. My try:

df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')

df1
     Var#1  Var#2  Var#3  Var#4  Var#5  Var#6  Var#7
ID
323      7      6      8    7.0    2.0    2.0   10.0
324      2      1      5    3.0    4.0    2.0    1.0
675      9      8      1    NaN    NaN    NaN    NaN
676      3      7      2    NaN    NaN    NaN    NaN

df2
     Var#6  Var#7  Var#8  Var#9
ID
675      1      9      2      8
676      3      2      0      7

Ideally I would get:

df3
     Var#1  Var#2  Var#3  Var#4  Var#5  Var#6  Var#7  Var#8  Var#9
ID
323      7      6      8    7.0    2.0    2.0   10.0    NaN    NaN
324      2      1      5    3.0    4.0    2.0    1.0    NaN    NaN
675      9      8      1    NaN    NaN      1      9      2      8
676      3      7      2    NaN    NaN      3      2      0      7
IIUC, use df.combine_first():

df3 = df1.combine_first(df2)
print(df3)

     Var#1  Var#2  Var#3  Var#4  Var#5  Var#6  Var#7  Var#8  Var#9
ID
323      7      6      8    7.0    2.0    2.0   10.0    NaN    NaN
324      2      1      5    3.0    4.0    2.0    1.0    NaN    NaN
675      9      8      1    NaN    NaN    1.0    9.0    2.0    8.0
676      3      7      2    NaN    NaN    3.0    2.0    0.0    7.0
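Note in the output that combine_first upcasts the filled columns to float (1 becomes 1.0). If the integer dtypes matter, convert_dtypes can move them to nullable integers; a small sketch:

df3 = df1.combine_first(df2).convert_dtypes()  # integral floats become Int64, NaN becomes <NA>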
You can concat and group the data:

pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).first()

     Var#1  Var#2  Var#3  Var#4  Var#5  Var#6  Var#7  Var#8  Var#9
ID
323    7.0    6.0    8.0    7.0    2.0    2.0   10.0    NaN    NaN
324    2.0    1.0    5.0    3.0    4.0    2.0    1.0    NaN    NaN
675    9.0    8.0    1.0    NaN    NaN    1.0    9.0    2.0    8.0
676    3.0    7.0    2.0    NaN    NaN    3.0    2.0    0.0    7.0
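In recent pandas versions, column-wise groupby (the axis=1 argument) is deprecated, so the same first-non-null pick can be done on the transpose instead; a sketch:

df3 = pd.concat([df1, df2], axis=1)      # duplicate Var#6/Var#7 columns side by side
df3 = df3.T.groupby(level=0).first().T   # keep the first non-null value per column name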