Dividing selected columns in pandas

This is the dataframe:
bins year binA binB binC binD binE binF binG binH
0 1998 4.0 5.0 1.0 1.0 2.0 0.0 1.0 0.0
1 1999 4.0 2.0 1.0 0.0 0.0 4.0 1.0 2.0
2 2000 4.0 1.0 1.0 0.0 4.0 1.0 1.0 2.0
3 2001 2.0 1.0 4.0 1.0 1.0 0.0 2.0 3.0
My goal is to divide binA through binH by the sum of binA:binH, i.e. for the 1998 row, divide each bin value by the sum of that row excluding the year.
Sum of desired columns:
newdfdd.loc[:,'binA':'binH'].sum(axis=1)
To get the desired values, this is what I have tried:
newdfdd[['binA','binB','binC','binD','binE',
         'binF','binG','binH']].div(newdfdd.loc[:,'binA':'binH'].sum(axis=1))
But I get NaN and four extra columns, as follows:
0 1 2 3 binA binB binC binD binE binF binG binH
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I want results in the following format:
bins year binA binB binC binD binE binF binG binH
0 1998 0.285 0.357 ... .... .... .... ... ...
1 1999 .. .. .. .. .. .. .. ..
.... stands for some number from the calculation.
What do I need to edit in my code for the desired output?

In the div call you need to pass axis='index'. Without it, the Series of row sums is aligned against the column labels, which is why you get NaNs plus the extra 0 to 3 columns; with axis='index' it aligns on the row index and gives the result you are looking for.
So your code above should look like:
newdfdd.update(newdfdd.loc[:,'binA':'binH'].div(newdfdd.loc[:,'binA':'binH'].sum(axis=1),
                                                axis='index'))
This will compute your percent of row sum as desired and then update the values in place in the newdfdd dataframe.
Here is my solution in its entirety for clarity (I used df and random values, but the rest is the same):
df = pd.DataFrame({'bins':[0,1,2,3],
                   'year':[1998,1999,2000,2001],
                   'binA':np.random.randint(1,10,4),
                   'binB':np.random.randint(1,10,4),
                   'binC':np.random.randint(1,10,4),
                   'binD':np.random.randint(1,10,4),
                   'binE':np.random.randint(1,10,4),
                   'binF':np.random.randint(1,10,4),
                   'binG':np.random.randint(1,10,4),
                   'binH':np.random.randint(1,10,4)})
# reordering columns to match your dataframe layout
df = df[['bins','year','binA','binB','binC','binD','binE',
         'binF','binG','binH']]
df.update(df.loc[:,'binA':'binH'].div(df.loc[:,'binA':'binH'].sum(axis=1),axis='index'))
print(df)
bins year binA binB binC binD binE binF binG binH
0 0 1998 0.222222 0.037037 0.148148 0.185185 0.037037 0.111111 0.037037 0.222222
1 1 1999 0.264706 0.058824 0.205882 0.058824 0.029412 0.147059 0.176471 0.058824
2 2 2000 0.166667 0.041667 0.145833 0.020833 0.166667 0.166667 0.145833 0.145833
3 3 2001 0.062500 0.187500 0.020833 0.145833 0.083333 0.166667 0.166667 0.166667

I think this is the result you are looking for:
df['rowSum'] = df[df.columns[2:]].apply(sum, axis=1)
df[df.columns[2:]].apply(lambda x: (x / x['rowSum']), axis=1).drop(columns=['rowSum'])
binA binB binC binD binE binF binG binH
0 0.285714 0.357143 0.071429 0.071429 0.142857 0.000000 0.071429 0.000000
1 0.285714 0.142857 0.071429 0.000000 0.000000 0.285714 0.071429 0.142857
2 0.285714 0.071429 0.071429 0.000000 0.285714 0.071429 0.071429 0.142857
3 0.142857 0.071429 0.285714 0.071429 0.071429 0.000000 0.142857 0.214286
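Note the result above drops the bins and year columns. If you want them back alongside the ratios, one small follow-up (a sketch; ratios and result are just assumed variable names, and it reuses the rowSum column created above):
# keep the normalized values, then reattach the identifier columns by index alignment
ratios = df[df.columns[2:]].apply(lambda x: x / x['rowSum'], axis=1).drop(columns=['rowSum'])
result = pd.concat([df[['bins', 'year']], ratios], axis=1)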

Related

Why can't I have a usual dataframe after using pivot()?

Under the variable names, there is an extra row that I do not want in my data set:
fdi_autocracy = fdi_autocracy.pivot(index=["Country", "regime", "Year"],
                                    columns="partner_regime",
                                    values=['FDI_outward', "FDI_inward", "total_fdi"],
                                    ).reset_index()
Country regime Year FDI_outward FDI_inward total_fdi
partner_regime 0.0 0.0 0.0
0 Albania 0.0 1995 NaN NaN NaN
1 Albania 0.0 1996 NaN NaN NaN
2 Albania 0.0 1997 NaN NaN NaN
3 Albania 0.0 1998 NaN NaN NaN
4 Albania 0.0 1999 NaN NaN NaN
What I want is following:
Country regime Year FDI_outward FDI_inward total_fdi
0 Albania 0.0 1995 NaN NaN NaN
1 Albania 0.0 1996 NaN NaN NaN
2 Albania 0.0 1997 NaN NaN NaN
3 Albania 0.0 1998 NaN NaN NaN
4 Albania 0.0 1999 NaN NaN NaN
IIUC, you don't need the partner_regime label?
This removes that title:
fdi_autocracy.rename_axis(columns=[None, None])
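Chained onto the pivot from the question and assigned back (rename_axis returns a new frame rather than changing it in place), a sketch of the whole call might look like:
fdi_autocracy = (fdi_autocracy.pivot(index=["Country", "regime", "Year"],
                                     columns="partner_regime",
                                     values=["FDI_outward", "FDI_inward", "total_fdi"])
                 .reset_index()
                 .rename_axis(columns=[None, None]))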

How to remove periods of time in a dataframe?

I have this df:
CODE YEAR MONTH DAY TMAX TMIN PP BAD PERIOD 1 BAD PERIOD 2
9984 000130 1991 1 1 32.6 23.4 0.0 1991 1998
9985 000130 1991 1 2 31.2 22.4 0.0 NaN NaN
9986 000130 1991 1 3 32.0 NaN 0.0 NaN NaN
9987 000130 1991 1 4 32.2 23.0 0.0 NaN NaN
9988 000130 1991 1 5 30.5 22.0 0.0 NaN NaN
... ... ... ... ... ... ...
20118 000130 2018 9 30 31.8 21.2 NaN NaN NaN
30028 000132 1991 1 1 35.2 NaN 0.0 2005 2010
30029 000132 1991 1 2 34.6 NaN 0.0 NaN NaN
30030 000132 1991 1 3 35.8 NaN 0.0 NaN NaN
30031 000132 1991 1 4 34.8 NaN 0.0 NaN NaN
... ... ... ... ... ... ...
50027 000132 2019 10 5 36.5 NaN 13.1 NaN NaN
50028 000133 1991 1 1 36.2 NaN 0.0 1991 2010
50029 000133 1991 1 2 36.6 NaN 0.0 NaN NaN
50030 000133 1991 1 3 36.8 NaN 5.0 NaN NaN
50031 000133 1991 1 4 36.8 NaN 0.0 NaN NaN
... ... ... ... ... ... ...
54456 000133 2019 10 5 36.5 NaN 12.1 NaN NaN
I want to change the values of the columns TMAX, TMIN and PP to NaN, but only for the periods specified in BAD PERIOD 1 and BAD PERIOD 2, AND ONLY WITHIN THEIR RESPECTIVE CODE. For example, if BAD PERIOD 1 is 1991 and BAD PERIOD 2 is 1998, I want all the values of TMAX, TMIN and PP with code 000130 to become NaN from 1991 (bad period 1) to 1998 (bad period 2). I have 371 unique codes in the CODE column, so I might use df.groupby("CODE").
Expected result after the change:
CODE YEAR MONTH DAY TMAX TMIN PP BAD PERIOD 1 BAD PERIOD 2
9984 000130 1991 1 1 NaN NaN NaN 1991 1998
9985 000130 1991 1 2 NaN NaN NaN NaN NaN
9986 000130 1991 1 3 NaN NaN NaN NaN NaN
9987 000130 1991 1 4 NaN NaN NaN NaN NaN
9988 000130 1991 1 5 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
20118 000130 2018 9 30 31.8 21.2 NaN NaN NaN
30028 000132 1991 1 1 35.2 NaN 0.0 2005 2010
30029 000132 1991 1 2 34.6 NaN 0.0 NaN NaN
30030 000132 1991 1 3 35.8 NaN 0.0 NaN NaN
30031 000132 1991 1 4 34.8 NaN 0.0 NaN NaN
... ... ... ... ... ... ...
50027 000132 2019 10 5 36.5 NaN 13.1 NaN NaN
50028 000133 1991 1 1 NaN NaN NaN 1991 2010
50029 000133 1991 1 2 NaN NaN NaN NaN NaN
50030 000133 1991 1 3 NaN NaN NaN NaN NaN
50031 000133 1991 1 4 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
54456 000133 2019 10 5 36.5 NaN 12.1 NaN NaN
You can propagate the values in your bad-period columns with ffill, provided the non-NaN values are always in the first row of each CODE group and your data is ordered by CODE. If not, use groupby.transform with 'first'. Then use mask to replace the values with NaN wherever YEAR falls between your two bad-period columns once they have been filled with the wanted value.
df_ = df[['BAD_1', 'BAD_2']].ffill()
#or more flexible df_ = df.groupby("CODE")[['BAD_1', 'BAD_2']].transform('first')
cols = ['TMAX', 'TMIN', 'PP']
df[cols] = df[cols].mask(df['YEAR'].ge(df_['BAD_1'])
                         & df['YEAR'].le(df_['BAD_2']))
print(df)
CODE YEAR MONTH DAY TMAX TMIN PP BAD_1 BAD_2
9984 130 1991 1 1 NaN NaN NaN 1991.0 1998.0
9985 130 1991 1 2 NaN NaN NaN NaN NaN
9986 130 1991 1 3 NaN NaN NaN NaN NaN
9987 130 1991 1 4 NaN NaN NaN NaN NaN
9988 130 1991 1 5 NaN NaN NaN NaN NaN
20118 130 2018 9 30 31.8 21.2 NaN NaN NaN
30028 132 1991 1 1 35.2 NaN 0.0 2005.0 2010.0
30029 132 1991 1 2 34.6 NaN 0.0 NaN NaN
30030 132 1991 1 3 35.8 NaN 0.0 NaN NaN
30031 132 1991 1 4 34.8 NaN 0.0 NaN NaN
50027 132 2019 10 5 36.5 NaN 13.1 NaN NaN
50028 133 1991 1 1 NaN NaN NaN 1991.0 2010.0
50029 133 1991 1 2 NaN NaN NaN NaN NaN
50030 133 1991 1 3 NaN NaN NaN NaN NaN
50031 133 1991 1 4 NaN NaN NaN NaN NaN
54456 133 2019 10 5 36.5 NaN 12.1 NaN NaN
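Note the snippet above uses shortened column names (BAD_1, BAD_2). If your frame still carries the original headers from the question, a rename first keeps the rest of the code unchanged (a sketch):
df = df.rename(columns={'BAD PERIOD 1': 'BAD_1', 'BAD PERIOD 2': 'BAD_2'})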

Add header to .data file in Pandas

Given a file with the extension .data, I have read it with pd.read_fwf("./input.data", sep=",", header=None):
Out:
0
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3...
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5...
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6...
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5...
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4...
... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2...
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2...
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4...
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2...
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0...
How can I add the following column names to it? Thanks.
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
Update:
pd.read_fwf("./input.data", names = col_names)
Out:
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If you check read_fwf's documentation:
Read a table of fixed-width formatted lines into DataFrame.
So if there is a , separator, use read_csv instead:
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("input.data", names=col_names)
print (df)
age sex cp restbp chol fbs restecg thalach exang oldpeak \
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4
.. ... ... ... ... ... ... ... ... ... ...
292 57.0 0.0 4.0 140.0 241.0 0.0 0.0 123.0 1.0 0.2
293 45.0 1.0 1.0 110.0 264.0 0.0 0.0 132.0 0.0 1.2
294 68.0 1.0 4.0 144.0 193.0 1.0 0.0 141.0 0.0 3.4
295 57.0 1.0 4.0 130.0 131.0 0.0 0.0 115.0 1.0 1.2
296 57.0 0.0 2.0 130.0 236.0 0.0 2.0 174.0 0.0 0.0
slope ca thal num
0 3.0 0.0 6.0 0
1 2.0 3.0 3.0 1
2 2.0 2.0 7.0 1
3 3.0 0.0 3.0 0
4 1.0 0.0 3.0 0
.. ... ... ... ...
292 2.0 0.0 7.0 1
293 2.0 0.0 7.0 1
294 2.0 2.0 7.0 1
295 2.0 1.0 7.0 1
296 2.0 1.0 3.0 1
[297 rows x 14 columns]
Just do a read_csv without header and pass col_names:
df = pd.read_csv('input.data', header=None, names=col_names)
Output (head):
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
-- ----- ----- ---- -------- ------ ----- --------- --------- ------- --------- ------- ---- ------ -----
0 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3 3 1
2 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
3 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
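If the file has already been loaded with read_fwf into a single column as in the question, one way to recover the columns without re-reading it is to split the comma-joined strings (a sketch, assuming every row holds 14 numeric fields; split_df is just an assumed name):
# split the single column of comma-joined strings, attach the names, convert to numbers
split_df = df[0].str.split(',', expand=True)
split_df.columns = col_names
split_df = split_df.apply(pd.to_numeric, errors='coerce')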

How to get proportions of different observation types per total and per year in pandas

I am not entirely new to data science, but I am rather a novice with pandas.
My data looks like this:
Date Obser_Type
0 2001-01-05 A
1 2002-02-06 A
2 2002-02-06 B
3 2004-03-07 C
4 2005-04-08 B
5 2006-05-09 A
6 2007-06-10 C
7 2007-07-11 B
I would like to get the following output, with the proportions for the different kinds of observations as of the total (i.e. accumulated from the beginning up to and including the specified year) and within each year:
Year A_%_total B_%_total C_%_total A_%_Year B_%_Year C_%_Year
0 2001 100 0 0 100 0 0
1 2002 67 33 0 50 50 0
2 2004 50 25 25 0 0 100
3 2005 40 40 20 0 100 0
4 2006 50 33 17 100 0 0
5 2007 37.5 37.5 25 0 50 50
I tried various approaches involving groupby, multiindexing, count etc but to no avail. I got either errors or something unsatisfying.
After extensively digging Stack Overflow and the rest of the internet for days, I am stumped.
The medieval way would be a bucket of loops and ifs, but what is the proper way to do this?
I have filled in placeholder values for the numbers. I don't know the intended aggregation logic for each of them, but I decided to create a composition ratio by 'Obser_Type' and a composition ratio by year:
1. Add a new column for the year
2. Aggregate and create a DF
3. Create the composition ratio
4. Aggregate and create a DF
5. Create the composition ratio
6. Combine the two DFs
import pandas as pd
import numpy as np
import io
data = '''
Date Obser_Type Value
0 2001-01-05 A 34
1 2002-02-06 A 39
2 2002-02-06 B 67
3 2004-03-07 C 20
4 2005-04-08 B 29
5 2006-05-09 A 10
6 2007-06-10 C 59
7 2007-07-11 B 43
'''
df = pd.read_csv(io.StringIO(data), sep=' ')
df['Date'] = pd.to_datetime(df['Date'])
df['yyyy'] = df['Date'].dt.year
df1 = df.groupby(['yyyy','Obser_Type'])['Value'].agg(sum).unstack().fillna(0)
df1 = df1.apply(lambda x: x/sum(x), axis=0).rename(columns={'A':'A_%_total','B':'B_%_total','C':'C_%_total'})
df2 = df.groupby(['Obser_Type','yyyy'])['Value'].agg(sum).unstack().fillna(0)
df2 = df2.apply(lambda x: x/sum(x), axis=0)
df2 = df2.unstack().unstack().rename(columns={'A':'A_%_Year','B':'B_%_Year','C':'C_%_Year'})
pd.merge(df1, df2, on='yyyy')
Obser_Type A_%_total B_%_total C_%_total A_%_Year B_%_Year C_%_Year
yyyy
2001 0.409639 0.000000 0.000000 1.000000 0.000000 0.000000
2002 0.469880 0.482014 0.000000 0.367925 0.632075 0.000000
2004 0.000000 0.000000 0.253165 0.000000 0.000000 1.000000
2005 0.000000 0.208633 0.000000 0.000000 1.000000 0.000000
2006 0.120482 0.000000 0.000000 1.000000 0.000000 0.000000
2007 0.000000 0.309353 0.746835 0.000000 0.421569 0.578431
Thank you very much for your answer. However, I probably should have made it clearer that the actual dataframe is much bigger and has many more types of observations than A, B, C, so listing them manually would be inconvenient. My scope here is just the statistics for the different types of observations, not their associated numerical values.
I was able to build something and would like to share:
# convert dates to datetimes
#
df['Date'] = pd.to_datetime(df['Date'])
# get years from the dates
#
df['Year'] = df.Date.dt.year
# get total number of observations per type of observation and year in tabular form
#
grouped = df.groupby(['Year', 'Obser_Type']).count().unstack(1)
Date
Obser_Type A B C
Year
2001 1.0 NaN NaN
2002 1.0 1.0 NaN
2004 NaN NaN 1.0
2005 NaN 1.0 NaN
2006 1.0 NaN NaN
2007 NaN 1.0 1.0
# sum total number of observations per type over all years
#
grouped.loc['Total_Obs_per_Type',:] = grouped.sum(axis=0)
Date
Obser_Type A B C
Year
2001 1.0 NaN NaN
2002 1.0 1.0 NaN
2004 NaN NaN 1.0
2005 NaN 1.0 NaN
2006 1.0 NaN NaN
2007 NaN 1.0 1.0
Total_Obs_per_Type 3.0 3.0 2.0
# at this point the columns have a multiindex
#
grouped.columns
MultiIndex([('Date', 'A'),
('Date', 'B'),
('Date', 'C')],
names=[None, 'Obser_Type'])
# i only needed the second layer which looks like this
#
grouped.columns.get_level_values(1)
Index(['A', 'B', 'C'], dtype='object', name='Obser_Type')
# so i flattened the index
#
grouped.columns = grouped.columns.get_level_values(1)
# now i can easily address the columns
#
grouped.columns
Index(['A', 'B', 'C'], dtype='object', name='Obser_Type')
# create list of columns with observation types
# this refers to columns "A B C"
#
types_list = grouped.columns.values.tolist()
# create list to later access the columns with the cumulative sum of observations per type
# this refers to columns "A_cum B_cum C_cum"
#
types_cum_list = []
# calculate cumulative sum for the different kinds of observations
#
for columnName in types_list:
    # create new columns with modified name and calculate for each type of observation the cumulative sum of observations
    #
    grouped[columnName+'_cum'] = grouped[columnName].cumsum()
    # put the new column names in the list of columns with cumulative sum of observations per type
    #
    types_cum_list.append(columnName+'_cum')
# this gives
Obser_Type A B C A_cum B_cum C_cum
Year
2001 1.0 NaN NaN 1.0 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 NaN
2004 NaN NaN 1.0 NaN NaN 1.0
2005 NaN 1.0 NaN NaN 2.0 NaN
2006 1.0 NaN NaN 3.0 NaN NaN
2007 NaN 1.0 1.0 NaN 3.0 2.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0
# create new column with total number of observations for all types of observation within a single year
#
grouped['All_Obs_Y'] = grouped.loc[:,types_list].sum(axis=1)
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0
2004 NaN NaN 1.0 NaN NaN 1.0 1.0
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0
2006 1.0 NaN NaN 3.0 NaN NaN 1.0
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0
# create new columns with cumulative sum of all kinds observations up to each year
#
grouped['All_Obs_Cum'] = grouped['All_Obs_Y'].cumsum()
# this gives
# sorry i could not work out the formatting and i am not allowed yet to include screenshots
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0 1.0
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0 3.0
2004 NaN NaN 1.0 NaN NaN 1.0 1.0 4.0
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0 5.0
2006 1.0 NaN NaN 3.0 NaN NaN 1.0 6.0
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0 8.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0
# create list of columns with the percentages each type of observation has within the observations of each year
# this refers to columns "A_%_Y B_%_Y C_%_Y"
#
types_percent_Y_list = []
# calculate the percentages each type of observation has within each year
#
for columnName in types_list:
    # calculate percentages
    #
    grouped[columnName+'_%_Y'] = grouped[columnName] / grouped['All_Obs_Y']
    # put the new column names in the list of columns with percentages each type of observation has within a year for later access
    #
    types_percent_Y_list.append(columnName+'_%_Y')
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0 1.0 1.000 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0 3.0 0.500 0.500 NaN
2004 NaN NaN 1.0 NaN NaN 1.0 1.0 4.0 NaN NaN 1.00
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0 5.0 NaN 1.000 NaN
2006 1.0 NaN NaN 3.0 NaN NaN 1.0 6.0 1.000 NaN NaN
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0 8.0 NaN 0.500 0.50
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25
# replace the NaNs in the types_cum columns, otherwise the calculation of the cumulative percentages in the next step would not work
#
# types_cum_list :
# if there is no observation for e.g. type B in the first year (2001) we put a count of 0 for that year,
# that is, in the first row.
# If there is no observation for type B in a later year (e.g. 2004) the cumulative count of Bs
# from the beginning up to that year does not change in that year, so we replace the NaN there with
# the last non-NaN value preceding it
#
# replace NaNs in first row by 0
#
for columnName in types_cum_list:
    grouped.update(grouped.iloc[:1][columnName].fillna(value=0))
# replace NaNs in later rows with preceding non-NaN value
#
for columnName in types_cum_list:
    grouped[columnName].fillna(method='ffill', inplace=True)
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y
Year
2001 1.0 NaN NaN 1.0 0.0 0.0 1.0 1.0 1.000 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 0.0 2.0 3.0 0.500 0.500 NaN
2004 NaN NaN 1.0 2.0 1.0 1.0 1.0 4.0 NaN NaN 1.00
2005 NaN 1.0 NaN 2.0 2.0 1.0 1.0 5.0 NaN 1.000 NaN
2006 1.0 NaN NaN 3.0 2.0 1.0 1.0 6.0 1.000 NaN NaN
2007 NaN 1.0 1.0 3.0 3.0 2.0 2.0 8.0 NaN 0.500 0.50
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25
# create list of the columns with the cumulative percentages of the different observation types from the beginning up to that year
# this refers to columns "A_cum_% B_cum_% C_cum_%"
#
types_cum_percent_list = []
# calculate cumulative proportions of different types of observations from beginning up to each year
#
for columnName in types_cum_list:
    # if we had not taken care of the NaNs in the types_cum columns this would produce incorrect numbers
    #
    grouped[columnName+'_%'] = grouped[columnName] / grouped['All_Obs_Cum']
    # put the new columns in their respective list so we can access them conveniently later
    #
    types_cum_percent_list.append(columnName+'_%')
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y A_cum_% B_cum_% C_cum_%
Year
2001 1.0 NaN NaN 1.0 0.0 0.0 1.0 1.0 1.000 NaN NaN 1.000000 0.000000 0.000000
2002 1.0 1.0 NaN 2.0 1.0 0.0 2.0 3.0 0.500 0.500 NaN 0.666667 0.333333 0.000000
2004 NaN NaN 1.0 2.0 1.0 1.0 1.0 4.0 NaN NaN 1.00 0.500000 0.250000 0.250000
2005 NaN 1.0 NaN 2.0 2.0 1.0 1.0 5.0 NaN 1.000 NaN 0.400000 0.400000 0.200000
2006 1.0 NaN NaN 3.0 2.0 1.0 1.0 6.0 1.000 NaN NaN 0.500000 0.333333 0.166667
2007 NaN 1.0 1.0 3.0 3.0 2.0 2.0 8.0 NaN 0.500 0.50 0.375000 0.375000 0.250000
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25 0.375000 0.375000 0.250000
# to conclude i replace the remaining NaNs to make plotting easier
# replace NaNs in columns in types_list
#
# if there is no observation for a type of observation in a year we put a count of 0 for that year
#
for columnName in types_list:
    grouped[columnName].fillna(value=0, inplace=True)
# replace NaNs in columns in types_percent_Y_list
#
# if there is no observation for a type of observation in a year we put a percentage of 0 for that year
#
for columnName in types_percent_Y_list:
    grouped[columnName].fillna(value=0, inplace=True)
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y A_cum_% B_cum_% C_cum_%
Year
2001 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.000 0.000 0.00 1.000000 0.000000 0.000000
2002 1.0 1.0 0.0 2.0 1.0 0.0 2.0 3.0 0.500 0.500 0.00 0.666667 0.333333 0.000000
2004 0.0 0.0 1.0 2.0 1.0 1.0 1.0 4.0 0.000 0.000 1.00 0.500000 0.250000 0.250000
2005 0.0 1.0 0.0 2.0 2.0 1.0 1.0 5.0 0.000 1.000 0.00 0.400000 0.400000 0.200000
2006 1.0 0.0 0.0 3.0 2.0 1.0 1.0 6.0 1.000 0.000 0.00 0.500000 0.333333 0.166667
2007 0.0 1.0 1.0 3.0 3.0 2.0 2.0 8.0 0.000 0.500 0.50 0.375000 0.375000 0.250000
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25 0.375000 0.375000 0.250000
This has the functionality and flexibility I was looking for. But as I am still learning pandas, suggestions for improvement are appreciated.
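One possible shortening (a sketch, not tested on the full dataset, assuming Date has already been converted with pd.to_datetime): since only the counts per type matter, pd.crosstab can build the year-by-type table without listing the types by hand, and both sets of ratios then come from a cumulative sum and a row-wise division. The result holds proportions rather than percentages, and the column suffixes are just assumed names.
# yearly counts of each observation type
counts = pd.crosstab(df['Date'].dt.year, df['Obser_Type'])
# share of each type within its own year
per_year = counts.div(counts.sum(axis=1), axis=0)
# share of each type among all observations up to and including each year
cum = counts.cumsum()
per_total = cum.div(cum.sum(axis=1), axis=0)
result = per_total.add_suffix('_%_total').join(per_year.add_suffix('_%_Year'))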

How to split Pandas Series into a DataFrame with columns for each hour of day?

I have a Pandas Series of solar radiation values with the index being timestamps with a one minute resolution. E.g.:
index solar_radiation
2019-01-01 08:01 0
2019-01-01 08:02 10
2019-01-01 08:03 15
...
2019-01-10 23:59 0
I would like to convert this to a table (DataFrame) where each hour is averaged into one column, e.g.:
index 00 01 02 03 04 05 06 ... 23
2019-01-01 0 0 0 0 0 3 10 ... 0
2019-01-02 0 0 0 0 0 4 12 ... 0
....
2019-01-10 0 0 0 0 0 6 24... 0
I have tried to look into groupby, but there I am only able to group the hours into one combined bin and not one for each day. Any hints or suggestions as to how I can achieve this with groupby, or should I just brute-force it and iterate over each hour?
If I understand you correctly, you want to resample hourly. Then we can make a MultiIndex with date and hour, and unstack the hour index to columns:
df = df.resample('H').mean()
df.set_index([df.index.date, df.index.time], inplace=True)
df = df.unstack(level=[1])
Which gives us the following output:
print(df)
solar_radiation \
00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
2019-01-01 NaN NaN NaN NaN NaN NaN
2019-01-02 NaN NaN NaN NaN NaN NaN
2019-01-03 NaN NaN NaN NaN NaN NaN
2019-01-04 NaN NaN NaN NaN NaN NaN
2019-01-05 NaN NaN NaN NaN NaN NaN
2019-01-06 NaN NaN NaN NaN NaN NaN
2019-01-07 NaN NaN NaN NaN NaN NaN
2019-01-08 NaN NaN NaN NaN NaN NaN
2019-01-09 NaN NaN NaN NaN NaN NaN
2019-01-10 NaN NaN NaN NaN NaN NaN
... \
06:00:00 07:00:00 08:00:00 09:00:00 ... 14:00:00 15:00:00
2019-01-01 NaN NaN 8.333333 NaN ... NaN NaN
2019-01-02 NaN NaN NaN NaN ... NaN NaN
2019-01-03 NaN NaN NaN NaN ... NaN NaN
2019-01-04 NaN NaN NaN NaN ... NaN NaN
2019-01-05 NaN NaN NaN NaN ... NaN NaN
2019-01-06 NaN NaN NaN NaN ... NaN NaN
2019-01-07 NaN NaN NaN NaN ... NaN NaN
2019-01-08 NaN NaN NaN NaN ... NaN NaN
2019-01-09 NaN NaN NaN NaN ... NaN NaN
2019-01-10 NaN NaN NaN NaN ... NaN NaN
\
16:00:00 17:00:00 18:00:00 19:00:00 20:00:00 21:00:00 22:00:00
2019-01-01 NaN NaN NaN NaN NaN NaN NaN
2019-01-02 NaN NaN NaN NaN NaN NaN NaN
2019-01-03 NaN NaN NaN NaN NaN NaN NaN
2019-01-04 NaN NaN NaN NaN NaN NaN NaN
2019-01-05 NaN NaN NaN NaN NaN NaN NaN
2019-01-06 NaN NaN NaN NaN NaN NaN NaN
2019-01-07 NaN NaN NaN NaN NaN NaN NaN
2019-01-08 NaN NaN NaN NaN NaN NaN NaN
2019-01-09 NaN NaN NaN NaN NaN NaN NaN
2019-01-10 NaN NaN NaN NaN NaN NaN NaN
23:00:00
2019-01-01 NaN
2019-01-02 NaN
2019-01-03 NaN
2019-01-04 NaN
2019-01-05 NaN
2019-01-06 NaN
2019-01-07 NaN
2019-01-08 NaN
2019-01-09 NaN
2019-01-10 0.0
[10 rows x 24 columns]
Note I got a lot of NaNs since you provided only a couple of rows of data.
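If you prefer plain hour numbers as column labels instead of the ('solar_radiation', time) MultiIndex shown above, one extra step might be (a sketch):
# keep only the hour of each time label from the second column level
df.columns = [t.hour for t in df.columns.get_level_values(1)]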
Solutions for a one-column DataFrame:
Aggregate the mean by DatetimeIndex, using DatetimeIndex.floor to remove the times and DatetimeIndex.hour, reshape with Series.unstack, and add missing values with DataFrame.reindex:
#if necessary
#df.index = pd.to_datetime(df.index)
rng = pd.date_range(df.index.min().floor('D'), df.index.max().floor('D'))
df1 = (df.groupby([df.index.floor('D'), df.index.hour])['solar_radiation']
         .mean()
         .unstack(fill_value=0)
         .reindex(columns=range(0, 24), fill_value=0, index=rng))
Another solution with Grouper by hour: replace missing values with 0 and reshape with Series.unstack:
#if necessary
#df.index = pd.to_datetime(df.index)
df1 = df.groupby(pd.Grouper(freq='H'))[['solar_radiation']].mean().fillna(0)
df1 = df1.set_index([df1.index.date, df1.index.hour])['solar_radiation'].unstack(fill_value=0)
print (df1)
0 1 2 3 4 5 6 7 8 9 ... 14 \
2019-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.333333 0.0 ... 0.0
2019-01-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-09 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
15 16 17 18 19 20 21 22 23
2019-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-09 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[10 rows x 24 columns]
Solutions for Series with DatetimeIndex:
rng = pd.date_range(df.index.min().floor('D'), df.index.max().floor('D'))
df1 = (df.groupby([df.index.floor('D'), df.index.hour])
         .mean()
         .unstack(fill_value=0)
         .reindex(columns=range(0, 24), fill_value=0, index=rng))
df1 = df.groupby(pd.Grouper(freq='H')).mean().to_frame('new').fillna(0)
df1 = df1.set_index([df1.index.date, df1.index.hour])['new'].unstack(fill_value=0)
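For a quick sanity check, a toy Series can be generated and run through the first approach (a sketch; the values are random placeholders):
import numpy as np
import pandas as pd

# one-minute timestamps over ten days with random "solar radiation" values
idx = pd.date_range('2019-01-01 08:00', '2019-01-10 23:59', freq='T')
s = pd.Series(np.random.rand(len(idx)) * 100, index=idx)

rng = pd.date_range(s.index.min().floor('D'), s.index.max().floor('D'))
out = (s.groupby([s.index.floor('D'), s.index.hour])
         .mean()
         .unstack(fill_value=0)
         .reindex(columns=range(0, 24), fill_value=0, index=rng))
print(out.shape)  # expected (10, 24)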