I have four columns of values consisting of the looking times to a right/left/upper/lower positioned image. Another column shows the position of the image that was chosen (decision_resp). I created a new column showing the looking time of the chosen image. Now I want to create 3 more columns showing the looking times of the not-chosen images, sorted by highest looking time (top1), second highest looking time (top2) and third highest looking time (top3). The looking time of the chosen image has to be excluded.
These are the columns I have:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
My approach was:
# create new column for looking time of chosen image
trials.loc[trials['decision_resp']=='right','chosen_img_et'] = trials['lookRight_t']
trials.loc[trials['decision_resp']=='left','chosen_img_et'] = trials['lookLeft_t']
trials.loc[trials['decision_resp']=='down','chosen_img_et'] = trials['lookDown_t']
trials.loc[trials['decision_resp']=='up','chosen_img_et'] = trials['lookUp_t']
# here I got stuck (pseudocode for what I want):
trials.loc[trials['decision_resp']=='right', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookDown_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='left', 3 new columns (top1/2/3)] = trials[['lookRight_t', 'lookDown_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='down', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookRight_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='up', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookDown_t', 'lookRight_t']].find max values and put it in order
Thank you for any help!
First you can use a lookup solution for the new column chosen_img_et.
Then you can sort the selected columns with numpy.sort and use indexing to take the top 3 values per row without the chosen value: the column names are compared against the decision_resp column with broadcasting, and the chosen values are set to missing with DataFrame.mask:
import numpy as np
import pandas as pd

cols = ['lookRight_t','lookLeft_t','lookUp_t','lookDown_t']
#strip the 'look'/'_t' affixes so the names match the decision_resp values
look = [x.replace('look','').replace('_t','').lower() for x in cols]
new = [f'top{x+1}' for x in range(3)]
#lookup
idx1, cols1 = pd.factorize(trials['decision_resp'])
trials['chosen_img_et'] = (trials[cols].set_axis(look, axis=1)
                                       .reindex(cols1, axis=1)
                                       .to_numpy()[np.arange(len(trials)), idx1])
mask = np.array(look) == trials['decision_resp'].to_numpy()[:, None]
#np.sort sorts ascending by default, so the NaN (the masked chosen value) lands
#in the last column; take columns 2..0 reversed ([:, 2::-1]) for descending order
trials[new] = np.sort(trials[cols].mask(mask), axis=1)[:, 2::-1]
print (trials)
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000
Details:
print (trials[cols].mask(mask))
lookRight_t lookLeft_t lookUp_t lookDown_t
0 1.291667 1.325000 NaN 1.141667
1 0.000000 0.000000 1.125000 NaN
2 0.000000 0.000000 NaN 2.275000
3 NaN 1.950000 0.000000 0.000000
4 NaN 1.316667 1.341667 0.000000
5 1.766667 1.333333 0.825000 NaN
6 0.000000 0.000000 1.108333 NaN
Find the name of the max column and extract the direction from it.
Find the max for each row.
Find the remaining 3 values and sort them as desired.
Concat the results together, renaming columns as I go.
import pandas as pd

# assumes df holds only the four look*_t columns at this point
decision_resp = df.idxmax(axis=1).str.extract(r'look(\w*)_t', expand=False)
decision_resp.rename('decision_resp', inplace=True)
chosen_img_et = df.max(axis=1, numeric_only=True)
chosen_img_et.rename('chosen_img_et', inplace=True)
# nlargest(4) sorts descending; slice off the first (chosen) value to keep 3
top3 = df.apply(lambda x: x.nlargest(4).sort_values(ascending=False, ignore_index=True)[1:], axis=1)
top3.columns = ['top1', 'top2', 'top3']
df = pd.concat([df, decision_resp, chosen_img_et, top3], axis=1)
print(df)
Output:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 Up 3.025000
1 0.000000 0.000000 1.125000 3.150000 Down 3.150000
2 0.000000 0.000000 3.508333 2.275000 Up 3.508333
3 3.700000 1.950000 0.000000 0.000000 Right 3.700000
4 2.633333 1.316667 1.341667 0.000000 Right 2.633333
5 1.766667 1.333333 0.825000 2.208333 Down 2.208333
6 0.000000 0.000000 1.108333 5.283333 Down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000
Other way, addressing jezrael's concerns:
import numpy as np
import pandas as pd

col_list = ['lookRight_t', 'lookLeft_t', 'lookUp_t', 'lookDown_t']
# map decision_resp back to column names ('right' -> 'lookRight_t')
chosen_col = 'look' + df['decision_resp'].str.title() + '_t'
idx, cols = pd.factorize(chosen_col)
df['chosen_img_et'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
mask = np.array(col_list) == chosen_col.to_numpy()[:, None]
df[[f'top{x+1}' for x in range(3)]] = np.sort(df[col_list].mask(mask), axis=1)[:, 2::-1]
Output:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000
I have a dataframe like the one below:
import numpy as np
import pandas as pd

arrays = [
np.array(["baz", "baz", "bar", "bar", "qux", "foo"]),
np.array(["yes", "no", "yes", "no", "yes", "no"]),
]
df = pd.DataFrame(np.random.randint(100, size=(6,4)), index=arrays)
df
Now I want to know the yes_rate (yes/all) of each column.
My implementation is below:
first_index_list = list(df.index.get_level_values(0).unique())
for index in first_index_list:
    index_sum = df.loc[index].sum()
    if df.index.isin([(index, 'yes')]).any():
        yes_rate = df.loc[(index, 'yes')] / index_sum
        df.loc[(index, 'yes_rate'), :] = yes_rate
    df.loc[(index, 'All'), :] = index_sum
df.sort_index()
but the code is not ideal; there are some problems:
How do I sort in the order below?
first index: [baz, bar, qux, foo], just as in the original data
second index: [no, yes, All, yes_rate]
If the code is executed repeatedly, the All and yes_rate values should not change.
And how do I add only yes and no when generating All (note: yes and no are not guaranteed to exist)?
index_sum = ...
if yes exists:
index_sum += df.loc[(index, 'yes')]
if no exists:
index_sum += df.loc[(index, 'no')]
IIUC, you can use pandas.concat to concatenate in the desired order, then only sort the first level:
l = ['baz', 'bar', 'qux', 'foo']
# positions for the custom first-level order, used as the sort_index key
order = pd.Series({k: v for v, k in enumerate(l)})
df_all = df.groupby(level=0).sum()
out = (pd
.concat([df,
pd.concat({'All': df_all,
'yes_rate': df.xs('yes', level=1).div(df_all)})
.dropna(how='all')
.swaplevel()
],)
.sort_index(level=0, key=order.reindex, sort_remaining=False)
)
output:
0 1 2 3
baz yes 20.000000 97.000000 95.000000 38.000000
no 85.000000 73.000000 23.000000 27.000000
All 105.000000 170.000000 118.000000 65.000000
yes_rate 0.190476 0.570588 0.805085 0.584615
bar yes 86.000000 32.000000 73.000000 16.000000
no 9.000000 97.000000 2.000000 55.000000
All 95.000000 129.000000 75.000000 71.000000
yes_rate 0.905263 0.248062 0.973333 0.225352
qux yes 69.000000 16.000000 92.000000 82.000000
All 69.000000 16.000000 92.000000 82.000000
yes_rate 1.000000 1.000000 1.000000 1.000000
foo no 77.000000 5.000000 12.000000 3.000000
All 77.000000 5.000000 12.000000 3.000000
def function1(ss: pd.Series):
    if ss.name == 'level_1':
        ss[6] = 'yes_rate'
        ss[-1] = 'ALL'
    else:
        total = ss.sum()            #sum of the original values only
        ss[6] = ss.iloc[0] / total  #yes_rate, computed before ALL is appended
        ss[-1] = total
    return ss.sort_index()          #labels -1/6 put ALL first, yes_rate last

(df.reset_index()
   .groupby('level_0')
   .apply(lambda dd: dd.drop(columns='level_0')  #drop the group key column
                       .apply(function1)
                       .set_index('level_1')))
0 1 2 3
level_0 level_1
bar ALL 164.000000 136.000000 176.000000 83.000000
yes 83.000000 70.000000 90.000000 1.000000
no 81.000000 66.000000 86.000000 82.000000
yes_rate 0.506098 0.514706 0.511364 0.012048
baz ALL 100.000000 32.000000 35.000000 143.000000
yes 31.000000 9.000000 8.000000 49.000000
no 69.000000 23.000000 27.000000 94.000000
yes_rate 0.310000 0.281250 0.228571 0.342657
foo ALL 51.000000 28.000000 28.000000 34.000000
no 51.000000 28.000000 28.000000 34.000000
yes_rate 1.000000 1.000000 1.000000 1.000000
qux ALL 6.000000 64.000000 32.000000 56.000000
yes 6.000000 64.000000 32.000000 56.000000
yes_rate 1.000000 1.000000 1.000000 1.000000
I have the following data:
Reporter ISO AFG AGO ALB ARE ARG ARM AUS AUT AZE BEL BFA BGD BGR BHR BIH BLR BOL BRA BWA CAN CHE CHL CHN CIV CMR COD COG COL CRI CUB CYP CZE DEU DNK DOM DZA ECU EGY ESP EST ETH FIN FRA GAB GBR GEO GHA GIN GNQ GRC GTM HKG HND HRV HTI HUN IDN IND IRL IRN IRQ ISR ITA JAM JOR JPN KAZ KEN KGZ KHM KOR KWT LAO LBN LBR LBY LKA LSO LTU LVA MAR MDA MDG MEX MKD MLI MMR MNG MOZ MRT MUS MYS NAM NER NGA NIC \
SITC
0011 0.000000 0.000000 0.000000 0.004613 0.033138 3.359434 8.726847 1.191898 0.002008 0.728046 0.437901 0.000000 1.799574 0.000000 0.421612 0.340513 0.000000 3.297533 2.755377 4.662559 0.002025 0.012436 0.050989 0.000000 0.000000 0.000000 0.000000 2.985603 0.058683 0.000000 0.005523 2.243466 0.481428 1.750662 0.000000 0.000000 0.000000 0.136531 2.034463 2.978088 0.361915 0.010336 6.421094 0.0 0.011530 11.906661 6.940706e-04 0.000000 0.0 0.020897 0.000583 0.000000 0.005827 5.801019 0.000000 2.783294 0.000000 0.056718 1.497509 0.002035 0.000000 0.000000e+00 0.034540 0.000000 0.000000 1.146623e-05 2.983633 0.328246 0.738821 0.000000 0.000000 0.005245 75.826263 2.077911 0.0 0.000000 0.000000 0.173766 2.974509 5.519316 0.000000 6.844561 0.000000 2.154048 0.594069 0.000000 26.490391 0.000000 0.000174 0.000000 0.000000 0.004466 34.975037 0.003507 0.000007 2.099508
0012 1.375525 0.000000 0.133913 0.084067 0.006681 16.383479 4.062770 0.201646 0.003303 0.011301 0.919492 0.000000 1.473571 0.000000 0.000000 0.010076 0.000000 0.000017 0.021425 0.021679 0.000147 0.030245 0.003140 0.000000 0.000362 0.000000 0.000000 0.000167 0.006618 0.000000 0.399117 0.075002 0.014116 0.007865 0.000000 0.000000 0.000000 0.061023 5.865379 0.549569 1.101432 0.000746 0.768423 0.0 0.714328 41.018233 1.013471e-03 0.000000 0.0 0.545327 0.000000 0.000000 0.000000 0.673710 0.000000 5.297044 0.015160 0.584192 0.072973 16.858856 0.000000 0.000000e+00 0.000389 0.000000 191.328757 0.000000e+00 5.018329 0.011129 2.938414 0.000000 0.000000 1.265317 1.255584 31.576510 0.0 0.000000 0.000000 4.233523 0.097645 0.119054 0.000000 2.542809 0.006852 0.001415 0.000000 0.000000 0.895853 0.000000 0.001034 0.000000 0.000000 0.000597 59.614346 0.111905 0.000000 0.004668
0013 0.000000 0.000000 0.000000 0.000000 0.002183 0.000000 0.000000 0.118824 0.000000 1.891616 0.000000 0.000000 0.166649 0.000000 0.434766 0.000000 0.000000 0.122091 0.000000 3.206870 0.002316 0.000000 0.667829 0.000000 0.000000 0.000000 0.000000 0.024970 0.198896 0.000000 0.004364 1.446706 0.571243 45.310792 0.000000 0.000000 0.000000 0.000000 2.285895 0.343793 0.000000 0.189181 0.984170 0.0 0.007111 0.000000 0.000000e+00 0.000000 0.0 1.500368 0.000000 0.000000 0.000000 11.634354 0.000000 2.297556 1.535006 0.000242 1.827070 0.000000 0.000000 0.000000e+00 0.001336 0.000000 0.000000 0.000000e+00 0.000000 0.008365 0.000000 0.012099 0.000000 0.000000 0.050146 0.000000 0.0 0.000000 0.000000 0.000000 2.107540 4.406836 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.027162 0.000000 0.000000 0.000000 0.000000 0.204559 0.000000 0.001059 0.000000 0.000000
0014 0.000000 0.000000 0.000000 0.015802 0.104162 0.000000 0.177206 1.673901 0.000000 2.213890 0.000000 0.000000 1.884009 0.002777 0.698578 0.256640 0.000000 2.192021 0.000053 0.693865 0.013870 0.000000 0.003156 0.000738 0.000000 0.001240 0.098058 0.311633 3.117316 0.000000 0.002897 2.372971 2.589253 5.315211 0.000000 0.000000 0.000000 0.008489 1.197066 0.067713 0.000000 0.238736 2.808734 0.0 2.362742 3.864489 1.944358e-03 0.000081 0.0 0.819977 0.855736 0.000005 0.419152 2.823619 0.000000 6.604809 0.011153 0.009950 0.259364 0.099504 0.000090 0.000000e+00 0.105376 0.078613 1.520528 8.424726e-07 0.053117 0.096929 0.088181 0.000000 0.001169 0.000311 1.712818 0.010306 0.0 0.000000 0.001087 0.004644 0.688807 0.541286 0.006068 0.000000 0.058922 0.000000 0.034544 0.040636 0.057898 0.000000 0.000000 0.000000 1.447492 3.829856 0.000223 0.011361 0.000636 0.000000
0015 0.000000 0.000000 0.000000 0.445323 2.195289 0.000000 2.307454 0.028947 0.026562 1.560077 0.002505 0.000008 0.211124 0.166718 0.003793 0.081355 0.000000 0.286775 0.000936 0.962403 0.279596 0.509734 0.000336 0.000000 0.000000 0.000124 0.000000 0.077661 0.110068 0.000000 0.009396 0.015761 0.912955 1.231179 0.027379 0.000000 0.007512 0.011966 0.231880 0.061287 0.000000 0.019807 2.278230 0.0 6.008955 0.000000 0.000000e+00 0.000000 0.0 0.003849 0.013636 6.383234 0.026093 0.512240 0.000000 0.090692 0.000028 0.003678 10.118232 0.002744 0.000000 5.552301e-03 0.052228 0.464133 0.076198 1.859848e-01 0.077715 0.009641 4.554831 0.000000 0.041567 0.030370 0.000706 0.050001 0.0 0.000000 0.001080 0.016164 0.098448 0.330189 0.035837 0.000000 0.000000 0.045076 0.002459 0.000000 0.000000 0.204988 0.000000 0.000000 0.323190 0.017127 0.118358 1.816213 0.000000 0.061127
0019 2.404198 0.391568 4.746603 0.936277 0.472483 0.029448 0.197240 0.184044 0.015951 4.052927 0.000028 0.000000 0.039448 0.696807 0.055031 0.278160 0.003875 0.039235 0.142762 1.905431 0.122174 0.426857 0.350115 0.002159 1.978486 0.490255 0.013705 0.773129 0.114343 0.578595 0.931443 1.205081 0.423272 4.725125 0.068519 0.006181 1.301403 24.162001 1.983499 0.085567 0.205218 0.010852 2.269924 0.0 1.177672 0.059922 4.544931e-01 0.321237 0.0 0.302619 0.026811 0.127578 0.008496 0.028143 0.566884 1.295090 0.458030 0.008520 0.141866 0.483547 0.000115 4.447335e+00 0.486243 0.215677 1.878902 3.570814e-01 0.547522 25.609184 1.088240 31.133218 0.071370 1.603031 0.557699 2.015355 0.0 0.013534 0.018164 0.000000 0.837837 0.472243 6.385796 0.143335 2.607711 0.253619 0.530978 3.819587 0.007063 1.876078 0.061506 0.003411 178.476453 0.162198 4.326213 0.699293 0.007034 1.278230
0111 0.000000 0.000000 0.079421 0.017039 8.901930 0.016561 6.866686 1.690299 0.002603 1.098923 0.000039 0.000431 0.011489 0.000469 0.079557 7.099912 0.266318 3.096303 4.389923 3.143088 0.000200 0.069349 0.000057 0.004926 0.000137 0.000000 0.000000 0.407158 0.751635 0.000000 0.010285 0.217308 0.600825 1.897867 0.006995 0.000000 0.000000 0.022382 1.547584 0.294325 5.218987 0.299988 1.380903 0.0 0.816653 0.024922 4.210469e-07 0.001651 0.0 0.043783 0.000253 0.018290 0.883637 1.489829 0.000000 0.225030 0.000000 0.117034 8.112654 0.039480 0.000779 2.220238e-05 0.599626 0.000475 0.012492 1.381745e-01 0.105032 0.535493 0.002049 0.000000 0.003681 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000387 1.866342 1.449632 0.000032 0.173435 0.000000 1.972488 0.000000 0.000000 0.004182 0.000000 0.000000 0.000000 0.000000 0.001961 2.985457 0.004949 0.000049 34.018866
0112 0.027311 0.002101 0.000000 0.803547 23.202077 0.004398 10.508959 0.519682 0.001062 0.073772 0.000000 0.000079 0.018364 0.003869 0.008124 4.198086 0.407166 16.772364 4.115303 0.606431 0.001592 0.717869 0.000413 0.001026 0.000165 0.000000 0.001540 0.745339 5.223481 0.000000 0.007199 0.025556 0.136980 0.196708 0.163228 0.001689 0.000194 0.005037 0.336007 0.083353 0.001620 0.047842 0.075827 0.0 0.151379 0.551067 1.252943e-05 0.000000 0.0 0.019852 0.497362 0.352512 1.498862 0.232900 0.000000 0.065593 0.000016 6.389757 1.316056 0.003152 0.000508 0.000000e+00 0.197412 0.005625 0.847232 1.302507e-01 0.171425 0.673704 0.000031 0.000000 0.000458 0.020664 0.351126 0.005583 0.0 0.000000 0.000000 0.000000 0.791597 0.202655 0.004815 0.478381 0.000000 0.227916 0.004645 0.000014 0.013300 0.016229 0.000000 0.000000 0.021640 0.006599 3.261191 0.000000 0.000355 37.088510
0121 0.001582 0.000000 0.000000 0.052209 0.206769 7.540986 23.627297 0.019910 0.013100 0.586377 0.000000 0.000279 0.530770 0.004078 0.011636 0.000000 0.000000 0.005445 0.004388 0.005877 0.000400 1.215561 0.021429 0.005590 0.003670 0.000000 0.001588 0.000000 0.000000 0.000000 0.050929 0.015053 0.104453 0.122767 0.000000 0.000000 0.000343 0.000481 1.535667 0.091318 52.880101 0.000933 0.329943 0.0 2.614312 9.232351 0.000000e+00 0.000000 0.0 2.868348 0.000000 0.026744 0.000000 0.128611 0.000000 0.049441 0.000000 0.747861 4.649703 0.194667 0.002230 0.000000e+00 0.095171 0.004332 0.429761 8.953811e-05 0.514561 21.753044 3.573798 0.000014 0.000000 0.000000 0.000000 0.086494 0.0 0.000000 0.000000 0.008787 0.013936 0.211361 0.000246 4.579582 0.051408 0.034361 3.610196 0.000000 0.176476 7.077663 0.000000 0.103573 0.002899 0.001336 2.437478 0.000000 0.006742 0.000000
0122 0.000000 0.000000 0.000000 0.001704 0.215322 0.000052 0.144528 1.429577 0.000000 1.869959 0.000000 0.000000 0.213683 0.000000 0.092459 0.052490 0.054549 3.761311 0.007157 3.351890 0.004849 4.318209 0.032182 0.001256 0.000000 0.000000 0.000951 0.000006 0.058686 0.000000 0.681542 0.204737 1.946622 13.168414 0.003606 0.000000 0.000000 0.000142 8.679341 1.176507 0.000000 0.649505 1.162449 0.0 0.604442 0.431210 0.000000e+00 0.000000 0.0 0.178929 0.000000 0.069482 0.002235 0.770522 0.000000 1.796879 0.000193 0.003041 1.622033 0.001978 0.000000 7.350485e-07 0.208763 0.000071 0.000000 6.750007e-03 0.014725 0.124527 0.000000 0.000000 0.000426 0.000000 0.058161 0.000042 0.0 0.000000 0.000541 0.000503 0.330381 0.321049 0.000264 0.000000 0.000000 0.859326 0.001074 0.000000 0.020189 0.000000 0.000000 0.000000 0.000342 0.014416 0.076196 0.112682 0.000000 0.015525
I want to drop the values that are less than 1.
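A minimal sketch of one way to do this, assuming "drop" means masking values below 1 as NaN (or, alternatively, dropping rows where no value reaches 1); the small frame here is just a stand-in for the wide SITC table above:
import pandas as pd

# stand-in for the wide SITC frame above
df = pd.DataFrame({'AFG': [0.000000, 1.375525],
                   'AGO': [0.000000, 0.000000],
                   'ALB': [0.000000, 0.133913]},
                  index=['0011', '0012'])

masked = df.where(df >= 1)            # values < 1 become NaN
filtered = df[(df >= 1).any(axis=1)]  # keep rows with at least one value >= 1
print(masked)
print(filtered)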
Here's code to create a DataFrame like the one I'm looking at:
import pandas as pd

df = pd.DataFrame({
'period' : [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 21, 22,
23, 25, 26],
'id' : [1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285,
1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285],
'pred': [-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775],
'ret' : [ None, -0.02222222, -0.01363636, 0. , -0.02764977,
None, -0.00909091, -0.01376147, 0.00465116, None,
0.01869159, 0. , 0. , None , -0.00460829,
0.00462963, 0.02304147, 0. , None, -0.00050756]})
Which will look like this when read in.
period id pred ret
0 1 1285 -1.653477 NaN
1 2 1285 -1.653477 -0.022222
2 3 1285 -1.653477 -0.013636
3 4 1285 -1.653477 0.000000
4 5 1285 -1.653477 -0.027650
5 8 1285 -1.653477 NaN
6 9 1285 -1.653477 -0.009091
7 10 1285 -1.653477 -0.013761
8 11 1285 -1.653477 0.004651
9 13 1285 -1.653477 NaN
10 14 1285 -1.653477 0.018692
11 15 1285 -1.653477 0.000000
12 16 1285 -1.653477 0.000000
13 19 1285 -1.653477 NaN
14 20 1285 -1.653477 -0.004608
15 21 1285 -1.653477 0.004630
16 22 1285 -1.653477 0.023041
17 23 1285 -1.653477 0.000000
18 25 1285 -1.653477 NaN
19 26 1285 -1.653477 -0.000508
pred is a 20-period prediction, and consequently what I want to do is bring the returns back 20 periods (but do it in a flexible way).
Here's the lag function I have presently:
import numpy as np
import pandas as pd

def lag(df, col, lag_dist=1, ref='period', group='id'):
    df = df.copy()
    new_col = 'lag' + str(lag_dist) + '_' + col
    df[new_col] = df.groupby(group)[col].shift(lag_dist)
    # set NaN where the shifted row's ref value differs from the specified offset
    df[new_col] = (df.groupby(group)[ref]
                   .shift(lag_dist)
                   .sub(df[ref])
                   .eq(-lag_dist)
                   .mul(1)
                   .replace(0, np.nan) * df[new_col])
    return df[new_col]
but when I run
df['fut20_ret'] = lag(df, 'ret', -20, 'period')
df.head(20)
I get
period id pred gain fee prc ret fut20_ret
0 1 1285 -1.653478 0.000000 0.87 1.000000 NaN NaN
1 2 1285 -1.653478 -0.022222 0.87 0.977778 -0.022222 NaN
2 3 1285 -1.653478 -0.035556 0.87 0.964444 -0.013636 NaN
3 4 1285 -1.653478 -0.035556 0.87 0.964444 0.000000 NaN
4 5 1285 -1.653478 -0.062222 0.87 0.937778 -0.027650 NaN
6 8 1285 -1.653478 -0.022222 0.87 0.977778 NaN NaN
7 9 1285 -1.653478 -0.031111 0.87 0.968889 -0.009091 NaN
8 10 1285 -1.653478 -0.044444 0.87 0.955556 -0.013761 NaN
9 11 1285 -1.653478 -0.040000 0.87 0.960000 0.004651 NaN
10 13 1285 -1.653478 -0.048889 0.87 0.951111 NaN NaN
11 14 1285 -1.653478 -0.031111 0.87 0.968889 0.018692 NaN
12 15 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
13 16 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
15 19 1285 -1.653478 -0.035556 0.87 0.964444 NaN NaN
16 20 1285 -1.653478 -0.040000 0.87 0.960000 -0.004608 NaN
17 21 1285 -1.653478 -0.035556 0.87 0.964444 0.004630 NaN
18 22 1285 -1.653478 -0.013333 0.87 0.986667 0.023041 NaN
19 23 1285 -1.653478 -0.013333 0.87 0.986667 0.000000 NaN
How can I modify my lag function so that it works properly? It's close, but I'm struggling with the last little bit.
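One possible fix, sketched under the assumption that the (id, period) pairs are unique: shifting by row position fails here because each id has fewer rows than the 20-period offset and period has gaps, so instead look the value up by shifting the period key itself and merging:
import pandas as pd

def lag(df, col, lag_dist=1, ref='period', group='id'):
    # value of `col` taken from the row where ref equals ref - lag_dist
    # within the same group; gaps in `ref` yield NaN automatically
    lookup = df[[group, ref, col]].copy()
    lookup[ref] = lookup[ref] + lag_dist  # value at ref=q lands on rows with ref=q+lag_dist
    merged = df[[group, ref]].merge(lookup, on=[group, ref], how='left')
    return pd.Series(merged[col].to_numpy(), index=df.index,
                     name='lag' + str(lag_dist) + '_' + col)

# as in the question: bring returns back 20 periods
# df['fut20_ret'] = lag(df, 'ret', -20)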
When I use aggfunc=np.var in pivot_table, I found that the values of the metrics become NaN, but with aggfunc=np.sum they don't.
Why are the original values changed with aggfunc=np.var or aggfunc=np.std? I could not find the answer in the docs of pivot_table.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
margins=True,
margins_name = 'sum',
dropna = False
))
print('-' * 100)
df = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.var,
margins=True,
margins_name = 'var',
dropna = False
)
print(df)
D E
C large small sum large small sum
A B
bar one 4.0 5.0 9 6.0 8.0 14
two 7.0 6.0 13 9.0 9.0 18
foo one 4.0 1.0 5 9.0 2.0 11
two NaN 6.0 6 NaN 11.0 11
sum 15.0 18.0 33 24.0 30.0 54
-----------------------------------------------------------------------
D E
C large small var large small var
A B
bar one NaN NaN 0.500000 NaN NaN 2.000000
two NaN NaN 0.500000 NaN NaN 0.000000
foo one 0.000000 NaN 0.333333 0.500000 NaN 2.333333
two NaN 0.0 0.000000 NaN 0.5 0.500000
var 5.583333 3.8 3.555556 4.666667 7.5 4.888889
What's more, I found that the var of D = large is np.var([4.0, 7.0, 4.0]) = 2.0 instead of 5.583333.
what I expected is:
D E
C large small var large small var
A B
bar one 4.0 5.0 0.25 6.0 8.0 1.0
two 7.0 6.0 0.25 9.0 9.0 0
foo one 4.0 1.0 2.25 9.0 2.0 12.25
two NaN 6.0 0 NaN 11.0 0.0
var 2.0 4.25 3.6 2.0 11.25 7.34
What is the meaning of aggfunc = np.var in pivot table?
Pandas uses ddof=1 by default; see here for details on np.var.
When you have just one value, the variance with ddof=1 is NaN, because you divide by zero (n - ddof = 0).
The var of D = large is np.var([2, 2, 4, 7], ddof=1) = 5.583333333333333, so everything is correct (you have to use the individual values, not the sums).
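For illustration, the two ddof settings on the same data:
import numpy as np

vals = [2, 2, 4, 7]           # the individual D values where C == 'large'
print(np.var(vals, ddof=1))   # 5.583333333333333 -- sample variance (pandas default)
print(np.var(vals, ddof=0))   # 4.1875 -- population variance
print(np.var([4.0], ddof=1))  # nan -- a single value divides by n - ddof = 0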
If you need var with ddof = 0 then you can provide your own function:
def var0(x):
return np.var(x, ddof=0)
print(df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= var0,
margins=True,
margins_name = 'var',
dropna = False
))
Result:
D E
C large small var large small var
A B
bar one 0.0000 0.00 0.250000 0.00 0.00 1.000000
two 0.0000 0.00 0.250000 0.00 0.00 0.000000
foo one 0.0000 0.00 0.222222 0.25 0.00 1.555556
two NaN 0.00 0.000000 NaN 0.25 0.250000
var 4.1875 3.04 3.555556 3.50 6.00 4.888889
UPDATE based on the edited question: a pivot table with the sums of C and, additionally, the var of those sums as margin column/row.
We first create the sum pivot table with its margin column/row named var, then overwrite those margin cells with the var of the sum table:
dfs = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
margins=True,
margins_name = 'var',
dropna = False)
dfs[[('D','var'),('E','var')]] = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
dropna = False).stack().groupby(level=(0,1)).apply(var0)
dfs.iloc[-1] = dfs.iloc[:-1].apply(var0)
Result:
D E
C large small var large small var
A B
bar one 4.0 5.00 0.250000 6.0 8.00 1.000000
two 7.0 6.00 0.250000 9.0 9.00 0.000000
foo one 4.0 1.00 2.250000 9.0 2.00 12.250000
two NaN 6.00 0.000000 NaN 11.00 0.000000
var 2.0 4.25 0.824219 2.0 11.25 26.792969
In the margin row (last row) the var columns are calculated as the var of the row vars. I don't understand how the OP calculated the values for these two cells; in any case, they don't seem to make much sense.
My data looks like this:
0.000000 0.071429 0.071429 0.857143
0.071429 0.428571 0.071429 0.428571
0.357143 0.214286 0.357143 0.071429
0.000000 0.714286 0.000000 0.285714
0.000000 0.571429 0.000000 0.428571
0.428571 0.357143 0.071429 0.142857
0.000000 0.071429 0.071429 0.857143
0.071429 0.000000 0.928571 0.000000
0.000000 0.071429 0.000000 0.928571
0.000000 0.285714 0.000000 0.714286
0.142857 0.000000 0.785714 0.071429
I want it to look like this:
AC name_of_the_file.txt
00 0.000000 0.071429 0.071429 0.857143
01 0.071429 0.428571 0.071429 0.428571
02 0.357143 0.214286 0.357143 0.071429
03 0.000000 0.714286 0.000000 0.285714
04 0.000000 0.571429 0.000000 0.428571
05 0.428571 0.357143 0.071429 0.142857
06 0.000000 0.071429 0.071429 0.857143
07 0.071429 0.000000 0.928571 0.000000
08 0.000000 0.071429 0.000000 0.928571
09 0.000000 0.285714 0.000000 0.714286
10 0.142857 0.000000 0.785714 0.071429
XX
//
How can I use awk to prepend $1 as a counter (from 00 until the file ends)?
One way:
awk 'FNR == 1 { print FILENAME } { printf "%02d %s\n", FNR - 1, $0 }' infile
Output:
infile
00 0.000000 0.071429 0.071429 0.857143
01 0.071429 0.428571 0.071429 0.428571
02 0.357143 0.214286 0.357143 0.071429
03 0.000000 0.714286 0.000000 0.285714
04 0.000000 0.571429 0.000000 0.428571
05 0.428571 0.357143 0.071429 0.142857
06 0.000000 0.071429 0.071429 0.857143
07 0.071429 0.000000 0.928571 0.000000
08 0.000000 0.071429 0.000000 0.928571
09 0.000000 0.285714 0.000000 0.714286
10 0.142857 0.000000 0.785714 0.071429
I'm not clear what the real question is. Superficially, this should do the job (note the file must be given as an argument so FILENAME is set):
awk '{ if (line++ == 0) { print "AC " FILENAME; } printf("%.2d %s\n", NR-1, $0); }
     END { print "XX\n//" }' infile