How to keep the data with values greater than 1? - pandas

I have the following data:
Reporter ISO AFG AGO ALB ARE ARG ARM AUS AUT AZE BEL BFA BGD BGR BHR BIH BLR BOL BRA BWA CAN CHE CHL CHN CIV CMR COD COG COL CRI CUB CYP CZE DEU DNK DOM DZA ECU EGY ESP EST ETH FIN FRA GAB GBR GEO GHA GIN GNQ GRC GTM HKG HND HRV HTI HUN IDN IND IRL IRN IRQ ISR ITA JAM JOR JPN KAZ KEN KGZ KHM KOR KWT LAO LBN LBR LBY LKA LSO LTU LVA MAR MDA MDG MEX MKD MLI MMR MNG MOZ MRT MUS MYS NAM NER NGA NIC \
SITC
0011 0.000000 0.000000 0.000000 0.004613 0.033138 3.359434 8.726847 1.191898 0.002008 0.728046 0.437901 0.000000 1.799574 0.000000 0.421612 0.340513 0.000000 3.297533 2.755377 4.662559 0.002025 0.012436 0.050989 0.000000 0.000000 0.000000 0.000000 2.985603 0.058683 0.000000 0.005523 2.243466 0.481428 1.750662 0.000000 0.000000 0.000000 0.136531 2.034463 2.978088 0.361915 0.010336 6.421094 0.0 0.011530 11.906661 6.940706e-04 0.000000 0.0 0.020897 0.000583 0.000000 0.005827 5.801019 0.000000 2.783294 0.000000 0.056718 1.497509 0.002035 0.000000 0.000000e+00 0.034540 0.000000 0.000000 1.146623e-05 2.983633 0.328246 0.738821 0.000000 0.000000 0.005245 75.826263 2.077911 0.0 0.000000 0.000000 0.173766 2.974509 5.519316 0.000000 6.844561 0.000000 2.154048 0.594069 0.000000 26.490391 0.000000 0.000174 0.000000 0.000000 0.004466 34.975037 0.003507 0.000007 2.099508
0012 1.375525 0.000000 0.133913 0.084067 0.006681 16.383479 4.062770 0.201646 0.003303 0.011301 0.919492 0.000000 1.473571 0.000000 0.000000 0.010076 0.000000 0.000017 0.021425 0.021679 0.000147 0.030245 0.003140 0.000000 0.000362 0.000000 0.000000 0.000167 0.006618 0.000000 0.399117 0.075002 0.014116 0.007865 0.000000 0.000000 0.000000 0.061023 5.865379 0.549569 1.101432 0.000746 0.768423 0.0 0.714328 41.018233 1.013471e-03 0.000000 0.0 0.545327 0.000000 0.000000 0.000000 0.673710 0.000000 5.297044 0.015160 0.584192 0.072973 16.858856 0.000000 0.000000e+00 0.000389 0.000000 191.328757 0.000000e+00 5.018329 0.011129 2.938414 0.000000 0.000000 1.265317 1.255584 31.576510 0.0 0.000000 0.000000 4.233523 0.097645 0.119054 0.000000 2.542809 0.006852 0.001415 0.000000 0.000000 0.895853 0.000000 0.001034 0.000000 0.000000 0.000597 59.614346 0.111905 0.000000 0.004668
0013 0.000000 0.000000 0.000000 0.000000 0.002183 0.000000 0.000000 0.118824 0.000000 1.891616 0.000000 0.000000 0.166649 0.000000 0.434766 0.000000 0.000000 0.122091 0.000000 3.206870 0.002316 0.000000 0.667829 0.000000 0.000000 0.000000 0.000000 0.024970 0.198896 0.000000 0.004364 1.446706 0.571243 45.310792 0.000000 0.000000 0.000000 0.000000 2.285895 0.343793 0.000000 0.189181 0.984170 0.0 0.007111 0.000000 0.000000e+00 0.000000 0.0 1.500368 0.000000 0.000000 0.000000 11.634354 0.000000 2.297556 1.535006 0.000242 1.827070 0.000000 0.000000 0.000000e+00 0.001336 0.000000 0.000000 0.000000e+00 0.000000 0.008365 0.000000 0.012099 0.000000 0.000000 0.050146 0.000000 0.0 0.000000 0.000000 0.000000 2.107540 4.406836 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.027162 0.000000 0.000000 0.000000 0.000000 0.204559 0.000000 0.001059 0.000000 0.000000
0014 0.000000 0.000000 0.000000 0.015802 0.104162 0.000000 0.177206 1.673901 0.000000 2.213890 0.000000 0.000000 1.884009 0.002777 0.698578 0.256640 0.000000 2.192021 0.000053 0.693865 0.013870 0.000000 0.003156 0.000738 0.000000 0.001240 0.098058 0.311633 3.117316 0.000000 0.002897 2.372971 2.589253 5.315211 0.000000 0.000000 0.000000 0.008489 1.197066 0.067713 0.000000 0.238736 2.808734 0.0 2.362742 3.864489 1.944358e-03 0.000081 0.0 0.819977 0.855736 0.000005 0.419152 2.823619 0.000000 6.604809 0.011153 0.009950 0.259364 0.099504 0.000090 0.000000e+00 0.105376 0.078613 1.520528 8.424726e-07 0.053117 0.096929 0.088181 0.000000 0.001169 0.000311 1.712818 0.010306 0.0 0.000000 0.001087 0.004644 0.688807 0.541286 0.006068 0.000000 0.058922 0.000000 0.034544 0.040636 0.057898 0.000000 0.000000 0.000000 1.447492 3.829856 0.000223 0.011361 0.000636 0.000000
0015 0.000000 0.000000 0.000000 0.445323 2.195289 0.000000 2.307454 0.028947 0.026562 1.560077 0.002505 0.000008 0.211124 0.166718 0.003793 0.081355 0.000000 0.286775 0.000936 0.962403 0.279596 0.509734 0.000336 0.000000 0.000000 0.000124 0.000000 0.077661 0.110068 0.000000 0.009396 0.015761 0.912955 1.231179 0.027379 0.000000 0.007512 0.011966 0.231880 0.061287 0.000000 0.019807 2.278230 0.0 6.008955 0.000000 0.000000e+00 0.000000 0.0 0.003849 0.013636 6.383234 0.026093 0.512240 0.000000 0.090692 0.000028 0.003678 10.118232 0.002744 0.000000 5.552301e-03 0.052228 0.464133 0.076198 1.859848e-01 0.077715 0.009641 4.554831 0.000000 0.041567 0.030370 0.000706 0.050001 0.0 0.000000 0.001080 0.016164 0.098448 0.330189 0.035837 0.000000 0.000000 0.045076 0.002459 0.000000 0.000000 0.204988 0.000000 0.000000 0.323190 0.017127 0.118358 1.816213 0.000000 0.061127
0019 2.404198 0.391568 4.746603 0.936277 0.472483 0.029448 0.197240 0.184044 0.015951 4.052927 0.000028 0.000000 0.039448 0.696807 0.055031 0.278160 0.003875 0.039235 0.142762 1.905431 0.122174 0.426857 0.350115 0.002159 1.978486 0.490255 0.013705 0.773129 0.114343 0.578595 0.931443 1.205081 0.423272 4.725125 0.068519 0.006181 1.301403 24.162001 1.983499 0.085567 0.205218 0.010852 2.269924 0.0 1.177672 0.059922 4.544931e-01 0.321237 0.0 0.302619 0.026811 0.127578 0.008496 0.028143 0.566884 1.295090 0.458030 0.008520 0.141866 0.483547 0.000115 4.447335e+00 0.486243 0.215677 1.878902 3.570814e-01 0.547522 25.609184 1.088240 31.133218 0.071370 1.603031 0.557699 2.015355 0.0 0.013534 0.018164 0.000000 0.837837 0.472243 6.385796 0.143335 2.607711 0.253619 0.530978 3.819587 0.007063 1.876078 0.061506 0.003411 178.476453 0.162198 4.326213 0.699293 0.007034 1.278230
0111 0.000000 0.000000 0.079421 0.017039 8.901930 0.016561 6.866686 1.690299 0.002603 1.098923 0.000039 0.000431 0.011489 0.000469 0.079557 7.099912 0.266318 3.096303 4.389923 3.143088 0.000200 0.069349 0.000057 0.004926 0.000137 0.000000 0.000000 0.407158 0.751635 0.000000 0.010285 0.217308 0.600825 1.897867 0.006995 0.000000 0.000000 0.022382 1.547584 0.294325 5.218987 0.299988 1.380903 0.0 0.816653 0.024922 4.210469e-07 0.001651 0.0 0.043783 0.000253 0.018290 0.883637 1.489829 0.000000 0.225030 0.000000 0.117034 8.112654 0.039480 0.000779 2.220238e-05 0.599626 0.000475 0.012492 1.381745e-01 0.105032 0.535493 0.002049 0.000000 0.003681 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000387 1.866342 1.449632 0.000032 0.173435 0.000000 1.972488 0.000000 0.000000 0.004182 0.000000 0.000000 0.000000 0.000000 0.001961 2.985457 0.004949 0.000049 34.018866
0112 0.027311 0.002101 0.000000 0.803547 23.202077 0.004398 10.508959 0.519682 0.001062 0.073772 0.000000 0.000079 0.018364 0.003869 0.008124 4.198086 0.407166 16.772364 4.115303 0.606431 0.001592 0.717869 0.000413 0.001026 0.000165 0.000000 0.001540 0.745339 5.223481 0.000000 0.007199 0.025556 0.136980 0.196708 0.163228 0.001689 0.000194 0.005037 0.336007 0.083353 0.001620 0.047842 0.075827 0.0 0.151379 0.551067 1.252943e-05 0.000000 0.0 0.019852 0.497362 0.352512 1.498862 0.232900 0.000000 0.065593 0.000016 6.389757 1.316056 0.003152 0.000508 0.000000e+00 0.197412 0.005625 0.847232 1.302507e-01 0.171425 0.673704 0.000031 0.000000 0.000458 0.020664 0.351126 0.005583 0.0 0.000000 0.000000 0.000000 0.791597 0.202655 0.004815 0.478381 0.000000 0.227916 0.004645 0.000014 0.013300 0.016229 0.000000 0.000000 0.021640 0.006599 3.261191 0.000000 0.000355 37.088510
0121 0.001582 0.000000 0.000000 0.052209 0.206769 7.540986 23.627297 0.019910 0.013100 0.586377 0.000000 0.000279 0.530770 0.004078 0.011636 0.000000 0.000000 0.005445 0.004388 0.005877 0.000400 1.215561 0.021429 0.005590 0.003670 0.000000 0.001588 0.000000 0.000000 0.000000 0.050929 0.015053 0.104453 0.122767 0.000000 0.000000 0.000343 0.000481 1.535667 0.091318 52.880101 0.000933 0.329943 0.0 2.614312 9.232351 0.000000e+00 0.000000 0.0 2.868348 0.000000 0.026744 0.000000 0.128611 0.000000 0.049441 0.000000 0.747861 4.649703 0.194667 0.002230 0.000000e+00 0.095171 0.004332 0.429761 8.953811e-05 0.514561 21.753044 3.573798 0.000014 0.000000 0.000000 0.000000 0.086494 0.0 0.000000 0.000000 0.008787 0.013936 0.211361 0.000246 4.579582 0.051408 0.034361 3.610196 0.000000 0.176476 7.077663 0.000000 0.103573 0.002899 0.001336 2.437478 0.000000 0.006742 0.000000
0122 0.000000 0.000000 0.000000 0.001704 0.215322 0.000052 0.144528 1.429577 0.000000 1.869959 0.000000 0.000000 0.213683 0.000000 0.092459 0.052490 0.054549 3.761311 0.007157 3.351890 0.004849 4.318209 0.032182 0.001256 0.000000 0.000000 0.000951 0.000006 0.058686 0.000000 0.681542 0.204737 1.946622 13.168414 0.003606 0.000000 0.000000 0.000142 8.679341 1.176507 0.000000 0.649505 1.162449 0.0 0.604442 0.431210 0.000000e+00 0.000000 0.0 0.178929 0.000000 0.069482 0.002235 0.770522 0.000000 1.796879 0.000193 0.003041 1.622033 0.001978 0.000000 7.350485e-07 0.208763 0.000071 0.000000 6.750007e-03 0.014725 0.124527 0.000000 0.000000 0.000426 0.000000 0.058161 0.000042 0.0 0.000000 0.000541 0.000503 0.330381 0.321049 0.000264 0.000000 0.000000 0.859326 0.001074 0.000000 0.020189 0.000000 0.000000 0.000000 0.000342 0.014416 0.076196 0.112682 0.000000 0.015525
I want to drop the values that are less than 1
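One possible way to do this, shown as a minimal sketch: `DataFrame.where` keeps the values that satisfy a condition and replaces the rest with NaN. The small frame below is a hypothetical stand-in that borrows a few values from the table above. The threshold `>= 1` is an assumption, since the title says "greater than 1" while the text says "less than 1"; switch to `> 1` if values equal to 1 should also be dropped.

```python
import pandas as pd

# Hypothetical stand-in for the SITC-by-country table in the question.
df = pd.DataFrame(
    {
        "AFG": [0.000000, 1.375525, 2.404198],
        "ARM": [3.359434, 16.383479, 0.029448],
    },
    index=pd.Index(["0011", "0012", "0019"], name="SITC"),
)

# Keep values >= 1; everything below the threshold becomes NaN.
kept = df.where(df >= 1)
print(kept)
```

If whole rows or columns should disappear once nothing in them passes the threshold, `kept.dropna(how="all")` (rows) or `kept.dropna(axis=1, how="all")` (columns) can follow.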

Related

pandas multiindex sort by specified rules

There is a dataframe like below
arrays = [
np.array(["baz", "baz", "bar", "bar", "qux", "foo"]),
np.array(["yes", "no", "yes", "no", "yes", "no"]),
]
df = pd.DataFrame(np.random.randint(100, size=(6,4)), index=arrays)
df
Now I want to know the yes_rate (yes / All) of each column.
My implementation is below:
first_index_list = list(df.index.get_level_values(0).unique())
for index in first_index_list:
    index_sum = df.loc[index].sum()
    if df.index.isin([(index,'yes')]).any():
        yes_rate = df.loc[(index, 'yes')] / index_sum
        df.loc[(index, 'yes_rate'),:] = yes_rate
    df.loc[(index,'All'),:] = index_sum
df.sort_index()
But the code is not ideal; there are some problems:
How can I sort in the order below?
first index: [baz,bar,qux,foo], just as in the first picture
second index: [no,yes,All,yes_rate]
When the code is executed repeatedly, the All and yes_rate values should not change.
So how can I add only yes and no to generate All (note: yes and no are not guaranteed to exist)?
index_sum = ...
if yes exists:
    index_sum += df.loc[(index, 'yes')]
if no exists:
    index_sum += df.loc[(index, 'no')]
IIUC, you can use pandas.concat to concatenate in the desired order, then only sort the first level:
l = ['baz', 'bar', 'qux', 'foo']
order = pd.Series({k:v for v,k in enumerate(l)})
df_all = df.groupby(level=0).sum()
out = (pd
   .concat([df,
            pd.concat({'All': df_all,
                       'yes_rate': df.xs('yes', level=1).div(df_all)})
              .dropna(how='all')
              .swaplevel()
            ],)
   .sort_index(level=0, key=order.reindex, sort_remaining=False)
)
output:
0 1 2 3
baz yes 20.000000 97.000000 95.000000 38.000000
no 85.000000 73.000000 23.000000 27.000000
All 105.000000 170.000000 118.000000 65.000000
yes_rate 0.190476 0.570588 0.805085 0.584615
bar yes 86.000000 32.000000 73.000000 16.000000
no 9.000000 97.000000 2.000000 55.000000
All 95.000000 129.000000 75.000000 71.000000
yes_rate 0.905263 0.248062 0.973333 0.225352
qux yes 69.000000 16.000000 92.000000 82.000000
All 69.000000 16.000000 92.000000 82.000000
yes_rate 1.000000 1.000000 1.000000 1.000000
foo no 77.000000 5.000000 12.000000 3.000000
All 77.000000 5.000000 12.000000 3.000000
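If the second level also has to follow the requested order [no, yes, All, yes_rate], one option is to build the wanted (first, second) pairs explicitly and reindex. This is only a sketch under assumptions: the data is a deterministic `np.arange` stand-in for the random frame, and the names `first_order`, `second_order`, `wanted` and `combined` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

arrays = [
    np.array(["baz", "baz", "bar", "bar", "qux", "foo"]),
    np.array(["yes", "no", "yes", "no", "yes", "no"]),
]
# Deterministic stand-in for the random data in the question.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=arrays)

# Totals per first-level key, computed from the original rows only,
# so re-running never folds yes_rate into All.
df_all = df.groupby(level=0).sum()
# yes / All; groups without a "yes" row become all-NaN and are dropped.
yes_rate = df.xs("yes", level=1).div(df_all).dropna(how="all")
extra = pd.concat({"All": df_all, "yes_rate": yes_rate}).swaplevel()
combined = pd.concat([df.astype(float), extra])

first_order = ["baz", "bar", "qux", "foo"]
second_order = ["no", "yes", "All", "yes_rate"]
# Keep only the pairs that actually exist, in the requested order.
wanted = [
    (a, b)
    for a in first_order
    for b in second_order
    if (a, b) in combined.index
]
out = combined.reindex(pd.MultiIndex.from_tuples(wanted))
print(out)
```

Because the order is spelled out pair by pair, missing yes or no rows are simply skipped rather than created.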
def function1(ss:pd.Series):
    if ss.name=='level_1':
        ss[6]='yes_rate'
        ss[-1]='ALL'
    else:
        ss[6]=ss.iloc[0]/ss.sum()
        ss[-1]=ss.sum()
    return ss.sort_index()
df.reset_index().groupby('level_0').apply(lambda dd:dd.apply(function1).set_index('level_1'))
#%%
0 1 2 3
level_0 level_1
bar ALL 164.506098 136.514706 176.511364 83.012048
yes 83.000000 70.000000 90.000000 1.000000
no 81.000000 66.000000 86.000000 82.000000
yes_rate 0.506098 0.514706 0.511364 0.012048
baz ALL 100.310000 32.281250 35.228571 143.342657
yes 31.000000 9.000000 8.000000 49.000000
no 69.000000 23.000000 27.000000 94.000000
yes_rate 0.310000 0.281250 0.228571 0.342657
foo ALL 52.000000 29.000000 29.000000 35.000000
no 51.000000 28.000000 28.000000 34.000000
yes_rate 1.000000 1.000000 1.000000 1.000000
qux ALL 7.000000 65.000000 33.000000 57.000000
yes 6.000000 64.000000 32.000000 56.000000
yes_rate 1.000000 1.000000 1.000000 1.000000

Find top 3 highest values across 3 columns row-wise pandas

I have four columns of values containing the looking times to a right/left/upper/lower positioned image. Another column shows the position of the image that was chosen (decision_resp). I created a new column showing the looking time of the chosen image. Now I want to create 3 more columns showing the looking times of the images that were not chosen, sorted by highest looking time (top1), second highest looking time (top2) and third highest looking time (top3). The looking time of the chosen image has to be excluded.
These are the columns I have:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
My approach was:
# create new column for looking time of chosen image
trials.loc[trials['decision_resp']=='right','chosen_img_et'] = trials['lookRight_t']
trials.loc[trials['decision_resp']=='left','chosen_img_et'] = trials['lookLeft_t']
trials.loc[trials['decision_resp']=='down','chosen_img_et'] = trials['lookDown_t']
trials.loc[trials['decision_resp']=='up','chosen_img_et'] = trials['lookUp_t']
# here I got stuck
trials.loc[trials['decision_resp']=='right', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookDown_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='left', 3 new columns (top1/2/3)] = trials[['lookRight_t', 'lookDown_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='down', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookRight_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='up', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookDown_t', 'lookRight_t']].find max values and put it in order
Thank you for any help!
First, you can use a lookup solution for the new column chosen_img_et.
Then you can sort the selected columns with numpy.sort and use indexing to take the top 3 values per row without the chosen value. To exclude it, the column names are compared with the column decision_resp using broadcasting, and the matching values are set to missing values with DataFrame.mask:
cols = ['lookRight_t','lookLeft_t','lookUp_t','lookDown_t']
#replace substrings for match column decision_resp
look = [x.replace('look','').replace('_t','').lower() for x in cols]
new = [f'top{x+1}' for x in range(3)]
#lookup
idx1, cols1 = pd.factorize(trials['decision_resp'])
trials['chosen_img_et'] = (trials[cols].set_axis(look, axis=1)
                                       .reindex(cols1, axis=1)
                                       .to_numpy()[np.arange(len(trials)), idx1])
mask = np.array(look) == trials['decision_resp'].to_numpy()[:, None]
#np.sort sorts in ascending order by default and places NaN last,
#so 2::-1 takes the first 3 columns in reverse (descending) order and drops the last column, which holds only NaN
trials[new] = np.sort(trials[cols].mask(mask), axis=1)[:, 2::-1]
print (trials)
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000
Details:
print (trials[cols].mask(mask))
lookRight_t lookLeft_t lookUp_t lookDown_t
0 1.291667 1.325000 NaN 1.141667
1 0.000000 0.000000 1.125000 NaN
2 0.000000 0.000000 NaN 2.275000
3 NaN 1.950000 0.000000 0.000000
4 NaN 1.316667 1.341667 0.000000
5 1.766667 1.333333 0.825000 NaN
6 0.000000 0.000000 1.108333 NaN
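To make the slicing step easier to follow, here is a small sketch on a single masked row (row 0 from the table above). It relies only on the documented behavior of `np.sort`, which sorts ascending and places NaN at the end; the variable names are illustrative.

```python
import numpy as np

# Row 0 after masking: the chosen "up" value has been replaced by NaN.
row = np.array([[1.291667, 1.325000, np.nan, 1.141667]])

# Ascending sort; NaN is moved to the last position.
sorted_row = np.sort(row, axis=1)

# Indices 2, 1, 0: the three real values in descending order, NaN left out.
top3 = sorted_row[:, 2::-1]
print(top3)
```

The slice works because exactly one value per row is masked, so the NaN always lands in the fourth column.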
Find the max column name and extract the direction from it.
Find the max for each row.
Find the remaining 3 values and sort them as desired.
Concat the results together.
Renaming columns as I go.
decision_resp = df.idxmax(axis=1).str.extract(r'look(\w*)_t', expand=False)
decision_resp.rename('decision_resp', inplace=True)
chosen_img_et = df.max(axis=1, numeric_only=True)
chosen_img_et.rename('chosen_img_et', inplace=True)
top3 = df.apply(lambda x: x.nlargest(4).sort_values(ascending=False, ignore_index=True)[1:], axis=1)
top3.columns = ['top1', 'top2', 'top3']
df = pd.concat([df, decision_resp, chosen_img_et, top3], axis=1)
print(df)
Output:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 Up 3.025000
1 0.000000 0.000000 1.125000 3.150000 Down 3.150000
2 0.000000 0.000000 3.508333 2.275000 Up 3.508333
3 3.700000 1.950000 0.000000 0.000000 Right 3.700000
4 2.633333 1.316667 1.341667 0.000000 Right 2.633333
5 1.766667 1.333333 0.825000 2.208333 Down 2.208333
6 0.000000 0.000000 1.108333 5.283333 Down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000
Other way, addressing jezrael's concerns:
col_list = ['lookRight_t', 'lookLeft_t', 'lookUp_t', 'lookDown_t']
idx, cols = pd.factorize('look' + df['decision_resp'].str.title() + '_t')
df['chosen_img_et'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
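# compare against the full column names ('lookUp_t', ...), not the raw 'up'/'down' labels
mask = np.array(col_list) == ('look' + df['decision_resp'].str.title() + '_t').to_numpy()[:, None]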
df[[f'top{x+1}' for x in range(3)]] = np.sort(df[col_list].mask(mask), axis=1)[:, 2::-1]
Output:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000

F tensorflow/stream_executor/cuda/cuda_dnn.cc:522] Check failed

I am training my model in TensorFlow. After some iterations, I get the following exception:
F tensorflow/stream_executor/cuda/cuda_dnn.cc:522] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 1 feature_map_count: 1088 spatial: 63 0 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
The issue is not with the CUDA library, since the same code works for other datasets. What could be the reason for it?

How to add missing data to Pandas in Monthly Data

I have the following dataframe:
Date
2002-01-01 10.0 NaN NaN
2002-05-01 NaN 30.0 40.0
2002-07-01 NaN NaN 50.0
I would like to fill in the missing months with zeros. I am actually able to do that, but only by adding the entire range of missing days, as you can see in the following code. The relevant part of the code is marked with rows of # characters.
import datetime

import numpy as np
import pandas as pd

def createSeriesOfCompanies(df):
    listOfCompanies=list(set(df['Company']))
    dfSeries=df.pivot(index='Date', columns='Company', values='var1')
    # Here I include the missing dates
    #######################################################
    initialDate=dfSeries.index[0]
    endDate=dfSeries.index[-1]
    idx = pd.date_range(initialDate, endDate)
    dfSeries.index = pd.DatetimeIndex(dfSeries.index)
    dfSeries = dfSeries.reindex(idx, fill_value=0)
    ########################################################
    # Here it finishes the procedure
    return dfSeries, listOfCompanies

def creatingDataFrame():
    dateList=[]
    dateList.append(datetime.date(2002,1,1))
    dateList.append(datetime.date(2002,7,1))
    dateList.append(datetime.date(2002,5,1))
    dateList.append(datetime.date(2002,5,1))
    dateList.append(datetime.date(2002,7,1))
    raw_data = {'Date': dateList,
                'Company': ['A', 'B', 'B', 'C' , 'C'],
                'var1': [10, 20, 30, 40 , 50]}
    df = pd.DataFrame(raw_data, columns = ['Date','Company', 'var1'])
    df.loc[1, 'var1'] = np.nan
    return df

if __name__=="__main__":
    df=creatingDataFrame()
    print(df)
    dfSeries,listOfCompanies=createSeriesOfCompanies(df)
I would like to get
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0 0 0
2002-03-01 0 0 0
2002-04-01 0 0 0
2002-05-01 NaN 30.0 40.0
2002-06-01 0 0 0
2002-07-01 NaN NaN 50.0
But I am getting this
Company A B C
2002-01-01 10.0 NaN NaN
2002-01-02 0.0 0.0 0.0
2002-01-03 0.0 0.0 0.0
2002-01-04 0.0 0.0 0.0
2002-01-05 0.0 0.0 0.0
2002-01-06 0.0 0.0 0.0
2002-01-07 0.0 0.0 0.0
2002-01-08 0.0 0.0 0.0
2002-01-09 0.0 0.0 0.0
2002-01-10 0.0 0.0 0.0
2002-01-11 0.0 0.0 0.0
2002-01-12 0.0 0.0 0.0
2002-01-13 0.0 0.0 0.0
2002-01-14 0.0 0.0 0.0
2002-01-15 0.0 0.0 0.0
2002-01-16 0.0 0.0 0.0
2002-01-17 0.0 0.0 0.0
2002-01-18 0.0 0.0 0.0
2002-01-19 0.0 0.0 0.0
2002-01-20 0.0 0.0 0.0
2002-01-21 0.0 0.0 0.0
2002-01-22 0.0 0.0 0.0
2002-01-23 0.0 0.0 0.0
2002-01-24 0.0 0.0 0.0
2002-01-25 0.0 0.0 0.0
2002-01-26 0.0 0.0 0.0
2002-01-27 0.0 0.0 0.0
2002-01-28 0.0 0.0 0.0
2002-01-29 0.0 0.0 0.0
2002-01-30 0.0 0.0 0.0
...
How can I deal with this problem?
You can use reindex with a month-start date range. Assuming the dates are the index:
df.index = pd.to_datetime(df.index)
df.reindex(pd.date_range(df.index.min(), df.index.max(), freq = 'MS'))
A B C
2002-01-01 10.0 NaN NaN
2002-02-01 NaN NaN NaN
2002-03-01 NaN NaN NaN
2002-04-01 NaN NaN NaN
2002-05-01 NaN 30.0 40.0
2002-06-01 NaN NaN NaN
2002-07-01 NaN NaN 50.0
Use asfreq with the MS (month start) frequency:
df=creatingDataFrame()
df = df.pivot(index='Date', columns='Company', values='var1').asfreq('MS', fill_value=0)
print (df)
Company A B C
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0.0 0.0 0.0
2002-03-01 0.0 0.0 0.0
2002-04-01 0.0 0.0 0.0
2002-05-01 NaN 30.0 40.0
2002-06-01 0.0 0.0 0.0
2002-07-01 NaN NaN 50.0
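For reference, a self-contained sketch of the same idea. One assumption differs from the question's code: the dates are built with `pd.to_datetime` rather than `datetime.date`, so that the pivoted frame has a `DatetimeIndex`, which `asfreq` requires. The names `wide` and `monthly` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # Assumption: real timestamps instead of datetime.date objects.
    "Date": pd.to_datetime(
        ["2002-01-01", "2002-07-01", "2002-05-01", "2002-05-01", "2002-07-01"]
    ),
    "Company": ["A", "B", "B", "C", "C"],
    "var1": [10, np.nan, 30, 40, 50],
})

wide = df.pivot(index="Date", columns="Company", values="var1")

# Insert the missing month starts; only the newly created rows get 0,
# NaN values that were already in the data are left untouched.
monthly = wide.asfreq("MS", fill_value=0)
print(monthly)
```

This matches the desired output, where existing rows keep their NaN values and only the added months are zero.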

awk print column for a given range

My data looks like this:
0.000000 0.071429 0.071429 0.857143
0.071429 0.428571 0.071429 0.428571
0.357143 0.214286 0.357143 0.071429
0.000000 0.714286 0.000000 0.285714
0.000000 0.571429 0.000000 0.428571
0.428571 0.357143 0.071429 0.142857
0.000000 0.071429 0.071429 0.857143
0.071429 0.000000 0.928571 0.000000
0.000000 0.071429 0.000000 0.928571
0.000000 0.285714 0.000000 0.714286
0.142857 0.000000 0.785714 0.071429
I want it to look like this:
AC name_of_the_file.txt
00 0.000000 0.071429 0.071429 0.857143
01 0.071429 0.428571 0.071429 0.428571
02 0.357143 0.214286 0.357143 0.071429
03 0.000000 0.714286 0.000000 0.285714
04 0.000000 0.571429 0.000000 0.428571
05 0.428571 0.357143 0.071429 0.142857
06 0.000000 0.071429 0.071429 0.857143
07 0.071429 0.000000 0.928571 0.000000
08 0.000000 0.071429 0.000000 0.928571
09 0.000000 0.285714 0.000000 0.714286
10 0.142857 0.000000 0.785714 0.071429
XX
//
How can I use awk to print a running index as $1 (from 00 until the file ends)?
One way:
awk 'FNR == 1 { print FILENAME } { printf "%02d %s\n", FNR - 1, $0 }' infile
Output:
infile
00 0.000000 0.071429 0.071429 0.857143
01 0.071429 0.428571 0.071429 0.428571
02 0.357143 0.214286 0.357143 0.071429
03 0.000000 0.714286 0.000000 0.285714
04 0.000000 0.571429 0.000000 0.428571
05 0.428571 0.357143 0.071429 0.142857
06 0.000000 0.071429 0.071429 0.857143
07 0.071429 0.000000 0.928571 0.000000
08 0.000000 0.071429 0.000000 0.928571
09 0.000000 0.285714 0.000000 0.714286
10 0.142857 0.000000 0.785714 0.071429
I'm not clear what the real question is. Superficially, this should do the job:
awk '{ if (line++ == 0) { print "AC " FILENAME; } printf("%.2d %s\n", NR-1, $0); }
END { print "XX\n//" }' name_of_the_file.txt