pandas multiindex sort by specified rules

There is a DataFrame like the one below:
import numpy as np
import pandas as pd

arrays = [
    np.array(["baz", "baz", "bar", "bar", "qux", "foo"]),
    np.array(["yes", "no", "yes", "no", "yes", "no"]),
]
df = pd.DataFrame(np.random.randint(100, size=(6, 4)), index=arrays)
df
Now I want to know the yes_rate (yes/all) of each column.
My implementation is below:
first_index_list = list(df.index.get_level_values(0).unique())
for index in first_index_list:
    index_sum = df.loc[index].sum()
    if df.index.isin([(index, 'yes')]).any():
        yes_rate = df.loc[(index, 'yes')] / index_sum
        df.loc[(index, 'yes_rate'), :] = yes_rate
    df.loc[(index, 'All'), :] = index_sum
df.sort_index()
But the code is not ideal; there are some problems:
1. How to sort in the order below?
first index: [baz, bar, qux, foo], i.e. the order of first appearance
second index: [no, yes, All, yes_rate]
2. On repeated execution the All and yes_rate values should not change.
3. How to add only the yes and no rows together to generate All (note: yes and no are not guaranteed to exist)? Something like:
index_sum = 0
if df.index.isin([(index, 'yes')]).any():
    index_sum += df.loc[(index, 'yes')]
if df.index.isin([(index, 'no')]).any():
    index_sum += df.loc[(index, 'no')]

IIUC, you can use pandas.concat to concatenate in the desired order, then only sort the first level:
l = ['baz', 'bar', 'qux', 'foo']
order = pd.Series({k: v for v, k in enumerate(l)})

df_all = df.groupby(level=0).sum()

out = (pd
       .concat([df,
                pd.concat({'All': df_all,
                           'yes_rate': df.xs('yes', level=1).div(df_all)})
                  .dropna(how='all')
                  .swaplevel()
                ])
       .sort_index(level=0, key=order.reindex, sort_remaining=False)
      )
output:
                      0           1           2          3
baz yes       20.000000   97.000000   95.000000  38.000000
    no        85.000000   73.000000   23.000000  27.000000
    All      105.000000  170.000000  118.000000  65.000000
    yes_rate   0.190476    0.570588    0.805085   0.584615
bar yes       86.000000   32.000000   73.000000  16.000000
    no         9.000000   97.000000    2.000000  55.000000
    All       95.000000  129.000000   75.000000  71.000000
    yes_rate   0.905263    0.248062    0.973333   0.225352
qux yes       69.000000   16.000000   92.000000  82.000000
    All       69.000000   16.000000   92.000000  82.000000
    yes_rate   1.000000    1.000000    1.000000   1.000000
foo no        77.000000    5.000000   12.000000   3.000000
    All       77.000000    5.000000   12.000000   3.000000
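For reference, the key= callable receives the values of the sorted level and must return matching sort keys; order.reindex simply maps each label to its rank in l. A minimal sketch:

order.reindex(pd.Index(['qux', 'bar', 'baz']))
#qux    2
#bar    1
#baz    0
#dtype: int64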

def function1(ss: pd.Series):
    if ss.name == 'level_1':
        ss[6] = 'yes_rate'
        ss[-1] = 'ALL'
    else:
        total = ss.sum()            # take the sum before inserting the new rows
        ss[6] = ss.iloc[0] / total  # assumes the 'yes' row comes first in the group
        ss[-1] = total
    return ss.sort_index()

(df.reset_index()
   .groupby('level_0')
   .apply(lambda dd: dd.drop(columns='level_0', errors='ignore')  # drop the grouping column if apply passes it through
                       .apply(function1)
                       .set_index('level_1')))
Output:
                          0           1           2           3
level_0 level_1
bar     ALL      164.000000  136.000000  176.000000   83.000000
        yes       83.000000   70.000000   90.000000    1.000000
        no        81.000000   66.000000   86.000000   82.000000
        yes_rate   0.506098    0.514706    0.511364    0.012048
baz     ALL      100.000000   32.000000   35.000000  143.000000
        yes       31.000000    9.000000    8.000000   49.000000
        no        69.000000   23.000000   27.000000   94.000000
        yes_rate   0.310000    0.281250    0.228571    0.342657
foo     ALL       51.000000   28.000000   28.000000   34.000000
        no        51.000000   28.000000   28.000000   34.000000
        yes_rate   1.000000    1.000000    1.000000    1.000000
qux     ALL        6.000000   64.000000   32.000000   56.000000
        yes        6.000000   64.000000   32.000000   56.000000
        yes_rate   1.000000    1.000000    1.000000    1.000000


Extract a variable along a moving ship's track in time and space from the ERA5 reanalysis

I want to extract the measured wind at a station aboard a moving ship. I have the latitude, longitude, and time values, and the wind value for each time step in space. I can extract a fixed point in space for all time steps, but I would like to extract, for example, the wind at time step x at a given longitude and latitude as the ship moves. How can I do this from the code below?
data = xr.open_dataset('C:/Users/William Jacondino/Desktop/Dados/ERA5\\ERA5_2017.nc', decode_times=False)

dir_out = 'C:/Users/William Jacondino/Desktop/MovingShip'
if not os.path.exists(dir_out):
    os.makedirs(dir_out)

print("\nReading the observation station names:\n")
stations = pd.read_csv(r"C:/Users/William Jacondino/Desktop/MovingShip/Date-TIME.csv", index_col=0, sep=';')
print(stations)
Reading the observation station names:

                  Latitude  Longitude
Date-Time
16/11/2017 00:00  0.219547 -38.247914
16/11/2017 06:00  0.861717 -38.188858
16/11/2017 12:00  1.529534 -38.131039
16/11/2017 18:00  2.243760 -38.067467
17/11/2017 00:00  2.961202 -38.009050
...                    ...        ...
10/12/2017 00:00 -5.775127 -35.206581
10/12/2017 06:00 -5.775120 -35.206598
10/12/2017 12:00 -5.775119 -35.206583
10/12/2017 18:00 -5.775122 -35.206584
11/12/2017 00:00 -5.775115 -35.206590
# time variable and its unit
times = data.variables['time'][:]
unit = data.time.units
# latitude (lat) and longitude (lon) variables
lon = data.variables['longitude'][:]
lat = data.variables['latitude'][:]
# 2-metre temperature in Celsius (Kelvin minus 273.15)
temp = data.variables['t2m'][:] - 273.15
# 2-metre dew point temperature in Celsius
tempdw = data.variables['d2m'][:] - 273.15
# sea surface temperature (sst) in Celsius
sst = data.variables['sst'][:] - 273.15
# surface sensible heat flux (sshf)
sshf = data.variables['sshf'][:]
unitsshf = data.sshf.units
# surface latent heat flux
slhf = data.variables['slhf'][:]
unitslhf = data.slhf.units
# mean sea level pressure in hPa
msl = data.variables['msl'][:] / 100
unitmsl = data.msl.units
# total precipitation in mm/h
tp = data.variables['tp'][:] * 1000
# zonal wind component at 100 metres
uten100 = data.variables['u100'][:]
unitu100 = data.u100.units
# meridional wind component at 100 metres
vten100 = data.variables['v100'][:]
unitv100 = data.v100.units
# zonal wind component at 10 metres
uten = data.variables['u10'][:]
unitu = data.u10.units
# meridional wind component at 10 metres
vten = data.variables['v10'][:]
unitv = data.v10.units
# 10-metre wind speed
ws = (uten**2 + vten**2)**0.5
# 100-metre wind speed
ws100 = (uten100**2 + vten100**2)**0.5
# wind direction at 10 metres from the U and V components
wdir = (180 + np.degrees(np.arctan2(uten, vten))) % 360
# wind direction at 100 metres from the U and V components
wdir100 = (180 + np.degrees(np.arctan2(uten100, vten100))) % 360
for key, value in stations.iterrows():
    #print(key, value[0], value[1], value[2])
    station = value[0]
    file_name = "{}{}".format(station + '_1991', ".csv")
    #print(file_name)
    lon_point = value[1]
    lat_point = value[2]
    ########################################
    # Finding the latitude and longitude grid point closest to the station
    # Squared difference of lat and lon
    sq_diff_lat = (lat - lat_point)**2
    sq_diff_lon = (lon - lon_point)**2
    # Identifying the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    print("Generating time series for station {}".format(station))
    ref_date = datetime.datetime(int(unit[12:16]), int(unit[17:19]), int(unit[20:22]))
    date_range = list()
    temp_data = list()
    tempdw_data = list()
    sst_data = list()
    sshf_data = list()
    slhf_data = list()
    msl_data = list()
    tp_data = list()
    uten100_data = list()
    vten100_data = list()
    uten_data = list()
    vten_data = list()
    ws_data = list()
    ws100_data = list()
    wdir_data = list()
    wdir100_data = list()
    for index, time in enumerate(times):
        date_time = ref_date + datetime.timedelta(hours=int(time))
        date_range.append(date_time)
        temp_data.append(temp[index, min_index_lat, min_index_lon].values)
        tempdw_data.append(tempdw[index, min_index_lat, min_index_lon].values)
        sst_data.append(sst[index, min_index_lat, min_index_lon].values)
        sshf_data.append(sshf[index, min_index_lat, min_index_lon].values)
        slhf_data.append(slhf[index, min_index_lat, min_index_lon].values)
        msl_data.append(msl[index, min_index_lat, min_index_lon].values)
        tp_data.append(tp[index, min_index_lat, min_index_lon].values)
        uten100_data.append(uten100[index, min_index_lat, min_index_lon].values)
        vten100_data.append(vten100[index, min_index_lat, min_index_lon].values)
        uten_data.append(uten[index, min_index_lat, min_index_lon].values)
        vten_data.append(vten[index, min_index_lat, min_index_lon].values)
        ws_data.append(ws[index, min_index_lat, min_index_lon].values)
        ws100_data.append(ws100[index, min_index_lat, min_index_lon].values)
        wdir_data.append(wdir[index, min_index_lat, min_index_lon].values)
        wdir100_data.append(wdir100[index, min_index_lat, min_index_lon].values)
    ################################################################################
    #print(date_range)
    df = pd.DataFrame(date_range, columns=["Date-Time"])
    df["Date-Time"] = date_range
    df = df.set_index(["Date-Time"])
    df["WS10 ({})".format(unitu)] = ws_data
    df["WDIR10 ({})".format(units.deg)] = wdir_data
    df["WS100 ({})".format(unitu)] = ws100_data
    df["WDIR100 ({})".format(units.deg)] = wdir100_data
    df["Chuva ({})".format(units.mm)] = tp_data
    df["MSLP ({})".format(units.hPa)] = msl_data
    df["T2M ({})".format(units.degC)] = temp_data
    df["Td2M ({})".format(units.degC)] = tempdw_data
    df["Surface Sensible Heat Flux ({})".format(unitsshf)] = sshf_data
    df["Surface latent heat flux ({})".format(unitslhf)] = slhf_data
    df["U10 ({})".format(unitu)] = uten_data
    df["V10 ({})".format(unitv)] = vten_data
    df["U100 ({})".format(unitu100)] = uten100_data
    df["V100 ({})".format(unitv100)] = vten100_data
    df["TSM ({})".format(units.degC)] = sst_data
    print("The following time series is being saved as .csv files")
    df.to_csv(os.path.join(dir_out, file_name), sep=';', encoding="utf-8", index=True)

print("\n!!Successfully saved all the Time Series to the output directory!!\n{}".format(dir_out))
My code extracts a fixed variable at a given point in space, but I would like to extract along the ship's movement: for example, at time 11/12/2017 00:00, latitude -5.775115 and longitude -35.206590 I have one wind value, and at the next time step, at another latitude and longitude, I have another value. How can I adapt my code for this?
This is another perfect use case for xarray's advanced indexing! I feel like this part of the user guide needs a cape and a theme song :)
I'll use a made-up dataset and set of stations which are (I think) similar to yours. The first step is to reset your Date-Time index, so you can use it to pull the nearest time value from the xarray.Dataset, since you want a common index for the time, lat, and lon:
In [14]: stations = stations.reset_index(drop=False)
...: stations
Out[14]:
Date-Time Latitude Longitude
0 2017-11-16 00:00:00 0.219547 -38.247914
1 2017-11-16 06:00:00 0.861717 -38.188858
2 2017-11-16 12:00:00 1.529534 -38.131039
3 2017-11-16 18:00:00 2.243760 -38.067467
4 2017-11-17 00:00:00 2.961202 -38.009050
5 2017-12-10 00:00:00 -5.775127 -35.206581
6 2017-12-10 06:00:00 -5.775120 -35.206598
7 2017-12-10 12:00:00 -5.775119 -35.206583
8 2017-12-10 18:00:00 -5.775122 -35.206584
9 2017-12-11 00:00:00 -5.775115 -35.206590
In [15]: ds
Out[15]:
<xarray.Dataset>
Dimensions: (lat: 40, lon: 40, time: 241)
Coordinates:
* lat (lat) float64 -9.75 -9.25 -8.75 -8.25 -7.75 ... 8.25 8.75 9.25 9.75
* lon (lon) float64 -44.75 -44.25 -43.75 -43.25 ... -26.25 -25.75 -25.25
* time (time) datetime64[ns] 2017-11-01 2017-11-01T06:00:00 ... 2017-12-31
Data variables:
temp (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
tempdw (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
sst (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
Using the advanced indexing rules, if we select from the dataset using DataArrays as indexers, the result will be reshaped to match the indexer. What this means is that we can take your stations dataframe, which has the values time, lat, and lon, and pull the nearest indices from the xarray dataset:
In [16]: ds_over_observations = ds.sel(
...: time=stations["Date-Time"].to_xarray(),
...: lat=stations["Latitude"].to_xarray(),
...: lon=stations["Longitude"].to_xarray(),
...: method="nearest",
...: )
Now, our data has the same index as your dataframe!
In [17]: ds_over_observations
Out[17]:
<xarray.Dataset>
Dimensions: (index: 10)
Coordinates:
lat (index) float64 0.25 0.75 1.75 2.25 ... -5.75 -5.75 -5.75 -5.75
lon (index) float64 -38.25 -38.25 -38.25 ... -35.25 -35.25 -35.25
time (index) datetime64[ns] 2017-11-16 ... 2017-12-11
* index (index) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
temp (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
tempdw (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
sst (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
You can dump this into pandas with .to_dataframe:
In [18]: df = ds_over_observations.to_dataframe()
In [19]: df
Out[19]:
lat lon time temp tempdw sst ws ws100 wdir wdir100
index
0 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
1 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
3 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
4 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
5 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
6 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
7 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
8 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
9 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
The index in the result is the same one as the stations data. If you'd like, you could merge in the original values using pd.concat([stations, df], axis=1).set_index("Date-Time") to get your original index back, alongside all the weather data:
In [20]: pd.concat([stations, df], axis=1).set_index("Date-Time")
Out[20]:
Latitude Longitude lat lon time temp tempdw sst ws ws100 wdir wdir100
Date-Time
2017-11-16 00:00:00 0.219547 -38.247914 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
2017-11-16 06:00:00 0.861717 -38.188858 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2017-11-16 12:00:00 1.529534 -38.131039 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
2017-11-16 18:00:00 2.243760 -38.067467 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
2017-11-17 00:00:00 2.961202 -38.009050 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
2017-12-10 00:00:00 -5.775127 -35.206581 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
2017-12-10 06:00:00 -5.775120 -35.206598 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
2017-12-10 12:00:00 -5.775119 -35.206583 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
2017-12-10 18:00:00 -5.775122 -35.206584 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
2017-12-11 00:00:00 -5.775115 -35.206590 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
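Translated back to the question's setup, the same pattern would look roughly like this. This is a sketch, not a tested script: it assumes decode_times=True (the default) so that time can be matched directly, uses the latitude/longitude coordinate names from the question's file, and shortens the paths:

import pandas as pd
import xarray as xr

data = xr.open_dataset('ERA5_2017.nc')  # let xarray decode the times

ship = (pd.read_csv('Date-TIME.csv', index_col=0, sep=';',
                    parse_dates=True, dayfirst=True)
          .reset_index())

# vectorized "nearest" selection along the ship track:
# one (time, lat, lon) triple per row of the ship table
track = data.sel(
    time=ship['Date-Time'].to_xarray(),
    latitude=ship['Latitude'].to_xarray(),
    longitude=ship['Longitude'].to_xarray(),
    method='nearest',
)

track.to_dataframe().to_csv('MovingShip_track.csv', sep=';')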

Find top 3 highest values across 3 columns row-wise pandas

I have four columns of values consisting of the looking times to a right/left/upper/lower positioned image. Another column shows the position of the image which was chosen (decision_resp). I created a new column showing the looking time of the chosen image. Now I want to create 3 more columns showing the looking times of the not chosen images sorted by highest looking time (top1), second highest looking time (toop2) and third highest looking time (top 3). The looking time of the chosen image has to be excluded.
These are the columns I have:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
My approach was:
# create new column for looking time of chosen image
trials.loc[trials['decision_resp']=='right','chosen_img_et'] = trials['lookRight_t']
trials.loc[trials['decision_resp']=='left','chosen_img_et'] = trials['lookLeft_t']
trials.loc[trials['decision_resp']=='down','chosen_img_et'] = trials['lookDown_t']
trials.loc[trials['decision_resp']=='up','chosen_img_et'] = trials['lookUp_t']
# here I got stuck
trials.loc[trials['decision_resp']=='right', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookDown_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='left', 3 new columns (top1/2/3)] = trials[['lookRight_t', 'lookDown_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='down', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookRight_t', 'lookUp_t']].find max values and put it in order
trials.loc[trials['decision_resp']=='up', 3 new columns (top1/2/3)] = trials[['lookLeft_t', 'lookDown_t', 'lookRight_t']].find max values and put it in order
Thank you for any help!
First, you can build the new column chosen_img_et with a lookup solution.
Then sort the selected columns with numpy.sort and use indexing to keep the top 3 per row without the chosen value: compare the column names against column decision_resp with broadcasting, and set the chosen values to missing with DataFrame.mask:
cols = ['lookRight_t', 'lookLeft_t', 'lookUp_t', 'lookDown_t']
#replace substrings so the names match column decision_resp
look = [x.replace('look', '').replace('_t', '').lower() for x in cols]
new = [f'top{x+1}' for x in range(3)]
#lookup of the chosen value
idx1, cols1 = pd.factorize(trials['decision_resp'])
trials['chosen_img_et'] = (trials[cols].set_axis(look, axis=1)
                                       .reindex(cols1, axis=1)
                                       .to_numpy()[np.arange(len(trials)), idx1])
mask = np.array(look) == trials['decision_resp'].to_numpy()[:, None]
#np.sort sorts ascending and puts NaN last, so reverse with [2::-1]
#for descending order; starting at index 2 also drops the NaN column
trials[new] = np.sort(trials[cols].mask(mask), axis=1)[:, 2::-1]
print (trials)
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000
Details:
print (trials[cols].mask(mask))
lookRight_t lookLeft_t lookUp_t lookDown_t
0 1.291667 1.325000 NaN 1.141667
1 0.000000 0.000000 1.125000 NaN
2 0.000000 0.000000 NaN 2.275000
3 NaN 1.950000 0.000000 0.000000
4 NaN 1.316667 1.341667 0.000000
5 1.766667 1.333333 0.825000 NaN
6 0.000000 0.000000 1.108333 NaN
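To see why [:, 2::-1] yields the top 3, note that np.sort sorts ascending and places NaN last; a minimal sketch on one row:

row = np.array([1.291667, 1.325, np.nan, 1.141667])
np.sort(row)         #[1.141667, 1.291667, 1.325, nan]
np.sort(row)[2::-1]  #[1.325, 1.291667, 1.141667] -> top1, top2, top3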
Another approach:
Find the max column name and extract the direction from it.
Find the max for each row.
Find the remaining 3 values and sort them as desired.
Concat the results together, renaming columns as I go.
decision_resp = df.idxmax(axis=1).str.extract(r'look(\w*)_t', expand=False)
decision_resp.rename('decision_resp', inplace=True)
chosen_img_et = df.max(axis=1, numeric_only=True)
chosen_img_et.rename('chosen_img_et', inplace=True)
top3 = df.apply(lambda x: x.nlargest(4).sort_values(ascending=False, ignore_index=True)[1:], axis=1)
top3.columns = ['top1', 'top2', 'top3']
df = pd.concat([df, decision_resp, chosen_img_et, top3], axis=1)
print(df)
Output:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 Up 3.025000
1 0.000000 0.000000 1.125000 3.150000 Down 3.150000
2 0.000000 0.000000 3.508333 2.275000 Up 3.508333
3 3.700000 1.950000 0.000000 0.000000 Right 3.700000
4 2.633333 1.316667 1.341667 0.000000 Right 2.633333
5 1.766667 1.333333 0.825000 2.208333 Down 2.208333
6 0.000000 0.000000 1.108333 5.283333 Down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000
Another way, addressing jezrael's concerns:
col_list = ['lookRight_t', 'lookLeft_t', 'lookUp_t', 'lookDown_t']
idx, cols = pd.factorize('look' + df['decision_resp'].str.title() + '_t')
df['chosen_img_et'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
mask = np.array(col_list) == ('look' + df['decision_resp'].str.title() + '_t').to_numpy()[:, None]
df[[f'top{x+1}' for x in range(3)]] = np.sort(df[col_list].mask(mask), axis=1)[:, 2::-1]
Output:
lookRight_t lookLeft_t lookUp_t lookDown_t decision_resp chosen_img_et \
0 1.291667 1.325000 3.025000 1.141667 up 3.025000
1 0.000000 0.000000 1.125000 3.150000 down 3.150000
2 0.000000 0.000000 3.508333 2.275000 up 3.508333
3 3.700000 1.950000 0.000000 0.000000 right 3.700000
4 2.633333 1.316667 1.341667 0.000000 right 2.633333
5 1.766667 1.333333 0.825000 2.208333 down 2.208333
6 0.000000 0.000000 1.108333 5.283333 down 5.283333
top1 top2 top3
0 1.325000 1.291667 1.141667
1 1.125000 0.000000 0.000000
2 2.275000 0.000000 0.000000
3 1.950000 0.000000 0.000000
4 1.341667 1.316667 0.000000
5 1.766667 1.333333 0.825000
6 1.108333 0.000000 0.000000

missing observation panel data, bring forward value 20 periods

Here's how to read in a DataFrame like the one I'm looking at:
df = pd.DataFrame({
    'period': [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 21, 22,
               23, 25, 26],
    'id': [1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285,
           1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285],
    'pred': [-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775],
    'ret': [None, -0.02222222, -0.01363636, 0., -0.02764977,
            None, -0.00909091, -0.01376147, 0.00465116, None,
            0.01869159, 0., 0., None, -0.00460829,
            0.00462963, 0.02304147, 0., None, -0.00050756]})
Which will look like this when read in.
period id pred ret
0 1 1285 -1.653477 NaN
1 2 1285 -1.653477 -0.022222
2 3 1285 -1.653477 -0.013636
3 4 1285 -1.653477 0.000000
4 5 1285 -1.653477 -0.027650
5 8 1285 -1.653477 NaN
6 9 1285 -1.653477 -0.009091
7 10 1285 -1.653477 -0.013761
8 11 1285 -1.653477 0.004651
9 13 1285 -1.653477 NaN
10 14 1285 -1.653477 0.018692
11 15 1285 -1.653477 0.000000
12 16 1285 -1.653477 0.000000
13 19 1285 -1.653477 NaN
14 20 1285 -1.653477 -0.004608
15 21 1285 -1.653477 0.004630
16 22 1285 -1.653477 0.023041
17 23 1285 -1.653477 0.000000
18 25 1285 -1.653477 NaN
19 26 1285 -1.653477 -0.000508
pred is a 20-period-ahead prediction, so what I want to do is bring the returns back 20 periods (but in a flexible way).
Here's the lag function I have presently:
def lag(df, col, lag_dist=1, ref='period', group='id'):
    df = df.copy()
    new_col = 'lag' + str(lag_dist) + '_' + col
    df[new_col] = df.groupby(group)[col].shift(lag_dist)
    # set NaN where the period distance differs from the one specified
    df[new_col] = (df.groupby(group)[ref]
                     .shift(lag_dist)
                     .sub(df[ref])
                     .eq(-lag_dist)
                     .mul(1)
                     .replace(0, np.nan) * df[new_col])
    return df[new_col]
but when I run
df['fut20_ret'] = lag(df, 'ret', -20, 'period')
df.head(20)
I get
period id pred gain fee prc ret fut20_ret
0 1 1285 -1.653478 0.000000 0.87 1.000000 NaN NaN
1 2 1285 -1.653478 -0.022222 0.87 0.977778 -0.022222 NaN
2 3 1285 -1.653478 -0.035556 0.87 0.964444 -0.013636 NaN
3 4 1285 -1.653478 -0.035556 0.87 0.964444 0.000000 NaN
4 5 1285 -1.653478 -0.062222 0.87 0.937778 -0.027650 NaN
6 8 1285 -1.653478 -0.022222 0.87 0.977778 NaN NaN
7 9 1285 -1.653478 -0.031111 0.87 0.968889 -0.009091 NaN
8 10 1285 -1.653478 -0.044444 0.87 0.955556 -0.013761 NaN
9 11 1285 -1.653478 -0.040000 0.87 0.960000 0.004651 NaN
10 13 1285 -1.653478 -0.048889 0.87 0.951111 NaN NaN
11 14 1285 -1.653478 -0.031111 0.87 0.968889 0.018692 NaN
12 15 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
13 16 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
15 19 1285 -1.653478 -0.035556 0.87 0.964444 NaN NaN
16 20 1285 -1.653478 -0.040000 0.87 0.960000 -0.004608 NaN
17 21 1285 -1.653478 -0.035556 0.87 0.964444 0.004630 NaN
18 22 1285 -1.653478 -0.013333 0.87 0.986667 0.023041 NaN
19 23 1285 -1.653478 -0.013333 0.87 0.986667 0.000000 NaN
How can I modify my lag function so that it works properly? It's close but I'm struggling on the last little bit.
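One possible direction (a sketch, not from the thread; lead_by_period is a hypothetical helper): the row-based shift(-20) fails here because it moves by position, and with only 20 rows per id every shifted value falls off the end. Looking the value up by period arithmetic instead handles the gaps naturally:

def lead_by_period(df, col, dist=20, ref='period', group='id'):
    # map (id, period) -> value, then read off (id, period + dist);
    # (id, period) pairs missing from the panel come back as NaN
    s = df.set_index([group, ref])[col]
    target = pd.MultiIndex.from_arrays([df[group], df[ref] + dist])
    return pd.Series(s.reindex(target).to_numpy(), index=df.index)

df['fut20_ret'] = lead_by_period(df, 'ret', 20)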

How to calculate the aggregate variance in a pivot table

When I use aggfunc = np.var in a pivot table, the values of the metrics become NaN, but with aggfunc = np.sum they don't.
Why are the original values changed with aggfunc = np.var or aggfunc = np.std? I could not find an answer in the pivot table docs.
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

print(df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=np.sum,
    margins=True,
    margins_name='sum',
    dropna=False,
))
print('-' * 100)

df = df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=np.var,
    margins=True,
    margins_name='var',
    dropna=False,
)
print(df)
              D                      E
C         large  small  sum     large  small  sum
A   B
bar one     4.0    5.0    9       6.0    8.0   14
    two     7.0    6.0   13       9.0    9.0   18
foo one     4.0    1.0    5       9.0    2.0   11
    two     NaN    6.0    6       NaN   11.0   11
sum        15.0   18.0   33      24.0   30.0   54
----------------------------------------------------------------------
                D                                E
C           large  small       var           large  small       var
A   B
bar one       NaN    NaN  0.500000             NaN    NaN  2.000000
    two       NaN    NaN  0.500000             NaN    NaN  0.000000
foo one  0.000000    NaN  0.333333        0.500000    NaN  2.333333
    two       NaN    0.0  0.000000             NaN    0.5  0.500000
var      5.583333    3.8  3.555556        4.666667    7.5  4.888889
What's more, I expected the var of D = large to be np.var([4.0, 7.0, 4.0]) = 2.0 (the variance of the summed values), instead of 5.583333.
What I expected is:
               D                        E
C          large  small   var      large  small    var
A   B
bar one      4.0    5.0  0.25        6.0    8.0   1.00
    two      7.0    6.0  0.25        9.0    9.0   0.00
foo one      4.0    1.0  2.25        9.0    2.0  12.25
    two      NaN    6.0  0.00        NaN   11.0   0.00
var          2.0   4.25  3.60        2.0  11.25   7.34
What is the meaning of aggfunc = np.var in a pivot table?
Pandas uses ddof = 1 by default; see here for details on np.var.
When you have just one value, the variance with ddof = 1 is NaN, since you divide by zero.
The var of D = large is np.var([2, 2, 4, 7], ddof=1) = 5.583333333333333, so everything is correct (you have to use the individual values, not the sums).
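A quick check of the ddof behaviour, using the individual D values where C == 'large' from the frame above:

np.var([4.0], ddof=1)         #nan: a single value divides by len - 1 = 0
np.var([2, 2, 4, 7], ddof=1)  #5.583333... (the 'large' margin above)
np.var([2, 2, 4, 7], ddof=0)  #4.1875 (matches the var0 margin below)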
If you need var with ddof = 0, you can provide your own function:
def var0(x):
    return np.var(x, ddof=0)

print(df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=var0,
    margins=True,
    margins_name='var',
    dropna=False,
))
Result:
               D                            E
C          large  small       var       large  small       var
A   B
bar one   0.0000   0.00  0.250000        0.00   0.00  1.000000
    two   0.0000   0.00  0.250000        0.00   0.00  0.000000
foo one   0.0000   0.00  0.222222        0.25   0.00  1.555556
    two      NaN   0.00  0.000000         NaN   0.25  0.250000
var       4.1875   3.04  3.555556        3.50   6.00  4.888889
UPDATE based on the edited question.
Pivot table with the sums of C and, additionally, the var of the sums as margin columns/row.
We first create the sum pivot table with margin columns/row named var. Then we update these margin columns/row with the var of the sum table:
dfs = df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=np.sum,
    margins=True,
    margins_name='var',
    dropna=False)

dfs[[('D', 'var'), ('E', 'var')]] = df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=np.sum,
    dropna=False).stack().groupby(level=(0, 1)).apply(var0)

dfs.iloc[-1] = dfs.iloc[:-1].apply(var0)
Result:
               D                             E
C          large  small       var        large  small        var
A   B
bar one      4.0   5.00  0.250000          6.0   8.00   1.000000
    two      7.0   6.00  0.250000          9.0   9.00   0.000000
foo one      4.0   1.00  2.250000          9.0   2.00  12.250000
    two      NaN   6.00  0.000000          NaN  11.00   0.000000
var          2.0   4.25  0.824219          2.0  11.25  26.792969
In the margin row (last row) the var columns are calculated as the var of the row vars. I don't understand how the OP calculated his values for these two cells. Anyway they don't seem to make much sense.

Select subset by a conditional expression from a pandas DataFrame, but get an error

A sample like this:
In [39]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [40]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [42]: t['shift_one'] = t.base - t.base.shift(1)
In [43]: t['shift_two'] = t.shift_one.shift(1)
In [44]: t
Out[44]:
base shift_one shift_two
2000-01-01 -1.239924 NaN NaN
2000-01-02 1.116260 2.356184 NaN
2000-01-03 0.401853 -0.714407 2.356184
2000-01-04 -0.823275 -1.225128 -0.714407
2000-01-05 -0.562103 0.261171 -1.225128
2000-01-06 0.347143 0.909246 0.261171
.............
2000-01-20 -0.062557 -0.467466 0.512293
Now, if we use t[t.shift_one > 0], it works OK, but when we use:
In [48]: t[t.shift_one > 0 and t.shift_two < 0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
----> 1 t[t.shift_one > 0 and t.shift_two < 0]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Suppose we want to get a subset that satisfies both conditions. How can we do that? Thanks a lot.
You need parentheses, and use &, not and.
See the docs here:
http://pandas.pydata.org/pandas-docs/dev/indexing.html#boolean-indexing
In [3]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [4]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [5]: t['shift_one'] = t.base - t.base.shift(1)
In [6]: t['shift_two'] = t.shift_one.shift(1)
In [7]: t
Out[7]:
base shift_one shift_two
2000-01-01 -1.116040 NaN NaN
2000-01-02 1.592079 2.708118 NaN
2000-01-03 0.958642 -0.633436 2.708118
2000-01-04 0.431970 -0.526672 -0.633436
2000-01-05 1.275624 0.843654 -0.526672
2000-01-06 0.498401 -0.777223 0.843654
2000-01-07 -0.351465 -0.849865 -0.777223
2000-01-08 -0.458742 -0.107277 -0.849865
2000-01-09 -2.100404 -1.641662 -0.107277
2000-01-10 0.601658 2.702062 -1.641662
2000-01-11 -2.026495 -2.628153 2.702062
2000-01-12 0.391426 2.417921 -2.628153
2000-01-13 -1.177292 -1.568718 2.417921
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-15 0.338649 0.713192 0.802749
2000-01-16 -1.124820 -1.463469 0.713192
2000-01-17 0.484175 1.608995 -1.463469
2000-01-18 -1.477772 -1.961947 1.608995
2000-01-19 0.481843 1.959615 -1.961947
2000-01-20 0.760168 0.278325 1.959615
In [8]: t[(t.shift_one>0) & (t.shift_two<0)]
Out[8]:
base shift_one shift_two
2000-01-05 1.275624 0.843654 -0.526672
2000-01-10 0.601658 2.702062 -1.641662
2000-01-12 0.391426 2.417921 -2.628153
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-17 0.484175 1.608995 -1.463469
2000-01-19 0.481843 1.959615 -1.961947
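As an aside, DataFrame.query parses the English boolean operators itself, so an equivalent spelling that does accept and is:

In [9]: t.query('shift_one > 0 and shift_two < 0')

which returns the same subset.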