Pandas - Add a new calculated column to a MultiIndex column dataframe

I have a DataFrame with the following structure:
import numpy as np
import pandas as pd

np.random.seed(1)
mi = pd.MultiIndex.from_product([[3, 5], ["X", "Y", "V", "T"]], names=["Node", "Parameter"])
df = pd.DataFrame(index=pd.DatetimeIndex(['2022-07-07 12:00:00', '2022-07-07 13:00:00',
                                          '2022-07-07 14:00:00', '2022-07-07 15:00:00',
                                          '2022-07-07 16:00:00'],
                                         dtype='datetime64[ns]', name='Date', freq=None),
                  columns=mi, data=np.random.rand(5, 8))
print(df)
Node 3 5
Parameter X Y V T X Y V T
Date
2022-07-07 12:00:00 0.417022 0.720324 0.000114 0.302333 0.146756 0.092339 0.186260 0.345561
2022-07-07 13:00:00 0.396767 0.538817 0.419195 0.685220 0.204452 0.878117 0.027388 0.670468
2022-07-07 14:00:00 0.417305 0.558690 0.140387 0.198101 0.800745 0.968262 0.313424 0.692323
2022-07-07 15:00:00 0.876389 0.894607 0.085044 0.039055 0.169830 0.878143 0.098347 0.421108
2022-07-07 16:00:00 0.957890 0.533165 0.691877 0.315516 0.686501 0.834626 0.018288 0.750144
I would like to add a new calculated column "Z" for each Node, computed as "X" ** 2 + "Y" ** 2.
The following achieves the desired result:
x = df.loc[:, (slice(None), "X")]
y = df.loc[:, (slice(None), "Y")]
z = (x ** 2).rename(columns={"X": "Z"}) + (y ** 2).rename(columns={"Y": "Z"})
result = df.join(z).sort_index(axis=1)
Is there a more straightforward way to achieve this?
For example, using df.xs to select the desired column data, e.g. df.xs("X", axis=1, level=1) ** 2 + df.xs("Y", axis=1, level=1) ** 2, how can I then assign the result to the original dataframe?

You can use groupby.apply:
(df.groupby(level='Node', axis=1)
   .apply(lambda g: g.droplevel('Node', axis=1).eval('Z = X**2 + Y**2'))
)
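Note that groupby(..., axis=1) is deprecated as of pandas 2.1. A sketch of the same idea that avoids it, by stacking the Node level into the row index first (on pandas 2.1+ you may want stack('Node', future_stack=True) to silence a FutureWarning); the Parameter level comes back alphabetically sorted:
result = (
    df.stack('Node')              # rows become (Date, Node); columns are just Parameter
      .eval('Z = X**2 + Y**2')    # plain column names are visible to eval now
      .unstack('Node')            # back to columns: (Parameter, Node)
      .swaplevel(axis=1)          # reorder to (Node, Parameter)
      .sort_index(axis=1)
)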
Or, with xs and drop_level=False on one of the values:
(df.join((df.xs('X', axis=1, level=1, drop_level=False)**2
          + df.xs('Y', axis=1, level=1)**2
          ).rename(columns={'X': 'Z'}, level=1)
         )
   .sort_index(axis=1, level=0, sort_remaining=False)
)
Output:
Node 3 5
Parameter X Y V T Z X Y V T Z
Date
2022-07-07 12:00:00 0.417022 0.720324 0.000114 0.302333 0.692775 0.146756 0.092339 0.186260 0.345561 0.030064
2022-07-07 13:00:00 0.396767 0.538817 0.419195 0.685220 0.447748 0.204452 0.878117 0.027388 0.670468 0.812891
2022-07-07 14:00:00 0.417305 0.558690 0.140387 0.198101 0.486278 0.800745 0.968262 0.313424 0.692323 1.578722
2022-07-07 15:00:00 0.876389 0.894607 0.085044 0.039055 1.568379 0.169830 0.878143 0.098347 0.421108 0.799977
2022-07-07 16:00:00 0.957890 0.533165 0.691877 0.315516 1.201818 0.686501 0.834626 0.018288 0.750144 1.167884

One option is with DataFrame.xs:
out = df.xs('X', axis=1, level=1).pow(2).add(df.xs('Y', axis=1, level=1).pow(2))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis=1).sort_index(axis=1)
Node 3 5
Parameter T V X Y Z T V X Y Z
Date
2022-07-07 12:00:00 0.302333 0.000114 0.417022 0.720324 0.692775 0.345561 0.186260 0.146756 0.092339 0.030064
2022-07-07 13:00:00 0.685220 0.419195 0.396767 0.538817 0.447748 0.670468 0.027388 0.204452 0.878117 0.812891
2022-07-07 14:00:00 0.198101 0.140387 0.417305 0.558690 0.486278 0.692323 0.313424 0.800745 0.968262 1.578722
2022-07-07 15:00:00 0.039055 0.085044 0.876389 0.894607 1.568379 0.421108 0.098347 0.169830 0.878143 0.799977
2022-07-07 16:00:00 0.315516 0.691877 0.957890 0.533165 1.201818 0.750144 0.018288 0.686501 0.834626 1.167884
Another option is to select the X and Y columns, square them in one go with pow, then group and concatenate:
out = (df
       .loc(axis=1)[:, ['X', 'Y']]
       .pow(2)
       .groupby(level='Node', axis=1)
       .agg(np.add.reduce, axis=1))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis=1).sort_index(axis=1)
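To answer the follow-up in the question directly (how to assign an xs-style computation back to the original frame), plain tuple-keyed assignment per node also works; a minimal sketch:
for node in df.columns.get_level_values('Node').unique():
    df[(node, 'Z')] = df[(node, 'X')] ** 2 + df[(node, 'Y')] ** 2
df = df.sort_index(axis=1)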


Extract a time and space variable from a moving ship from the ERA5 reanalysis

I want to extract the measured wind at a station aboard a moving ship. I have the latitude, longitude and time of the ship, and the wind value for each time step in space. I can extract a fixed point in space for all time steps, but I would like to extract, for example, the wind at time step x at the corresponding date, longitude and latitude as the ship moves. How can I do this from the code below?
import os
import datetime

import numpy as np
import pandas as pd
import xarray as xr

data = xr.open_dataset('C:/Users/William Jacondino/Desktop/Dados/ERA5\\ERA5_2017.nc', decode_times=False)
dir_out = 'C:/Users/William Jacondino/Desktop/MovingShip'
if not os.path.exists(dir_out):
    os.makedirs(dir_out)
print("\nReading the observation station names:\n")
stations = pd.read_csv(r"C:/Users/William Jacondino/Desktop/MovingShip/Date-TIME.csv",
                       index_col=0, sep=';')
print(stations)
Reading the observation station names:
Latitude Longitude
Date-Time
16/11/2017 00:00 0.219547 -38.247914
16/11/2017 06:00 0.861717 -38.188858
16/11/2017 12:00 1.529534 -38.131039
16/11/2017 18:00 2.243760 -38.067467
17/11/2017 00:00 2.961202 -38.009050
... ... ...
10/12/2017 00:00 -5.775127 -35.206581
10/12/2017 06:00 -5.775120 -35.206598
10/12/2017 12:00 -5.775119 -35.206583
10/12/2017 18:00 -5.775122 -35.206584
11/12/2017 00:00 -5.775115 -35.206590
# time variable and its unit
times = data.variables['time'][:]
unit = data.time.units
# latitude (lat) and longitude (lon) variables
lon = data.variables['longitude'][:]
lat = data.variables['latitude'][:]
# 2-metre temperature in Celsius (Kelvin minus 273.15)
temp = data.variables['t2m'][:] - 273.15
# 2-metre dew point temperature in Celsius
tempdw = data.variables['d2m'][:] - 273.15
# sea surface temperature (sst) in Celsius
sst = data.variables['sst'][:] - 273.15
# surface sensible heat flux (sshf)
sshf = data.variables['sshf'][:]
unitsshf = data.sshf.units
# surface latent heat flux
slhf = data.variables['slhf'][:]
unitslhf = data.slhf.units
# mean sea level pressure
msl = data.variables['msl'][:] / 100
unitmsl = data.msl.units
# total precipitation in mm/h
tp = data.variables['tp'][:] * 1000
# zonal wind component at 100 metres
uten100 = data.variables['u100'][:]
unitu100 = data.u100.units
# meridional wind component at 100 metres
vten100 = data.variables['v100'][:]
unitv100 = data.v100.units
# zonal wind component at 10 metres
uten = data.variables['u10'][:]
unitu = data.u10.units
# meridional wind component at 10 metres
vten = data.variables['v10'][:]
unitv = data.v10.units
# 10-metre wind speed from the U and V components
ws = (uten**2 + vten**2)**0.5
# 100-metre wind speed from the U and V components
ws100 = (uten100**2 + vten100**2)**0.5
# 10-metre wind direction from the U and V components
wdir = (180 + np.degrees(np.arctan2(uten, vten))) % 360
# 100-metre wind direction from the U and V components
wdir100 = (180 + np.degrees(np.arctan2(uten100, vten100))) % 360
for key, value in stations.iterrows():
    # print(key, value[0], value[1], value[2])
    station = value[0]
    file_name = "{}{}".format(station + '_1991', ".csv")
    # print(file_name)
    lon_point = value[1]
    lat_point = value[2]
    ########################################
    # Find the latitude and longitude grid point nearest to the station
    # Squared difference of lat and lon
    sq_diff_lat = (lat - lat_point)**2
    sq_diff_lon = (lon - lon_point)**2
    # Identify the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    print("Generating time series for station {}".format(station))
    ref_date = datetime.datetime(int(unit[12:16]), int(unit[17:19]), int(unit[20:22]))
    date_range = list()
    temp_data = list()
    tempdw_data = list()
    sst_data = list()
    sshf_data = list()
    slhf_data = list()
    msl_data = list()
    tp_data = list()
    uten100_data = list()
    vten100_data = list()
    uten_data = list()
    vten_data = list()
    ws_data = list()
    ws100_data = list()
    wdir_data = list()
    wdir100_data = list()
    for index, time in enumerate(times):
        date_time = ref_date + datetime.timedelta(hours=int(time))
        date_range.append(date_time)
        temp_data.append(temp[index, min_index_lat, min_index_lon].values)
        tempdw_data.append(tempdw[index, min_index_lat, min_index_lon].values)
        sst_data.append(sst[index, min_index_lat, min_index_lon].values)
        sshf_data.append(sshf[index, min_index_lat, min_index_lon].values)
        slhf_data.append(slhf[index, min_index_lat, min_index_lon].values)
        msl_data.append(msl[index, min_index_lat, min_index_lon].values)
        tp_data.append(tp[index, min_index_lat, min_index_lon].values)
        uten100_data.append(uten100[index, min_index_lat, min_index_lon].values)
        vten100_data.append(vten100[index, min_index_lat, min_index_lon].values)
        uten_data.append(uten[index, min_index_lat, min_index_lon].values)
        vten_data.append(vten[index, min_index_lat, min_index_lon].values)
        ws_data.append(ws[index, min_index_lat, min_index_lon].values)
        ws100_data.append(ws100[index, min_index_lat, min_index_lon].values)
        wdir_data.append(wdir[index, min_index_lat, min_index_lon].values)
        wdir100_data.append(wdir100[index, min_index_lat, min_index_lon].values)
    ################################################################################
    # print(date_range)
    df = pd.DataFrame(date_range, columns=["Date-Time"])
    df["Date-Time"] = date_range
    df = df.set_index(["Date-Time"])
    # note: `units` below comes from elsewhere in the original script and is kept as-is
    df["WS10 ({})".format(unitu)] = ws_data
    df["WDIR10 ({})".format(units.deg)] = wdir_data
    df["WS100 ({})".format(unitu)] = ws100_data
    df["WDIR100 ({})".format(units.deg)] = wdir100_data
    df["Rain ({})".format(units.mm)] = tp_data
    df["MSLP ({})".format(units.hPa)] = msl_data
    df["T2M ({})".format(units.degC)] = temp_data
    df["Td2M ({})".format(units.degC)] = tempdw_data
    df["Surface sensible heat flux ({})".format(unitsshf)] = sshf_data
    df["Surface latent heat flux ({})".format(unitslhf)] = slhf_data
    df["U10 ({})".format(unitu)] = uten_data
    df["V10 ({})".format(unitv)] = vten_data
    df["U100 ({})".format(unitu100)] = uten100_data
    df["V100 ({})".format(unitv100)] = vten100_data
    df["SST ({})".format(units.degC)] = sst_data
    print("The following time series is being saved as .csv files")
    df.to_csv(os.path.join(dir_out, file_name), sep=';', encoding="utf-8", index=True)
print("\n!!Successfully saved all the time series to the output directory!!\n{}".format(dir_out))
My code extracts a fixed variable at a given point in space, but I would like to extract along the ship's track: for example, at time 11/12/2017 00:00, latitude -5.775115 and longitude -35.206590 I have one wind value, and at the next time step, at another latitude and longitude, I have another value. How can I adapt my code for this?
This is another perfect use case for xarray's advanced indexing! I feel like this part of the user guide needs a cape and a theme song :)
I'll use a made-up dataset and set of stations which are similar (I think) to yours. The first step is to reset your Date-Time index, so you can use it in pulling the nearest time value from the xarray.Dataset, since you want a common index for the time, lat, and lon:
In [14]: stations = stations.reset_index(drop=False)
...: stations
Out[14]:
Date-Time Latitude Longitude
0 2017-11-16 00:00:00 0.219547 -38.247914
1 2017-11-16 06:00:00 0.861717 -38.188858
2 2017-11-16 12:00:00 1.529534 -38.131039
3 2017-11-16 18:00:00 2.243760 -38.067467
4 2017-11-17 00:00:00 2.961202 -38.009050
5 2017-12-10 00:00:00 -5.775127 -35.206581
6 2017-12-10 06:00:00 -5.775120 -35.206598
7 2017-12-10 12:00:00 -5.775119 -35.206583
8 2017-12-10 18:00:00 -5.775122 -35.206584
9 2017-12-11 00:00:00 -5.775115 -35.206590
In [15]: ds
Out[15]:
<xarray.Dataset>
Dimensions: (lat: 40, lon: 40, time: 241)
Coordinates:
* lat (lat) float64 -9.75 -9.25 -8.75 -8.25 -7.75 ... 8.25 8.75 9.25 9.75
* lon (lon) float64 -44.75 -44.25 -43.75 -43.25 ... -26.25 -25.75 -25.25
* time (time) datetime64[ns] 2017-11-01 2017-11-01T06:00:00 ... 2017-12-31
Data variables:
temp (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
tempdw (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
sst (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
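(For completeness: a dataset with this shape can be mocked up roughly as below; the random values and the 0.5-degree grid are assumptions chosen to match the repr above.)
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
lat = np.arange(-9.75, 10, 0.5)                              # 40 points
lon = np.arange(-44.75, -25, 0.5)                            # 40 points
time = pd.date_range("2017-11-01", "2017-12-31", freq="6h")  # 241 steps
names = ["temp", "tempdw", "sst", "ws", "ws100", "wdir", "wdir100"]
ds = xr.Dataset(
    {name: (("lat", "lon", "time"), rng.random((lat.size, lon.size, time.size)))
     for name in names},
    coords={"lat": lat, "lon": lon, "time": time},
)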
Using the advanced indexing rules, if we select from the dataset using DataArrays as indexers, the result will be reshaped to match the indexer. What this means is that we can take your stations dataframe, which has the values time, lat, and lon, and pull the nearest indices from the xarray dataset:
In [16]: ds_over_observations = ds.sel(
...: time=stations["Date-Time"].to_xarray(),
...: lat=stations["Latitude"].to_xarray(),
...: lon=stations["Longitude"].to_xarray(),
...: method="nearest",
...: )
Now, our data has the same index as your dataframe!
In [17]: ds_over_observations
Out[17]:
<xarray.Dataset>
Dimensions: (index: 10)
Coordinates:
lat (index) float64 0.25 0.75 1.75 2.25 ... -5.75 -5.75 -5.75 -5.75
lon (index) float64 -38.25 -38.25 -38.25 ... -35.25 -35.25 -35.25
time (index) datetime64[ns] 2017-11-16 ... 2017-12-11
* index (index) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
temp (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
tempdw (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
sst (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
You can dump this into pandas with .to_dataframe:
In [18]: df = ds_over_observations.to_dataframe()
In [19]: df
Out[19]:
lat lon time temp tempdw sst ws ws100 wdir wdir100
index
0 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
1 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
3 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
4 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
5 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
6 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
7 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
8 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
9 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
The index in the result is the same one as the stations data. If you'd like, you could merge in the original values using pd.concat([stations, df], axis=1).set_index("Date-Time") to get your original index back, alongside all the weather data:
In [20]: pd.concat([stations, df], axis=1).set_index("Date-Time")
Out[20]:
Latitude Longitude lat lon time temp tempdw sst ws ws100 wdir wdir100
Date-Time
2017-11-16 00:00:00 0.219547 -38.247914 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
2017-11-16 06:00:00 0.861717 -38.188858 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2017-11-16 12:00:00 1.529534 -38.131039 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
2017-11-16 18:00:00 2.243760 -38.067467 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
2017-11-17 00:00:00 2.961202 -38.009050 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
2017-12-10 00:00:00 -5.775127 -35.206581 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
2017-12-10 06:00:00 -5.775120 -35.206598 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
2017-12-10 12:00:00 -5.775119 -35.206583 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
2017-12-10 18:00:00 -5.775122 -35.206584 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
2017-12-11 00:00:00 -5.775115 -35.206590 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
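If you still want the per-track CSV that your original script produced, the merged frame can be written out the same way (dir_out reused from your code; the file name here is only a placeholder):
out = pd.concat([stations, df], axis=1).set_index("Date-Time")
out.to_csv(os.path.join(dir_out, "MovingShip_track.csv"), sep=";", encoding="utf-8")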

How to filter a dataframe given a specific daily hour?

Given the two data frames:
df1:
datetime v
2020-10-01 12:00:00 15
2020-10-02 4
2020-10-03 07:00:00 3
2020-10-03 08:01:00 51
2020-10-03 09:00:00 9
df2:
datetime p
2020-10-01 11:00:00 1
2020-10-01 12:00:00 2
2020-10-02 13:00:00 14
2020-10-02 13:01:00 5
2020-10-03 20:00:00 12
2020-10-03 02:01:00 30
2020-10-03 07:00:00 7
I want to merge these two dataframes into one; the policy is to look up the value nearest to 08:00 each day. The final result should be:
datetime v p
2020-10-01 08:00:00 15 1
2020-10-02 08:00:00 4 14
2020-10-03 08:00:00 51 7
How can I implement this?
Given the following dataframes:
import pandas as pd
df1 = pd.DataFrame(
    {
        "datetime": [
            "2020-10-01 12:00:00",
            "2020-10-02",
            "2020-10-03 07:00:00",
            "2020-10-03 08:01:00",
            "2020-10-03 09:00:00",
        ],
        "v": [15, 4, 3, 51, 9],
    }
)
df2 = pd.DataFrame(
    {
        "datetime": [
            "2020-10-01 11:00:00",
            "2020-10-01 12:00:00",
            "2020-10-02 13:00:00",
            "2020-10-02 13:01:00",
            "2020-10-03 20:00:00",
            "2020-10-03 02:01:00",
            "2020-10-03 07:00:00",
        ],
        "p": [1, 2, 14, 5, 12, 30, 7],
    }
)
You can define a helper function:
def align(df):
    # Set proper type
    df["datetime"] = pd.to_datetime(df["datetime"])
    # Slice df by day
    dfs = [
        df.copy().loc[df["datetime"].dt.date == item, :]
        for item in df["datetime"].dt.date.unique()
    ]
    # Evaluate the distance in seconds between each timestamp and 08:00:00, then keep the min
    for i, df in enumerate(dfs):
        df["target"] = pd.to_datetime(df["datetime"].dt.date.astype(str) + " 08:00:00")
        df["distance"] = (
            df["target"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
            - df["datetime"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
        ).abs()
        dfs[i] = df.loc[df["distance"].idxmin(), :]
    # Concatenate the filtered rows
    df = (
        pd.concat(dfs, axis=1)
        .T.drop(columns=["datetime", "distance"])
        .rename(columns={"target": "datetime"})
        .set_index("datetime")
    )
    return df
To apply on df1 and df2 and then merge:
df = pd.merge(
    right=align(df1), left=align(df2), how="outer", right_index=True, left_index=True
).reindex(columns=["v", "p"])
print(df)
# Output
v p
datetime
2020-10-01 08:00:00 15 1
2020-10-02 08:00:00 4 14
2020-10-03 08:00:00 51 7
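As a side note (not part of the answer above), pd.merge_asof with direction="nearest" expresses the same "nearest value around 08:00" policy more compactly; a sketch on the same df1/df2, with a made-up helper name:
def nearest_at_eight(df):
    # sort by time, then match each day's 08:00 target to the nearest row
    df = df.assign(datetime=pd.to_datetime(df["datetime"])).sort_values("datetime")
    targets = pd.DataFrame({
        "datetime": pd.to_datetime(df["datetime"].dt.date.unique().astype(str))
                    + pd.Timedelta(hours=8)
    })
    return pd.merge_asof(targets, df, on="datetime", direction="nearest")

merged = (
    nearest_at_eight(df1)
    .merge(nearest_at_eight(df2), on="datetime", how="outer")
    .set_index("datetime")
)
Note that merge_asof matches the nearest row globally rather than per day, which happens to reproduce the expected output here but could pull a neighbouring day's value if a day had no rows.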

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical one, with the index set to datetime data:
df = pd.DataFrame({
    'date': [
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02'
    ],
    'Kind': [
        'A', 'B', 'A', 'B', 'B', 'B'
    ],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
This also has the time as the index. Here, I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table, then flatten the MultiIndex columns with a list comprehension:
df = pd.DataFrame({
    'date': [
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02'
    ],
    'Kind': [
        'A', 'B', 'A', 'B', 'B', 'B'
    ],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index is not necessary here
# df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count', 'max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print(df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
Another solution uses pd.Grouper to resample by day:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
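If the zeros in the desired output matter (rather than NaN for missing Kind/day combinations), either solution can be patched afterwards; a small sketch assuming that intent:
df = df.fillna(0)
count_cols = [c for c in df.columns if c.endswith('_count')]
df[count_cols] = df[count_cols].astype(int)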

How to use if conditions in Pandas?

I am working with pandas and I have four columns:
Name Sensex_index Start_Date End_Date
AAA 0.5 20/08/2016 25/09/2016
AAA 0.8 26/08/2016 29/08/2016
AAA 0.4 30/08/2016 31/08/2016
AAA 0.9 01/09/2016 05/09/2016
AAA 0.5 12/09/2016 22/09/2016
AAA 0.3 24/09/2016 29/09/2016
ABC 0.9 01/01/2017 15/01/2017
ABC 0.5 23/01/2017 30/01/2017
ABC 0.7 02/02/2017 15/03/2017
If the Sensex index for the same Name drops from a higher value to a lower one and then rises again, the Termination_Date of all the rows up to that low point is the End_Date of the low row. For example, I am looking for the following output:
Name Sensex_index Actual_Start Termination_Date
AAA 0.5 20/08/2016 31/08/2016
AAA 0.8 20/08/2016 31/08/2016
AAA 0.4 20/08/2016 31/08/2016 [high to low; low to high,terminate]
AAA 0.9 01/09/2016 29/09/2016
AAA 0.5 01/09/2016 29/09/2016
AAA 0.3 01/09/2016 29/09/2016 [end of AAA]
ABC 0.9 01/01/2017 30/01/2017
ABC 0.5 01/01/2017 30/01/2017 [high to low; low to high,terminate]
ABC 0.7 02/02/2017 15/03/2017 [end of ABC]
# Setup
df = pd.DataFrame(data=[['AAA', 0.5, '20/08/2016', '25/09/2016'],
                        ['AAA', 0.8, '26/08/2016', '29/08/2016'],
                        ['AAA', 0.4, '30/08/2016', '31/08/2016'],
                        ['AAA', 0.9, '01/09/2016', '05/09/2016'],
                        ['AAA', 0.5, '12/09/2016', '22/09/2016'],
                        ['AAA', 0.3, '24/09/2016', '29/09/2016'],
                        ['ABC', 0.9, '01/01/2017', '15/01/2017'],
                        ['ABC', 0.5, '23/01/2017', '30/01/2017'],
                        ['ABC', 0.7, '02/02/2017', '15/03/2017']],
                  columns=['Name', 'Sensex_index', 'Start_Date', 'End_Date'])
# Find the rows where the index changes from high to low and then back to high (a local dip)
df['change'] = df.groupby('Name')['Sensex_index'].apply(
    lambda x: x.rolling(3, center=True).apply(
        lambda y: True if (y[1] < y[0] and y[1] < y[2]) else False))
# Mark the last row of each Name as a change point too
df.iloc[df.groupby('Name')['change'].tail(1).index, -1] = 1.0
# Set End_Date as Termination_Date for the change points
df['Termination_Date'] = df.apply(lambda x: x.End_Date if x.change > 0 else np.nan, axis=1)
# Set Actual_Start
df['Actual_Start'] = df.apply(lambda x: x.Start_Date if (x.name == 0
                                                         or x.Name != df.iloc[x.name - 1]['Name']
                                                         or df.iloc[x.name - 1]['change'] > 0)
                              else np.nan, axis=1)
# Back-fill the Termination_Date for the other rows
df.Termination_Date.fillna(method='bfill', inplace=True)
# Forward-fill the Actual_Start for the other rows
df.Actual_Start.fillna(method='ffill', inplace=True)
print(df)
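On recent pandas versions, positional indexing into the rolling window's Series (y[0], y[1], y[2]) no longer works; a compatibility sketch (not part of the original answer) passes raw=True so the lambda receives a plain NumPy array, and uses transform to keep the result aligned with the original index:
df['change'] = df.groupby('Name')['Sensex_index'].transform(
    lambda x: x.rolling(3, center=True).apply(
        lambda y: float(y[1] < y[0] and y[1] < y[2]), raw=True)
)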

Select a subset by a conditional expression from a pandas DataFrame, but getting an error

A sample like this:
In [39]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [40]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [42]: t['shift_one'] = t.base - t.base.shift(1)
In [43]: t['shift_two'] = t.shift_one.shift(1)
In [44]: t
Out[44]:
base shift_one shift_two
2000-01-01 -1.239924 NaN NaN
2000-01-02 1.116260 2.356184 NaN
2000-01-03 0.401853 -0.714407 2.356184
2000-01-04 -0.823275 -1.225128 -0.714407
2000-01-05 -0.562103 0.261171 -1.225128
2000-01-06 0.347143 0.909246 0.261171
...
2000-01-20 -0.062557 -0.467466 0.512293
Now, if we use t[t.shift_one > 0], it works OK, but when we use:
In [48]: t[t.shift_one > 0 and t.shift_two < 0]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-...> in <module>()
----> 1 t[t.shift_one > 0 and t.shift_two < 0]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Suppose we want to get a subset that satisfies both conditions at once. How can we do that? Thanks a lot.
You need parentheses, and use &, not and. Python's and tries to reduce each Series to a single boolean, which is ambiguous for an array; the bitwise & operator works elementwise instead, and the parentheses are needed because & binds more tightly than the comparisons do.
See the docs here:
http://pandas.pydata.org/pandas-docs/dev/indexing.html#boolean-indexing
In [3]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [4]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [5]: t['shift_one'] = t.base - t.base.shift(1)
In [6]: t['shift_two'] = t.shift_one.shift(1)
In [7]: t
Out[7]:
base shift_one shift_two
2000-01-01 -1.116040 NaN NaN
2000-01-02 1.592079 2.708118 NaN
2000-01-03 0.958642 -0.633436 2.708118
2000-01-04 0.431970 -0.526672 -0.633436
2000-01-05 1.275624 0.843654 -0.526672
2000-01-06 0.498401 -0.777223 0.843654
2000-01-07 -0.351465 -0.849865 -0.777223
2000-01-08 -0.458742 -0.107277 -0.849865
2000-01-09 -2.100404 -1.641662 -0.107277
2000-01-10 0.601658 2.702062 -1.641662
2000-01-11 -2.026495 -2.628153 2.702062
2000-01-12 0.391426 2.417921 -2.628153
2000-01-13 -1.177292 -1.568718 2.417921
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-15 0.338649 0.713192 0.802749
2000-01-16 -1.124820 -1.463469 0.713192
2000-01-17 0.484175 1.608995 -1.463469
2000-01-18 -1.477772 -1.961947 1.608995
2000-01-19 0.481843 1.959615 -1.961947
2000-01-20 0.760168 0.278325 1.959615
In [8]: t[(t.shift_one>0) & (t.shift_two<0)]
Out[8]:
base shift_one shift_two
2000-01-05 1.275624 0.843654 -0.526672
2000-01-10 0.601658 2.702062 -1.641662
2000-01-12 0.391426 2.417921 -2.628153
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-17 0.484175 1.608995 -1.463469
2000-01-19 0.481843 1.959615 -1.961947
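Equivalently, DataFrame.query accepts the English and, since it parses the expression itself:
In [9]: t.query("shift_one > 0 and shift_two < 0")
This returns the same subset as Out[8].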