How to filter a dataframe given a specific daily hour?

Given the two data frames:
df1:
datetime v
2020-10-01 12:00:00 15
2020-10-02 4
2020-10-03 07:00:00 3
2020-10-03 08:01:00 51
2020-10-03 09:00:00 9
df2:
datetime p
2020-10-01 11:00:00 1
2020-10-01 12:00:00 2
2020-10-02 13:00:00 14
2020-10-02 13:01:00 5
2020-10-03 20:00:00 12
2020-10-03 02:01:00 30
2020-10-03 07:00:00 7
I want to merge these two dataframes into one. For each day, the merged row should take the value whose timestamp is closest to 08:00. The final result should be:
datetime v p
2020-10-01 08:00:00 15 1
2020-10-02 08:00:00 4 14
2020-10-03 08:00:00 51 7
How can I implement this?

Given the following dataframes:
import pandas as pd
df1 = pd.DataFrame(
{
"datetime": [
"2020-10-01 12:00:00",
"2020-10-02",
"2020-10-03 07:00:00",
"2020-10-03 08:01:00",
"2020-10-03 09:00:00",
],
"v": [15, 4, 3, 51, 9],
}
)
df2 = pd.DataFrame(
{
"datetime": [
"2020-10-01 11:00:00",
"2020-10-01 12:00:00",
"2020-10-02 13:00:00",
"2020-10-02 13:01:00",
"2020-10-03 20:00:00",
"2020-10-03 02:01:00",
"2020-10-03 07:00:00",
],
"p": [1, 2, 14, 5, 12, 30, 7],
}
)
You can define a helper function:
def align(df):
    # Set proper type
    df["datetime"] = pd.to_datetime(df["datetime"])
    # Slice df by day
    dfs = [
        df.copy().loc[df["datetime"].dt.date == item, :]
        for item in df["datetime"].dt.date.unique()
    ]
    # Evaluate distance in seconds between given hour and 08:00:00 and filter on min
    for i, df in enumerate(dfs):
        df["target"] = pd.to_datetime(df["datetime"].dt.date.astype(str) + " 08:00:00")
        df["distance"] = (
            df["target"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
            - df["datetime"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
        ).abs()
        dfs[i] = df.loc[df["distance"].idxmin(), :]
    # Concatenate filtered dataframes
    df = (
        pd.concat(dfs, axis=1)
        .T.drop(columns=["datetime", "distance"])
        .rename(columns={"target": "datetime"})
        .set_index("datetime")
    )
    return df
To apply on df1 and df2 and then merge:
df = pd.merge(
    right=align(df1), left=align(df2), how="outer", right_index=True, left_index=True
).reindex(columns=["v", "p"])
print(df)
# Output
v p
datetime
2020-10-01 08:00:00 15 1
2020-10-02 08:00:00 4 14
2020-10-03 08:00:00 51 7
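A more compact variant of the same idea, as a sketch: compute each row's distance to 08:00 on its own day, then keep the closest row per day with groupby and idxmin. The helper name nearest_to_hour and its hour parameter are made up for illustration, and ties go to whichever row comes first.

```python
import pandas as pd

df1 = pd.DataFrame({
    "datetime": ["2020-10-01 12:00:00", "2020-10-02", "2020-10-03 07:00:00",
                 "2020-10-03 08:01:00", "2020-10-03 09:00:00"],
    "v": [15, 4, 3, 51, 9],
})
df2 = pd.DataFrame({
    "datetime": ["2020-10-01 11:00:00", "2020-10-01 12:00:00", "2020-10-02 13:00:00",
                 "2020-10-02 13:01:00", "2020-10-03 20:00:00", "2020-10-03 02:01:00",
                 "2020-10-03 07:00:00"],
    "p": [1, 2, 14, 5, 12, 30, 7],
})


def nearest_to_hour(df, value_col, hour=8):
    """Hypothetical helper: per calendar day, keep the row closest to `hour`:00."""
    # Parse element-wise so date-only strings such as "2020-10-02" are accepted
    stamps = pd.to_datetime(df["datetime"].map(pd.Timestamp))
    # Target timestamp for each row: midnight of its day plus the chosen hour
    target = stamps.dt.normalize() + pd.Timedelta(hours=hour)
    # Absolute distance to the target, in seconds
    seconds = (stamps - target).abs().dt.total_seconds()
    # Row label of the closest row for each target (first one wins on ties)
    keep = seconds.groupby(target).idxmin()
    out = df.loc[keep.to_numpy(), [value_col]]
    # Re-label the kept rows with their 08:00 target timestamps
    out.index = pd.DatetimeIndex(keep.index, name="datetime")
    return out


merged = nearest_to_hour(df1, "v").join(nearest_to_hour(df2, "p"), how="outer")
print(merged)
```

Unlike align above, this version does not modify the input dataframes in place.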

Related

Pandas - Add a new calculated column to a MultiIndex column dataframe

I have a Dataframe with the following structure:
np.random.seed(1)
mi = pd.MultiIndex.from_product([[3, 5], ["X","Y","V","T"]], names=["Node", "Parameter"])
df = pd.DataFrame(index=pd.DatetimeIndex(['2022-07-07 12:00:00', '2022-07-07 13:00:00',
'2022-07-07 14:00:00', '2022-07-07 15:00:00',
'2022-07-07 16:00:00'],
dtype='datetime64[ns]', name='Date', freq=None), columns=mi, data=np.random.rand(5,8))
print(df)
Node 3 5
Parameter X Y V T X Y V T
Date
2022-07-07 12:00:00 0.417022 0.720324 0.000114 0.302333 0.146756 0.092339 0.186260 0.345561
2022-07-07 13:00:00 0.396767 0.538817 0.419195 0.685220 0.204452 0.878117 0.027388 0.670468
2022-07-07 14:00:00 0.417305 0.558690 0.140387 0.198101 0.800745 0.968262 0.313424 0.692323
2022-07-07 15:00:00 0.876389 0.894607 0.085044 0.039055 0.169830 0.878143 0.098347 0.421108
2022-07-07 16:00:00 0.957890 0.533165 0.691877 0.315516 0.686501 0.834626 0.018288 0.750144
I would like to add a new calculated column "Z" for each Node, based on the value "X" ** 2 + "Y" ** 2.
The following achieves the desired result:
x = df.loc[:,(slice(None),"X")]
y = df.loc[:,(slice(None),"Y")]
z = (x**2).rename(columns={"X":"Z"}) + (y ** 2).rename(columns={"Y":"Z"})
result = df.join(z).sort_index(axis=1)
Is there a more straightforward way to achieve this?
For example, using df.xs to select the desired column data, e.g. df.xs("X", axis=1, level=1) ** 2 + df.xs("Y", axis=1, level=1) ** 2, how can I then assign the result to the original dataframe?

You can use groupby.apply:
(df.groupby(level='Node', axis=1)
.apply(lambda g: g.droplevel('Node', axis=1).eval('Z = X**2 + Y**2'))
)
Or, with xs and drop_level=False on one of the values:
(df.join((df.xs('X', axis=1, level=1, drop_level=False)**2
+df.xs('Y', axis=1, level=1)**2
).rename(columns={'X': 'Z'}, level=1)
)
.sort_index(axis=1, level=0, sort_remaining=False)
)
Output:
Node 3 5
Parameter X Y V T Z X Y V T Z
Date
2022-07-07 12:00:00 0.417022 0.720324 0.000114 0.302333 0.692775 0.146756 0.092339 0.186260 0.345561 0.030064
2022-07-07 13:00:00 0.396767 0.538817 0.419195 0.685220 0.447748 0.204452 0.878117 0.027388 0.670468 0.812891
2022-07-07 14:00:00 0.417305 0.558690 0.140387 0.198101 0.486278 0.800745 0.968262 0.313424 0.692323 1.578722
2022-07-07 15:00:00 0.876389 0.894607 0.085044 0.039055 1.568379 0.169830 0.878143 0.098347 0.421108 0.799977
2022-07-07 16:00:00 0.957890 0.533165 0.691877 0.315516 1.201818 0.686501 0.834626 0.018288 0.750144 1.167884

One option is with DataFrame.xs:
out = df.xs('X',axis=1,level=1).pow(2).add(df.xs('Y',axis=1,level=1).pow(2))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis = 1).sort_index(axis=1)
Node 3 5
Parameter T V X Y Z T V X Y Z
Date
2022-07-07 12:00:00 0.302333 0.000114 0.417022 0.720324 0.692775 0.345561 0.186260 0.146756 0.092339 0.030064
2022-07-07 13:00:00 0.685220 0.419195 0.396767 0.538817 0.447748 0.670468 0.027388 0.204452 0.878117 0.812891
2022-07-07 14:00:00 0.198101 0.140387 0.417305 0.558690 0.486278 0.692323 0.313424 0.800745 0.968262 1.578722
2022-07-07 15:00:00 0.039055 0.085044 0.876389 0.894607 1.568379 0.421108 0.098347 0.169830 0.878143 0.799977
2022-07-07 16:00:00 0.315516 0.691877 0.957890 0.533165 1.201818 0.750144 0.018288 0.686501 0.834626 1.167884
Another option is to select the X and Y columns, run pow across them in one go, then group by Node, sum, and concatenate:
out = (df
.loc(axis=1)[:, ['X','Y']]
.pow(2)
.groupby(level='Node', axis=1)
.agg(np.add.reduce,axis=1))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis = 1).sort_index(axis=1)
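If readability matters more than a one-liner, a plain loop that assigns each (node, "Z") column directly is a reasonable sketch. It also sidesteps groupby(..., axis=1), which recent pandas releases deprecate. The date_range call below is just a shorter way to rebuild the hourly index from the question.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
mi = pd.MultiIndex.from_product(
    [[3, 5], ["X", "Y", "V", "T"]], names=["Node", "Parameter"]
)
index = pd.date_range(
    "2022-07-07 12:00:00", periods=5, freq=pd.Timedelta(hours=1), name="Date"
)
df = pd.DataFrame(np.random.rand(5, 8), index=index, columns=mi)

# Add one "Z" column per node, computed from that node's X and Y
for node in df.columns.get_level_values("Node").unique():
    df[(node, "Z")] = df[(node, "X")] ** 2 + df[(node, "Y")] ** 2

# New columns are appended at the end; sort so each Z sits with its node
df = df.sort_index(axis=1)
print(df)
```

Note that a full sort_index(axis=1) orders the parameters alphabetically within each node (T, V, X, Y, Z), as in the second output above.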

Extract a time and space variable from a moving ship from the ERA5 reanalysis

I want to extract ERA5 wind values along the track of a moving ship. For each time step I have the ship's latitude, longitude, and time, together with the wind measured on board. I can already extract a time series at a fixed point in space for all time steps, but I would like to extract, for example, the wind at time step x at the latitude and longitude the ship occupies at that time. How can I do this starting from the code below?
data = xr.open_dataset('C:/Users/William Jacondino/Desktop/Dados/ERA5\\ERA5_2017.nc', decode_times=False)
dir_out = 'C:/Users/William Jacondino/Desktop/MovingShip'
if not os.path.exists(dir_out):
    os.makedirs(dir_out)
print("\nReading the observation station names:\n")
stations = pd.read_csv(r"C:/Users/William Jacondino/Desktop/MovingShip/Date-TIME.csv",index_col=0, sep='\;')
print(stations)
Reading the observation station names:
Latitude Longitude
Date-Time
16/11/2017 00:00 0.219547 -38.247914
16/11/2017 06:00 0.861717 -38.188858
16/11/2017 12:00 1.529534 -38.131039
16/11/2017 18:00 2.243760 -38.067467
17/11/2017 00:00 2.961202 -38.009050
... ... ...
10/12/2017 00:00 -5.775127 -35.206581
10/12/2017 06:00 -5.775120 -35.206598
10/12/2017 12:00 -5.775119 -35.206583
10/12/2017 18:00 -5.775122 -35.206584
11/12/2017 00:00 -5.775115 -35.206590
# time variable and its units
times = data.variables['time'][:]
unit = data.time.units
# latitude (lat) and longitude (lon) variables
lon = data.variables['longitude'][:]
lat = data.variables['latitude'][:]
# 2 m temperature in Celsius
temp = data.variables['t2m'][:]-273.15
# 2 m dew point temperature in Celsius
tempdw = data.variables['d2m'][:]-273.15
# sea surface temperature (sst) in Celsius
sst = data.variables['sst'][:]-273.15
# surface sensible heat flux (sshf)
sshf = data.variables['sshf'][:]
unitsshf = data.sshf.units
# surface latent heat flux (slhf)
slhf = data.variables['slhf'][:]
unitslhf = data.slhf.units
# mean sea level pressure in hPa
msl = data.variables['msl'][:]/100
unitmsl = data.msl.units
# total precipitation in mm/h
tp = data.variables['tp'][:]*1000
# zonal wind component at 100 m
uten100 = data.variables['u100'][:]
unitu100 = data.u100.units
# meridional wind component at 100 m
vten100 = data.variables['v100'][:]
unitv100 = data.v100.units
# zonal wind component at 10 m
uten = data.variables['u10'][:]
unitu = data.u10.units
# meridional wind component at 10 m
vten = data.variables['v10'][:]
unitv = data.v10.units
# wind speed at 10 m
ws = (uten**2 + vten**2)**(0.5)
# wind speed at 100 m
ws100 = (uten100**2 + vten100**2)**(0.5)
# wind direction at 10 m from the U and V components
wdir = (180 + (np.degrees(np.arctan2(uten, vten)))) % 360
# wind direction at 100 m from the U and V components
wdir100 = (180 + (np.degrees(np.arctan2(uten100, vten100)))) % 360
for key, value in stations.iterrows():
    #print(key,value[0], value[1], value[2])
    station = value[0]
    file_name = "{}{}".format(station+'_1991',".csv")
    #print(file_name)
    lon_point = value[1]
    lat_point = value[2]
    ########################################
    # Finding the latitude and longitude grid point closest to the stations
    # Squared difference of lat and lon
    sq_diff_lat = (lat - lat_point)**2
    sq_diff_lon = (lon - lon_point)**2
    # Identifying the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    print("Generating time series for station {}".format(station))
    ref_date = datetime.datetime(int(unit[12:16]),int(unit[17:19]),int(unit[20:22]))
    date_range = list()
    temp_data = list()
    tempdw_data = list()
    sst_data = list()
    sshf_data = list()
    slhf_data = list()
    msl_data = list()
    tp_data = list()
    uten100_data = list()
    vten100_data = list()
    uten_data = list()
    vten_data = list()
    ws_data = list()
    ws100_data = list()
    wdir_data = list()
    wdir100_data = list()
    for index, time in enumerate(times):
        date_time = ref_date+datetime.timedelta(hours=int(time))
        date_range.append(date_time)
        temp_data.append(temp[index, min_index_lat, min_index_lon].values)
        tempdw_data.append(tempdw[index, min_index_lat, min_index_lon].values)
        sst_data.append(sst[index, min_index_lat, min_index_lon].values)
        sshf_data.append(sshf[index, min_index_lat, min_index_lon].values)
        slhf_data.append(slhf[index, min_index_lat, min_index_lon].values)
        msl_data.append(msl[index, min_index_lat, min_index_lon].values)
        tp_data.append(tp[index, min_index_lat, min_index_lon].values)
        uten100_data.append(uten100[index, min_index_lat, min_index_lon].values)
        vten100_data.append(vten100[index, min_index_lat, min_index_lon].values)
        uten_data.append(uten[index, min_index_lat, min_index_lon].values)
        vten_data.append(vten[index, min_index_lat, min_index_lon].values)
        ws_data.append(ws[index,min_index_lat,min_index_lon].values)
        ws100_data.append(ws100[index,min_index_lat,min_index_lon].values)
        wdir_data.append(wdir[index,min_index_lat,min_index_lon].values)
        wdir100_data.append(wdir100[index,min_index_lat,min_index_lon].values)
    ################################################################################
    #print(date_range)
    df = pd.DataFrame(date_range, columns = ["Date-Time"])
    df["Date-Time"] = date_range
    df = df.set_index(["Date-Time"])
    df["WS10 ({})".format(unitu)] = ws_data
    df["WDIR10 ({})".format(units.deg)] = wdir_data
    df["WS100 ({})".format(unitu)] = ws100_data
    df["WDIR100 ({})".format(units.deg)] = wdir100_data
    df["Chuva({})".format(units.mm)] = tp_data
    df["MSLP ({})".format(units.hPa)] = msl_data
    df["T2M ({})".format(units.degC)] = temp_data
    df["Td2M ({})".format(units.degC)] = tempdw_data
    df["Surface Sensible Heat Flux ({})".format(unitsshf)] = sshf_data
    df["Surface latent heat flux ({})".format(unitslhf)] = slhf_data
    df["U10 ({})".format(unitu)] = uten_data
    df["V10 ({})".format(unitv)] = vten_data
    df["U100 ({})".format(unitu100)] = uten100_data
    df["V100 ({})".format(unitv100)] = vten100_data
    df["TSM ({})".format(units.degC)] = sst_data
    print("The following time series is being saved as .csv files")
    df.to_csv(os.path.join(dir_out,file_name), sep=';',encoding="utf-8", index=True)
print("\n! !Successfully saved all the Time Series the output Directory!!\n{}".format(dir_out))
The code above extracts the variables at a fixed point in space. I would like to extract them along the ship's track instead: for example, at 11/12/2017 00:00 the ship is at latitude -5.775115 and longitude -35.206590 and I have one wind value, and at the next time step it is at another latitude and longitude with another value. How can I adapt my code for this?

This is another perfect use case for xarray's advanced indexing! I feel like this part of the user guide needs a cape and a theme song :)
I'll use a made up dataset and set of stations which are similar (I think) to yours. First step is to reset your Date-Time index, so you can use it in pulling the nearest time value from the xarray.Dataset, since you want a common index for the time, lat, and lon:
In [14]: stations = stations.reset_index(drop=False)
...: stations
Out[14]:
Date-Time Latitude Longitude
0 2017-11-16 00:00:00 0.219547 -38.247914
1 2017-11-16 06:00:00 0.861717 -38.188858
2 2017-11-16 12:00:00 1.529534 -38.131039
3 2017-11-16 18:00:00 2.243760 -38.067467
4 2017-11-17 00:00:00 2.961202 -38.009050
5 2017-12-10 00:00:00 -5.775127 -35.206581
6 2017-12-10 06:00:00 -5.775120 -35.206598
7 2017-12-10 12:00:00 -5.775119 -35.206583
8 2017-12-10 18:00:00 -5.775122 -35.206584
9 2017-12-11 00:00:00 -5.775115 -35.206590
In [15]: ds
Out[15]:
<xarray.Dataset>
Dimensions: (lat: 40, lon: 40, time: 241)
Coordinates:
* lat (lat) float64 -9.75 -9.25 -8.75 -8.25 -7.75 ... 8.25 8.75 9.25 9.75
* lon (lon) float64 -44.75 -44.25 -43.75 -43.25 ... -26.25 -25.75 -25.25
* time (time) datetime64[ns] 2017-11-01 2017-11-01T06:00:00 ... 2017-12-31
Data variables:
temp (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
tempdw (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
sst (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
Using the advanced indexing rules, if we select from the dataset using DataArrays as indexers, the result will be reshaped to match the indexer. What this means is that we can take your stations dataframe, which has the values time, lat, and lon, and pull the nearest indices from the xarray dataset:
In [16]: ds_over_observations = ds.sel(
...: time=stations["Date-Time"].to_xarray(),
...: lat=stations["Latitude"].to_xarray(),
...: lon=stations["Longitude"].to_xarray(),
...: method="nearest",
...: )
Now, our data has the same index as your dataframe!
In [17]: ds_over_observations
Out[17]:
<xarray.Dataset>
Dimensions: (index: 10)
Coordinates:
lat (index) float64 0.25 0.75 1.75 2.25 ... -5.75 -5.75 -5.75 -5.75
lon (index) float64 -38.25 -38.25 -38.25 ... -35.25 -35.25 -35.25
time (index) datetime64[ns] 2017-11-16 ... 2017-12-11
* index (index) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
temp (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
tempdw (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
sst (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
You can dump this into pandas with .to_dataframe:
In [18]: df = ds_over_observations.to_dataframe()
In [19]: df
Out[19]:
lat lon time temp tempdw sst ws ws100 wdir wdir100
index
0 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
1 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
3 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
4 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
5 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
6 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
7 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
8 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
9 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
The index in the result is the same one as the stations data. If you'd like, you could merge in the original values using pd.concat([stations, df], axis=1).set_index("Date-Time") to get your original index back, alongside all the weather data:
In [20]: pd.concat([stations, df], axis=1).set_index("Date-Time")
Out[20]:
Latitude Longitude lat lon time temp tempdw sst ws ws100 wdir wdir100
Date-Time
2017-11-16 00:00:00 0.219547 -38.247914 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
2017-11-16 06:00:00 0.861717 -38.188858 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2017-11-16 12:00:00 1.529534 -38.131039 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
2017-11-16 18:00:00 2.243760 -38.067467 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
2017-11-17 00:00:00 2.961202 -38.009050 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
2017-12-10 00:00:00 -5.775127 -35.206581 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
2017-12-10 06:00:00 -5.775120 -35.206598 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
2017-12-10 12:00:00 -5.775119 -35.206583 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
2017-12-10 18:00:00 -5.775122 -35.206584 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
2017-12-11 00:00:00 -5.775115 -35.206590 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
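For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same pointwise selection. Everything in it is made up for illustration: the random ws values, the 0.5 degree grid, and the dimension name point. It assumes the time coordinate holds decoded datetimes (so not decode_times=False as in the question) and that the coordinates are called lat and lon; the file in the question appears to use latitude and longitude, so adjust the names accordingly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up gridded dataset standing in for ERA5 (random values, 0.5 degree grid)
rng = np.random.default_rng(0)
time = pd.date_range("2017-11-16", periods=8, freq=pd.Timedelta(hours=6))
lat = np.arange(-9.75, 10, 0.5)
lon = np.arange(-44.75, -25, 0.5)
ds = xr.Dataset(
    {"ws": (("time", "lat", "lon"), rng.random((time.size, lat.size, lon.size)))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Ship track with day-first date strings, as in the CSV shown in the question
track = pd.DataFrame({
    "Date-Time": ["16/11/2017 00:00", "16/11/2017 06:00", "16/11/2017 12:00"],
    "Latitude": [0.219547, 0.861717, 1.529534],
    "Longitude": [-38.247914, -38.188858, -38.131039],
})
# Parse with an explicit format so 16/11 is not misread as month/day
track["Date-Time"] = pd.to_datetime(track["Date-Time"], format="%d/%m/%Y %H:%M")

# Indexers that share one dimension ("point") pick one grid cell per track row
along_track = ds.sel(
    time=xr.DataArray(track["Date-Time"].to_numpy(), dims="point"),
    lat=xr.DataArray(track["Latitude"].to_numpy(), dims="point"),
    lon=xr.DataArray(track["Longitude"].to_numpy(), dims="point"),
    method="nearest",
)
result = along_track.to_dataframe()
print(result)
```

The key point is that all three indexers share the same dimension, so xarray returns one value per track row instead of the full time x lat x lon cross product.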

How do I get a time delta that is closest to 0 days?

I have the following dataframe:
gp_columns = {
'name': ['companyA', 'companyB'],
'firm_ID' : [1, 2],
'timestamp_one' : ['2016-04-01', '2017-09-01']
}
fund_columns = {
'firm_ID': [1, 1, 2, 2, 2],
'department_ID' : [10, 11, 20, 21, 22],
'timestamp_mult' : ['2015-01-01', '2016-03-01', '2016-10-01', '2017-02-01', '2018-11-01'],
'number' : [400, 500, 1000, 3000, 4000]
}
gp_df = pd.DataFrame(gp_columns)
fund_df = pd.DataFrame(fund_columns)
gp_df['timestamp_one'] = pd.to_datetime(gp_df['timestamp_one'])
fund_df['timestamp_mult'] = pd.to_datetime(fund_df['timestamp_mult'])
merged_df = gp_df.merge(fund_df)
merged_df
merged_df_v1 = merged_df.copy()
merged_df_v1['incidence_num'] = merged_df.groupby('firm_ID')['department_ID']\
.transform('cumcount')
merged_df_v1['incidence_num'] = merged_df_v1['incidence_num'] + 1
merged_df_v1['time_delta'] = merged_df_v1['timestamp_mult'] - merged_df_v1['timestamp_one']
merged_wide = pd.pivot(merged_df_v1, index = ['name','firm_ID', 'timestamp_one'], \
columns = 'incidence_num', \
values = ['department_ID', 'time_delta', 'timestamp_mult', 'number'])
merged_wide.reset_index()
that looks as follows:
My question is how to get a column holding, for each row, the time delta closest to 0. Note that the time delta can be negative or positive and I want to keep its sign, so simply taking .abs() does not work for me here.
I want a dataframe with this particular output:

You can stack (which removes NaTs) and groupby.first after sorting the rows by absolute value (with the key parameter of sort_values):
df = merged_wide.reset_index()
df['time_delta_min'] = (df['time_delta'].stack()
.sort_values(key=abs)
.groupby(level=0).first()
)
output:
name firm_ID timestamp_one department_ID \
incidence_num 1 2 3
0 companyA 1 2016-04-01 10 11 NaN
1 companyB 2 2017-09-01 20 21 22
time_delta timestamp_mult \
incidence_num 1 2 3 1 2
0 -456 days -31 days NaT 2015-01-01 2016-03-01
1 -335 days -212 days 426 days 2016-10-01 2017-02-01
number time_delta_min
incidence_num 3 1 2 3
0 NaT 400 500 NaN -31 days
1 2018-11-01 1000 3000 4000 -212 days

Use lookup with indices of absolute values by DataFrame.idxmin:
idx, cols = pd.factorize(df['time_delta'].abs().idxmin(axis=1))
df['time_delta_min'] = (df['time_delta'].reindex(cols, axis=1).to_numpy()
[np.arange(len(df)), idx])
print (df)
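If the wide pivot is not needed for anything else, the same value can be found on the long (merged) frame. This is only a sketch under that assumption: rank rows by absolute delta within each firm, then read back the signed value at the winning row. The names abs_seconds, closest_rows, and closest are illustrative.

```python
import pandas as pd

gp_df = pd.DataFrame({
    "name": ["companyA", "companyB"],
    "firm_ID": [1, 2],
    "timestamp_one": pd.to_datetime(["2016-04-01", "2017-09-01"]),
})
fund_df = pd.DataFrame({
    "firm_ID": [1, 1, 2, 2, 2],
    "department_ID": [10, 11, 20, 21, 22],
    "timestamp_mult": pd.to_datetime(
        ["2015-01-01", "2016-03-01", "2016-10-01", "2017-02-01", "2018-11-01"]
    ),
    "number": [400, 500, 1000, 3000, 4000],
})

merged = gp_df.merge(fund_df, on="firm_ID")
merged["time_delta"] = merged["timestamp_mult"] - merged["timestamp_one"]

# Compare magnitudes only, but remember which row won for each firm
abs_seconds = merged["time_delta"].abs().dt.total_seconds()
closest_rows = abs_seconds.groupby(merged["firm_ID"]).idxmin()

# Read the signed delta back from the winning rows
closest = (
    merged.loc[closest_rows.to_numpy(), ["firm_ID", "time_delta"]]
    .rename(columns={"time_delta": "time_delta_min"})
    .reset_index(drop=True)
)
result = gp_df.merge(closest, on="firm_ID")
print(result)
```

The resulting time_delta_min column can then be merged onto the wide frame by firm_ID if that layout is still required.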

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical column, with the index set to time data:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
This result also has the time as index. I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3

Use DataFrame.pivot_table, then flatten the MultiIndex columns with a list comprehension:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index is optional here, so it can be omitted
#df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
Another solution uses pd.Grouper to group by day:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
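The desired output in the question shows 0 rather than NaN where a kind has no rows on a given day. One way to get that, sketched here with the whitespace removed from the date strings, is to pass fill_value=0 to pivot_table. The variable name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2013-03-01", "2013-03-02", "2013-03-01",
             "2013-03-02", "2013-03-01", "2013-03-02"],
    "Kind": ["A", "B", "A", "B", "B", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
})
df["date"] = pd.to_datetime(df["date"])

# fill_value=0 replaces the gaps left by (date, Kind) pairs with no rows
out = df.pivot_table(
    index="date", columns="Kind", values="Values",
    aggfunc=["count", "max"], fill_value=0,
)
# Flatten ("count", "A") into "A_count", and so on
out.columns = [f"{kind}_{agg}" for agg, kind in out.columns]
print(out)
```

Keep in mind that 0 is then indistinguishable from a genuine maximum of 0, so NaN may be the safer choice if the values can be zero or negative.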

How to use if conditions in Pandas?

I am working with pandas and I have four columns:
Name Sensex_index Start_Date End_Date
AAA 0.5 20/08/2016 25/09/2016
AAA 0.8 26/08/2016 29/08/2016
AAA 0.4 30/08/2016 31/08/2016
AAA 0.9 01/09/2016 05/09/2016
AAA 0.5 12/09/2016 22/09/2016
AAA 0.3 24/09/2016 29/09/2016
ABC 0.9 01/01/2017 15/01/2017
ABC 0.5 23/01/2017 30/01/2017
ABC 0.7 02/02/2017 15/03/2017
When, for the same name, the Sensex index falls from high to low and then rises from low to high again, the period ends at the low row, and that row's End_Date becomes the Termination_Date of every row in the period. For example, I am looking for the following output:
Name Sensex_index Actual_Start Termination_Date
AAA 0.5 20/08/2016 31/08/2016
AAA 0.8 20/08/2016 31/08/2016
AAA 0.4 20/08/2016 31/08/2016 [high to low; low to high,terminate]
AAA 0.9 01/09/2016 29/09/2016
AAA 0.5 01/09/2016 29/09/2016
AAA 0.3 01/09/2016 29/09/2016 [end of AAA]
ABC 0.9 01/01/2017 30/01/2017
ABC 0.5 01/01/2017 30/01/2017 [high to low; low to high,terminate]
ABC 0.7 02/02/2017 15/03/2017 [end of ABC]

#Setup
df = pd.DataFrame(data = [['AAA', 0.5, '20/08/2016', '25/09/2016'],
['AAA', 0.8, '26/08/2016', '29/08/2016'],
['AAA', 0.4, '30/08/2016', '31/08/2016'],
['AAA', 0.9, '01/09/2016', '05/09/2016'],
['AAA', 0.5, '12/09/2016', '22/09/2016'],
['AAA', 0.3, '24/09/2016', '29/09/2016'],
['ABC', 0.9, '01/01/2017', '15/01/2017'],
['ABC', 0.5, '23/01/2017', '30/01/2017'],
['ABC', 0.7, '02/02/2017', '15/03/2017']], columns = ['Name', 'Sensex_index', 'Start_Date', 'End_Date'])
#Find the rows where price change from high to low and then to high
df['change'] = df.groupby('Name')['Sensex_index'].apply(lambda x: x.rolling(3,center=True).apply(lambda y: True if (y[1]<y[0] and y[1]<y[2]) else False))
#Find the last row for each name
df.iloc[df.groupby('Name')['change'].tail(1).index, -1] = 1.0
#Set End_Date as Termination_Date for those changing points
df['Termination_Date'] = df.apply(lambda x: x.End_Date if x.change>0 else np.nan, axis=1)
#Set Actual_Start
df['Actual_Start'] = df.apply(lambda x: x.Start_Date if (x.name==0
or x.Name!= df.iloc[x.name-1]['Name']
or df.iloc[x.name-1]['change']>0)
else np.nan, axis=1)
#back fill the Termination_Date for other rows.
df['Termination_Date'] = df['Termination_Date'].bfill()
#forward fill the Actual_Start for other rows.
df['Actual_Start'] = df['Actual_Start'].ffill()
print(df)
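The same rule can also be written without rolling or row-wise apply, which may be easier to follow and less sensitive to pandas version differences. The sketch below assumes the rows are already ordered by name and date and that Sensex_index has no missing values; the names is_trough, ends_period, and period are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["AAA", 0.5, "20/08/2016", "25/09/2016"],
     ["AAA", 0.8, "26/08/2016", "29/08/2016"],
     ["AAA", 0.4, "30/08/2016", "31/08/2016"],
     ["AAA", 0.9, "01/09/2016", "05/09/2016"],
     ["AAA", 0.5, "12/09/2016", "22/09/2016"],
     ["AAA", 0.3, "24/09/2016", "29/09/2016"],
     ["ABC", 0.9, "01/01/2017", "15/01/2017"],
     ["ABC", 0.5, "23/01/2017", "30/01/2017"],
     ["ABC", 0.7, "02/02/2017", "15/03/2017"]],
    columns=["Name", "Sensex_index", "Start_Date", "End_Date"],
)

# Previous and next index value within the same Name
by_name = df.groupby("Name")["Sensex_index"]
prev_val = by_name.shift(1)
next_val = by_name.shift(-1)

# A trough is lower than both neighbours (comparisons with NaN are False)
is_trough = (df["Sensex_index"] < prev_val) & (df["Sensex_index"] < next_val)
# The last row of each Name has no next value and also ends a period
is_last = next_val.isna()
ends_period = is_trough | is_last

# Period id: goes up by one on the row after each period-ending row
period = ends_period.shift(fill_value=False).cumsum()

df["Actual_Start"] = df.groupby(period)["Start_Date"].transform("first")
df["Termination_Date"] = df.groupby(period)["End_Date"].transform("last")
print(df[["Name", "Sensex_index", "Actual_Start", "Termination_Date"]])
```

On the sample data this reproduces the Actual_Start and Termination_Date columns requested in the question.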
