Pandas x-axis hourly [duplicate]

This question already has answers here: Pandas timeseries plot setting x-axis major and minor ticks and labels (2 answers).
I have this small piece of pandas code:
import matplotlib.pyplot as plt

graph = auswahl[['Volumenstrom_Außen', 'Vpunkt_Gesamt', 'Zuluft_Druck_10', 'Abluft_Druck_10']]
a = graph.plot(figsize=[50, 10])
a.set(ylabel="m³/h", xlabel="Zeit", title="Volumenströme")  # , ylim=[0, 100])
a.legend(loc="upper left")
plt.show()
How can I make the x-axis show a tick for every hour?
The dataframe looks like this:
Volumenstrom_Außen Vpunkt_Gesamt Zuluft_Druck Abluft_Druck
Zeit
2018-02-15 16:49:00 1021.708443 752.699 49.328 46.811
2018-02-15 16:49:15 1021.708443 752.699 49.328 46.811
2018-02-15 16:49:30 1021.708443 752.699 49.328 46.811
2018-02-15 16:49:45 1021.708443 752.699 49.328 46.811
2018-02-15 16:50:00 1021.708443 752.699 49.328 46.811
2018-02-15 16:50:15 1021.708443 752.699 49.328 46.811
2018-02-15 16:50:30 1021.708443 752.699 49.328 46.811
2018-02-15 16:50:45 1021.708443 752.699 49.328 46.811
2018-02-15 16:51:00 1092.171094 752.699 49.328 46.811
2018-02-15 16:51:15 1092.171094 752.699 49.328 46.811

Let's take this example dataframe, whose index has minute granularity:
import pandas as pd
import random
ts_index = pd.date_range('1/1/2000', periods=1000, freq='T')  # 'T' = minute frequency
v1 = [random.random() for i in range(1000)]
v2 = [random.random() for i in range(1000)]
v3 = [random.random() for i in range(1000)]
ts_df = pd.DataFrame({'v1': v1, 'v2': v2, 'v3': v3}, index=ts_index)
ts_df.head()
v1 v2 v3
2000-01-01 00:00:00 0.593039 0.017351 0.742111
2000-01-01 00:01:00 0.563233 0.837362 0.869767
2000-01-01 00:02:00 0.453925 0.962600 0.690868
2000-01-01 00:03:00 0.757895 0.123610 0.622777
2000-01-01 00:04:00 0.759841 0.906674 0.263902
We could use pandas.DataFrame.resample to downsample this data to hourly granularity, as shown below:
hourly_mean_df = ts_df.resample('H').mean()  # you could use .sum() instead
hourly_mean_df.head()
v1 v2 v3
2000-01-01 00:00:00 0.516001 0.461119 0.467895
2000-01-01 01:00:00 0.530603 0.458208 0.550892
2000-01-01 02:00:00 0.472090 0.522278 0.508345
2000-01-01 03:00:00 0.515713 0.486906 0.541538
2000-01-01 04:00:00 0.514543 0.478097 0.489217
Now you can plot this hourly summary:
hourly_mean_df.plot()
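If you want to keep the full-resolution data and only change the tick spacing, matplotlib's date locators can place a major tick on every hour (this is what the linked duplicate covers). A minimal sketch, not from the original answer; it assumes the data is drawn through matplotlib's date machinery, so I plot with ax.plot rather than DataFrame.plot to sidestep pandas' own period-based ticks:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ts_df.index, ts_df['v1'])  # plain matplotlib plotting keeps date-based ticks
ax.xaxis.set_major_locator(mdates.HourLocator(interval=1))   # one major tick per hour
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))  # label ticks as HH:MM
plt.show()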

Related

Pandas - Add a new calculated column to a MultiIndex column dataframe

I have a DataFrame with the following structure:
import numpy as np
import pandas as pd

np.random.seed(1)
mi = pd.MultiIndex.from_product([[3, 5], ["X", "Y", "V", "T"]], names=["Node", "Parameter"])
df = pd.DataFrame(index=pd.DatetimeIndex(['2022-07-07 12:00:00', '2022-07-07 13:00:00',
                                          '2022-07-07 14:00:00', '2022-07-07 15:00:00',
                                          '2022-07-07 16:00:00'],
                                         dtype='datetime64[ns]', name='Date', freq=None),
                  columns=mi, data=np.random.rand(5, 8))
print(df)
Node 3 5
Parameter X Y V T X Y V T
Date
2022-07-07 12:00:00 0.417022 0.720324 0.000114 0.302333 0.146756 0.092339 0.186260 0.345561
2022-07-07 13:00:00 0.396767 0.538817 0.419195 0.685220 0.204452 0.878117 0.027388 0.670468
2022-07-07 14:00:00 0.417305 0.558690 0.140387 0.198101 0.800745 0.968262 0.313424 0.692323
2022-07-07 15:00:00 0.876389 0.894607 0.085044 0.039055 0.169830 0.878143 0.098347 0.421108
2022-07-07 16:00:00 0.957890 0.533165 0.691877 0.315516 0.686501 0.834626 0.018288 0.750144
I would like to add a new calculated column "Z" for each Node, computed as "X" ** 2 + "Y" ** 2.
The following achieves the desired result:
x = df.loc[:,(slice(None),"X")]
y = df.loc[:,(slice(None),"Y")]
z = (x**2).rename(columns={"X":"Z"}) + (y ** 2).rename(columns={"Y":"Z"})
result = df.join(z).sort_index(axis=1)
Is there a more straightforward way to achieve this?
For example, using df.xs to select the desired column data, e.g. df.xs("X", axis=1, level=1) ** 2 + df.xs("Y", axis=1, level=1) ** 2, how can I then assign the result back to the original dataframe?
You can use groupby.apply:
(df.groupby(level='Node', axis=1)
   .apply(lambda g: g.droplevel('Node', axis=1).eval('Z = X**2 + Y**2'))
)
Or, with xs and drop_level=False on one of the values:
(df.join((df.xs('X', axis=1, level=1, drop_level=False)**2
          + df.xs('Y', axis=1, level=1)**2
          ).rename(columns={'X': 'Z'}, level=1)
         )
   .sort_index(axis=1, level=0, sort_remaining=False)
)
Output:
Node 3 5
Parameter X Y V T Z X Y V T Z
Date
2022-07-07 12:00:00 0.417022 0.720324 0.000114 0.302333 0.692775 0.146756 0.092339 0.186260 0.345561 0.030064
2022-07-07 13:00:00 0.396767 0.538817 0.419195 0.685220 0.447748 0.204452 0.878117 0.027388 0.670468 0.812891
2022-07-07 14:00:00 0.417305 0.558690 0.140387 0.198101 0.486278 0.800745 0.968262 0.313424 0.692323 1.578722
2022-07-07 15:00:00 0.876389 0.894607 0.085044 0.039055 1.568379 0.169830 0.878143 0.098347 0.421108 0.799977
2022-07-07 16:00:00 0.957890 0.533165 0.691877 0.315516 1.201818 0.686501 0.834626 0.018288 0.750144 1.167884
One option is with DataFrame.xs:
out = df.xs('X', axis=1, level=1).pow(2).add(df.xs('Y', axis=1, level=1).pow(2))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis=1).sort_index(axis=1)
Node 3 5
Parameter T V X Y Z T V X Y Z
Date
2022-07-07 12:00:00 0.302333 0.000114 0.417022 0.720324 0.692775 0.345561 0.186260 0.146756 0.092339 0.030064
2022-07-07 13:00:00 0.685220 0.419195 0.396767 0.538817 0.447748 0.670468 0.027388 0.204452 0.878117 0.812891
2022-07-07 14:00:00 0.198101 0.140387 0.417305 0.558690 0.486278 0.692323 0.313424 0.800745 0.968262 1.578722
2022-07-07 15:00:00 0.039055 0.085044 0.876389 0.894607 1.568379 0.421108 0.098347 0.169830 0.878143 0.799977
2022-07-07 16:00:00 0.315516 0.691877 0.957890 0.533165 1.201818 0.750144 0.018288 0.686501 0.834626 1.167884
Another option is to select the relevant columns and run pow across them in one go, before grouping and concatenating:
out = (df
       .loc(axis=1)[:, ['X', 'Y']]
       .pow(2)
       .groupby(level='Node', axis=1)
       .agg(np.add.reduce, axis=1))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis=1).sort_index(axis=1)
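For completeness, a minimal variant of the same idea that builds the new column level explicitly with pd.MultiIndex.from_product; this sketch is mine, not from either answer above:
z = df.xs('X', axis=1, level='Parameter')**2 + df.xs('Y', axis=1, level='Parameter')**2
z.columns = pd.MultiIndex.from_product([z.columns, ['Z']], names=['Node', 'Parameter'])  # attach 'Z' under each Node
result = pd.concat([df, z], axis=1).sort_index(axis=1)
The full sort_index(axis=1) gives the same alphabetically sorted column layout as the previous option.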

Extract a time and space variable from a moving ship from the ERA5 reanalysis

I want to extract the measured wind at a station on board a moving ship. I have the latitude, longitude and time values, and the wind value for each time step in space. I can extract a fixed point in space for all time steps, but I would like to extract, for example, the wind at time step x for the corresponding longitude and latitude as the ship moves. How can I do this from the code below?
import datetime
import os

import numpy as np
import pandas as pd
import xarray as xr

data = xr.open_dataset('C:/Users/William Jacondino/Desktop/Dados/ERA5\\ERA5_2017.nc', decode_times=False)
dir_out = 'C:/Users/William Jacondino/Desktop/MovingShip'
if not os.path.exists(dir_out):
    os.makedirs(dir_out)
print("\nReading the observation station names:\n")
stations = pd.read_csv(r"C:/Users/William Jacondino/Desktop/MovingShip/Date-TIME.csv", index_col=0, sep=';')
print(stations)
Reading the observation station names:
Latitude Longitude
Date-Time
16/11/2017 00:00 0.219547 -38.247914
16/11/2017 06:00 0.861717 -38.188858
16/11/2017 12:00 1.529534 -38.131039
16/11/2017 18:00 2.243760 -38.067467
17/11/2017 00:00 2.961202 -38.009050
... ... ...
10/12/2017 00:00 -5.775127 -35.206581
10/12/2017 06:00 -5.775120 -35.206598
10/12/2017 12:00 -5.775119 -35.206583
10/12/2017 18:00 -5.775122 -35.206584
11/12/2017 00:00 -5.775115 -35.206590
# time variable and its unit
times = data.variables['time'][:]
unit = data.time.units
# latitude (lat) and longitude (lon) variables
lon = data.variables['longitude'][:]
lat = data.variables['latitude'][:]
# 2 m temperature in Celsius
temp = data.variables['t2m'][:] - 273.15
# 2 m dew point temperature in Celsius
tempdw = data.variables['d2m'][:] - 273.15
# sea surface temperature (sst) in Celsius
sst = data.variables['sst'][:] - 273.15
# surface sensible heat flux (sshf)
sshf = data.variables['sshf'][:]
unitsshf = data.sshf.units
# surface latent heat flux
slhf = data.variables['slhf'][:]
unitslhf = data.slhf.units
# mean sea level pressure in hPa
msl = data.variables['msl'][:] / 100
unitmsl = data.msl.units
# total precipitation in mm/h
tp = data.variables['tp'][:] * 1000
# zonal wind component at 100 m
uten100 = data.variables['u100'][:]
unitu100 = data.u100.units
# meridional wind component at 100 m
vten100 = data.variables['v100'][:]
unitv100 = data.v100.units
# zonal wind component at 10 m
uten = data.variables['u10'][:]
unitu = data.u10.units
# meridional wind component at 10 m
vten = data.variables['v10'][:]
unitv = data.v10.units
# wind speed at 10 m
ws = (uten**2 + vten**2)**0.5
# wind speed at 100 m
ws100 = (uten100**2 + vten100**2)**0.5
# wind direction at 10 m from the angles of U and V
wdir = (180 + (np.degrees(np.arctan2(uten, vten)))) % 360
# wind direction at 100 m from the angles of U and V
wdir100 = (180 + (np.degrees(np.arctan2(uten100, vten100)))) % 360
for key, value in stations.iterrows():
    # print(key, value[0], value[1], value[2])
    station = value[0]
    file_name = "{}{}".format(station + '_1991', ".csv")
    # print(file_name)
    lon_point = value[1]
    lat_point = value[2]
    ########################################
    # Finding the latitude/longitude grid point closest to the station
    # Squared difference of lat and lon
    sq_diff_lat = (lat - lat_point)**2
    sq_diff_lon = (lon - lon_point)**2
    # Identifying the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    print("Generating time series for station {}".format(station))
    ref_date = datetime.datetime(int(unit[12:16]), int(unit[17:19]), int(unit[20:22]))
    date_range = list()
    temp_data = list()
    tempdw_data = list()
    sst_data = list()
    sshf_data = list()
    slhf_data = list()
    msl_data = list()
    tp_data = list()
    uten100_data = list()
    vten100_data = list()
    uten_data = list()
    vten_data = list()
    ws_data = list()
    ws100_data = list()
    wdir_data = list()
    wdir100_data = list()
    for index, time in enumerate(times):
        date_time = ref_date + datetime.timedelta(hours=int(time))
        date_range.append(date_time)
        temp_data.append(temp[index, min_index_lat, min_index_lon].values)
        tempdw_data.append(tempdw[index, min_index_lat, min_index_lon].values)
        sst_data.append(sst[index, min_index_lat, min_index_lon].values)
        sshf_data.append(sshf[index, min_index_lat, min_index_lon].values)
        slhf_data.append(slhf[index, min_index_lat, min_index_lon].values)
        msl_data.append(msl[index, min_index_lat, min_index_lon].values)
        tp_data.append(tp[index, min_index_lat, min_index_lon].values)
        uten100_data.append(uten100[index, min_index_lat, min_index_lon].values)
        vten100_data.append(vten100[index, min_index_lat, min_index_lon].values)
        uten_data.append(uten[index, min_index_lat, min_index_lon].values)
        vten_data.append(vten[index, min_index_lat, min_index_lon].values)
        ws_data.append(ws[index, min_index_lat, min_index_lon].values)
        ws100_data.append(ws100[index, min_index_lat, min_index_lon].values)
        wdir_data.append(wdir[index, min_index_lat, min_index_lon].values)
        wdir100_data.append(wdir100[index, min_index_lat, min_index_lon].values)
    ################################################################################
    # print(date_range)
    df = pd.DataFrame(date_range, columns=["Date-Time"])
    df["Date-Time"] = date_range
    df = df.set_index(["Date-Time"])
    df["WS10 ({})".format(unitu)] = ws_data
    df["WDIR10 ({})".format(units.deg)] = wdir_data
    df["WS100 ({})".format(unitu)] = ws100_data
    df["WDIR100 ({})".format(units.deg)] = wdir100_data
    df["Chuva({})".format(units.mm)] = tp_data
    df["MSLP ({})".format(units.hPa)] = msl_data
    df["T2M ({})".format(units.degC)] = temp_data
    df["Td2M ({})".format(units.degC)] = tempdw_data
    df["Surface Sensible Heat Flux ({})".format(unitsshf)] = sshf_data
    df["Surface latent heat flux ({})".format(unitslhf)] = slhf_data
    df["U10 ({})".format(unitu)] = uten_data
    df["V10 ({})".format(unitv)] = vten_data
    df["U100 ({})".format(unitu100)] = uten100_data
    df["V100 ({})".format(unitv100)] = vten100_data
    df["TSM ({})".format(units.degC)] = sst_data
    print("The following time series is being saved as a .csv file")
    df.to_csv(os.path.join(dir_out, file_name), sep=';', encoding="utf-8", index=True)
print("\n!!Successfully saved all the time series to the output directory!!\n{}".format(dir_out))
My code extracts a fixed variable at a given point in space, but I would like to extract along the ship's track: for example, at time 11/12/2017 00:00, latitude -5.775115 and longitude -35.206590 I have one wind value, and at the next time step, for another latitude and longitude, I have another value. How can I adapt my code for this?
This is another perfect use case for xarray's advanced indexing! I feel like this part of the user guide needs a cape and a theme song :)
I'll use a made-up dataset and set of stations which are (I think) similar to yours. The first step is to reset your Date-Time index, so you can use it to pull the nearest time value from the xarray.Dataset, since you want a common index for time, lat, and lon:
In [14]: stations = stations.reset_index(drop=False)
...: stations
Out[14]:
Date-Time Latitude Longitude
0 2017-11-16 00:00:00 0.219547 -38.247914
1 2017-11-16 06:00:00 0.861717 -38.188858
2 2017-11-16 12:00:00 1.529534 -38.131039
3 2017-11-16 18:00:00 2.243760 -38.067467
4 2017-11-17 00:00:00 2.961202 -38.009050
5 2017-12-10 00:00:00 -5.775127 -35.206581
6 2017-12-10 06:00:00 -5.775120 -35.206598
7 2017-12-10 12:00:00 -5.775119 -35.206583
8 2017-12-10 18:00:00 -5.775122 -35.206584
9 2017-12-11 00:00:00 -5.775115 -35.206590
In [15]: ds
Out[15]:
<xarray.Dataset>
Dimensions: (lat: 40, lon: 40, time: 241)
Coordinates:
* lat (lat) float64 -9.75 -9.25 -8.75 -8.25 -7.75 ... 8.25 8.75 9.25 9.75
* lon (lon) float64 -44.75 -44.25 -43.75 -43.25 ... -26.25 -25.75 -25.25
* time (time) datetime64[ns] 2017-11-01 2017-11-01T06:00:00 ... 2017-12-31
Data variables:
temp (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
tempdw (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
sst (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
Using the advanced indexing rules, if we select from the dataset using DataArrays as indexers, the result will be reshaped to match the indexer. What this means is that we can take your stations dataframe, which has the values time, lat, and lon, and pull the nearest indices from the xarray dataset:
In [16]: ds_over_observations = ds.sel(
...: time=stations["Date-Time"].to_xarray(),
...: lat=stations["Latitude"].to_xarray(),
...: lon=stations["Longitude"].to_xarray(),
...: method="nearest",
...: )
Now, our data has the same index as your dataframe!
In [17]: ds_over_observations
Out[17]:
<xarray.Dataset>
Dimensions: (index: 10)
Coordinates:
lat (index) float64 0.25 0.75 1.75 2.25 ... -5.75 -5.75 -5.75 -5.75
lon (index) float64 -38.25 -38.25 -38.25 ... -35.25 -35.25 -35.25
time (index) datetime64[ns] 2017-11-16 ... 2017-12-11
* index (index) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
temp (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
tempdw (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
sst (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
You can dump this into pandas with .to_dataframe:
In [18]: df = ds_over_observations.to_dataframe()
In [19]: df
Out[19]:
lat lon time temp tempdw sst ws ws100 wdir wdir100
index
0 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
1 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
3 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
4 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
5 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
6 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
7 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
8 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
9 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
The index in the result is the same one as the stations data. If you'd like, you could merge in the original values using pd.concat([stations, df], axis=1).set_index("Date-Time") to get your original index back, alongside all the weather data:
In [20]: pd.concat([stations, df], axis=1).set_index("Date-Time")
Out[20]:
Latitude Longitude lat lon time temp tempdw sst ws ws100 wdir wdir100
Date-Time
2017-11-16 00:00:00 0.219547 -38.247914 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
2017-11-16 06:00:00 0.861717 -38.188858 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2017-11-16 12:00:00 1.529534 -38.131039 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
2017-11-16 18:00:00 2.243760 -38.067467 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
2017-11-17 00:00:00 2.961202 -38.009050 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
2017-12-10 00:00:00 -5.775127 -35.206581 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
2017-12-10 06:00:00 -5.775120 -35.206598 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
2017-12-10 12:00:00 -5.775119 -35.206583 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
2017-12-10 18:00:00 -5.775122 -35.206584 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
2017-12-11 00:00:00 -5.775115 -35.206590 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
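Mapped back onto the original question's variable names, the same selection would look roughly like this. This is a sketch under a few assumptions: the file is opened with decoded times (i.e. without decode_times=False), the coordinate names are latitude/longitude/time as in the question's code, and the Date-Time strings are day-first:
stations = stations.reset_index(drop=False)
stations["Date-Time"] = pd.to_datetime(stations["Date-Time"], dayfirst=True)  # '16/11/2017 00:00' is day-first
track = data.sel(
    time=stations["Date-Time"].to_xarray(),
    latitude=stations["Latitude"].to_xarray(),
    longitude=stations["Longitude"].to_xarray(),
    method="nearest",
)
df = track.to_dataframe()  # one row per ship position, as above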

Finding ranges from a dataframe

I have a dataframe that looks like the one below:
Date 3tier1 3tier2
2013-01-01 08:00:00+08:00 20.97946282 20.97946282
2013-01-02 08:00:00+08:00 20.74539378 20.74539378
2013-01-03 08:00:00+08:00 20.51126054 20.51126054
2013-01-04 08:00:00+08:00 20.27707322 20.27707322
2013-01-05 08:00:00+08:00 20.04284112 20.04284112
2013-01-06 08:00:00+08:00 19.80857234 19.80857234
2013-01-07 08:00:00+08:00 19.57427331 19.57427331
2013-01-08 08:00:00+08:00 19.33994822 19.33994822
2013-01-09 08:00:00+08:00 19.10559849 19.10559849
2013-01-10 08:00:00+08:00 18.87122241 18.87122241
2013-01-11 08:00:00+08:00 18.63681507 18.63681507
2013-01-12 08:00:00+08:00 18.40236877 18.40236877
2013-01-13 08:00:00+08:00 18.16787383 18.16787383
2013-01-14 08:00:00+08:00 17.93331972 17.93331972
2013-01-15 08:00:00+08:00 17.69869612 17.69869612
2013-01-16 08:00:00+08:00 17.46399372 17.46399372
2013-01-17 08:00:00+08:00 17.22920466 17.22920466
2013-01-18 08:00:00+08:00 16.9943227 16.9943227
2013-01-19 08:00:00+08:00 17.27850867 16.7593431
2013-01-20 08:00:00+08:00 17.69762778 16.52426248
2013-01-21 08:00:00+08:00 18.12537837 16.28907864
2013-01-22 08:00:00+08:00 18.56180775 16.05379043
2013-01-23 08:00:00+08:00 19.00689471 15.81839767
2013-01-24 08:00:00+08:00 19.46053468 15.58290109
2013-01-25 08:00:00+08:00 19.92252218 15.3473024
2013-01-26 08:00:00+08:00 20.3925305 15.11160423
2013-01-27 08:00:00+08:00 20.87008788 14.87581016
2013-01-28 08:00:00+08:00 21.35454987 14.63992467
2013-01-29 08:00:00+08:00 21.84506726 14.40395298
2013-01-30 08:00:00+08:00 22.34054913 14.16790086
2013-01-31 08:00:00+08:00 22.83962058 13.93177434
2013-02-01 08:00:00+08:00 23.34057473 13.69557937
2013-02-02 08:00:00+08:00 23.84131896 13.45932144
2013-02-03 08:00:00+08:00 24.33931544 13.22300514
2013-02-04 08:00:00+08:00 24.8315166 12.98663374
2013-02-05 08:00:00+08:00 25.31429677 12.7502088
2013-02-06 08:00:00+08:00 25.78338191 12.51372976
2013-02-07 08:00:00+08:00 26.23378052 12.27719367
2013-02-08 08:00:00+08:00 26.65971992 12.04059517
2013-02-09 08:00:00+08:00 27.05459343 11.80392662
2013-02-10 08:00:00+08:00 27.41092527 11.56717871
2013-02-11 08:00:00+08:00 27.72036088 11.3303412
2013-02-12 08:00:00+08:00 27.97369094 11.09340384
2013-02-13 08:00:00+08:00 28.16091685 10.85635718
2013-02-14 08:00:00+08:00 28.27136466 10.61919323
2013-02-15 08:00:00+08:00 28.29385218 10.38190579
2013-02-16 08:00:00+08:00 28.21691143 10.14449064
2013-02-17 08:00:00+08:00 28.02906576 9.906945571
2013-02-18 08:00:00+08:00 27.71915819 9.669270289
2013-02-19 08:00:00+08:00 27.27672516 9.431466436
2013-02-20 08:00:00+08:00 26.69240919 9.193537583
2013-02-21 08:00:00+08:00 25.9584032 8.955489323
2013-02-22 08:00:00+08:00 25.06891975 8.717329426
2013-02-23 08:00:00+08:00 24.02067835 8.479068052
2013-02-24 08:00:00+08:00 22.81340411 8.240718006
2013-02-25 08:00:00+08:00 21.45033241 8.002294987
2013-02-26 08:00:00+08:00 19.93872048 7.763817801
2013-02-27 08:00:00+08:00 18.29038758 7.525308512
2013-02-28 08:00:00+08:00 16.5223583 7.286792516
2013-03-01 08:00:00+08:00 14.65781009 7.048298548
2013-03-02 08:00:00+08:00 12.72782154 6.809858708
2013-03-03 08:00:00+08:00 10.77512952 6.57150857
2013-03-04 08:00:00+08:00 8.862866684 6.333287469
2013-03-05 08:00:00+08:00 7.095368405 6.095239078
2013-03-06 08:00:00+08:00 5.857412338 5.857412338
2013-03-07 08:00:00+08:00 6.062085995 5.619862847
2013-03-08 08:00:00+08:00 7.707047277 5.382654808
2013-03-09 08:00:00+08:00 9.419192265 5.145863673
2013-03-10 08:00:00+08:00 11.12489254 4.909579657
2013-03-11 08:00:00+08:00 12.78439056 4.673912321
2013-03-12 08:00:00+08:00 14.37406958 4.438996486
2013-03-13 08:00:00+08:00 15.87932086 4.204999838
2013-03-14 08:00:00+08:00 17.29126015 3.97213278
2013-03-15 08:00:00+08:00 18.60496304 3.740661371
2013-03-16 08:00:00+08:00 19.81836754 3.510924673
2013-03-17 08:00:00+08:00 20.9315104 3.283358444
2013-03-18 08:00:00+08:00 21.94595693 3.058528064
2013-03-19 08:00:00+08:00 22.86436015 2.837174881
2013-03-20 08:00:00+08:00 23.69011593 2.620282024
2013-03-21 08:00:00+08:00 24.42709384 2.409168144
2013-03-22 08:00:00+08:00 25.07942941 2.205620134
2013-03-23 08:00:00+08:00 25.65136634 2.012076744
2013-03-24 08:00:00+08:00 26.14713926 1.831868652
2013-03-25 08:00:00+08:00 26.57088882 1.669492776
2013-03-26 08:00:00+08:00 26.92660259 1.53082259
2013-03-27 08:00:00+08:00 27.21807571 1.423006398
2013-03-28 08:00:00+08:00 27.44888683 1.353644799
2013-03-29 08:00:00+08:00 27.66626757 1.328979238
2013-03-30 08:00:00+08:00 28.03215155 1.351655979
2013-03-31 08:00:00+08:00 28.34758652 1.419589908
I would like to find the range for each month for a column of my choice, and group the months when there is a change in the direction of the range. For example, 3tier1 for month 1 starts at 20, goes down to 16 and then goes back up to 22: from Jan 1 to Jan 18 it moves downward from 20 to 16, and then from Jan 19 to Feb 15 upward from 17 to 28, and so on.
Expected output:
2013-01-01 to 2013-01-18 - 20 to 16
2013-01-19 to 2013-02-15 - 17 to 28
Is there a built-in pandas function that can do this with ease? Thanks in advance for your help.
I don't know of a built-in function that does what you are looking for, but it can be put together with enough lines of code. I would use .diff() and .shift().
This is what I came up with.
import pandas as pd
import numpy as np

file = 'C:/path_to_file/data.csv'
df = pd.read_csv(file, parse_dates=['Date'])
# Now I have your dataframe loaded. ** Your procedures are below.
df['trend'] = np.where(df['3tier1'].diff() > 0, 1, -1)  # trend is increasing or decreasing
df['again'] = df['trend'].diff()  # get the difference in trend
df['again'] = df['again'].shift(periods=-1) + df['again']
df['change'] = np.where(df['again'].isin([2, -2, np.nan]), 2, 0)
# get to the desired data
dfc = df[df['change'] == 2].copy()  # .copy() avoids SettingWithCopyWarning below
dfc['to_date'] = dfc['Date'].shift(periods=-1)
dfc['to_End'] = dfc['3tier1'].shift(periods=-1)
dfc.drop(columns=['trend', 'again', 'change'], inplace=True)
# keep the rows that show the trend
dfc = dfc.iloc[::2, :]
print(dfc)
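A more compact way to get the same runs is to group on direction changes directly. The helper below is my own sketch (trend_ranges is a made-up name), assuming the dates are the index of the series; note it starts a new run at plateaus (diff of 0) as well as at sign flips:
def trend_ranges(s):
    # +1 while rising, -1 while falling; backfill so the first row joins the first run
    direction = np.sign(s.diff()).bfill()
    run_id = (direction != direction.shift()).cumsum()  # new id at every direction flip
    g = s.groupby(run_id)
    return pd.DataFrame({
        'start': g.apply(lambda r: r.index[0]),
        'end': g.apply(lambda r: r.index[-1]),
        'from': g.first(),
        'to': g.last(),
    })

# usage: trend_ranges(df.set_index('Date')['3tier1'])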

Add column to pandas pivot table based on complex condition

My pivot table looks like this:
In [285]: piv
Out[285]:
K 118.5 119.0 119.5 120.0 120.5
Expiry
2018-01-12 0.050842 0.050842 0.050842 0.050842 0.050842
2018-01-19 0.039526 0.039526 0.039526 0.039526 0.039526
2018-01-26 0.039196 0.039196 0.039196 0.039196 0.039196
2018-02-02 0.039991 0.039991 0.039991 0.039991 0.039991
2018-02-23 0.040005 0.040005 0.040005 0.040005 0.040005
2018-03-23 0.041025 0.041000 0.040872 0.040623 0.040398
and df2 looks like this:
In [290]: df2
Out[290]:
F Symbol
Expiry
2018-03-20 12:00:00 123.000000 ZN MAR 18
2018-06-20 12:00:00 122.609375 ZN JUN 18
I am looking to add piv['F'] based on the following:
piv.index.month < df2.index.month
so the result should look like this:
K F 118.5 119.0 119.5 120.0 120.5
Expiry
2018-01-19 123.000000 0.039526 0.039526 0.039526 0.039526 0.039526
2018-01-26 123.000000 0.039196 0.039196 0.039196 0.039196 0.039196
2018-02-02 123.000000 0.039991 0.039991 0.039991 0.039991 0.039991
2018-02-23 123.000000 0.040005 0.040005 0.040005 0.040005 0.040005
2018-03-23 123.609375 0.041025 0.041000 0.040872 0.040623 0.040398
Any help would be much appreciated.
reindex + backfill: reindex df2 onto piv's index with method='backfill', so each row picks up F from the next available expiry.
df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)
df['F'] = df1.reindex(df.index, method='backfill').F.values
df
Out[164]:
118.5 119.0 119.5 120.0 120.5 F
2018-01-12 0.050842 0.050842 0.050842 0.050842 0.050842 123.000000
2018-01-19 0.039526 0.039526 0.039526 0.039526 0.039526 123.000000
2018-01-26 0.039196 0.039196 0.039196 0.039196 0.039196 123.000000
2018-02-02 0.039991 0.039991 0.039991 0.039991 0.039991 123.000000
2018-02-23 0.040005 0.040005 0.040005 0.040005 0.040005 123.000000
2018-03-23 0.041025 0.041000 0.040872 0.040623 0.040398 122.609375
You want to use pd.merge_asof with direction='forward' (each row matches the nearest key at or after it), and make sure to merge on the indices.
pd.merge_asof(
    piv, df2[['F']],
    left_index=True,
    right_index=True,
    direction='forward'
)
118.5 119.0 119.5 120.0 120.5 F
Expiry
2018-01-12 0.050842 0.050842 0.050842 0.050842 0.050842 123.000000
2018-01-19 0.039526 0.039526 0.039526 0.039526 0.039526 123.000000
2018-01-26 0.039196 0.039196 0.039196 0.039196 0.039196 123.000000
2018-02-02 0.039991 0.039991 0.039991 0.039991 0.039991 123.000000
2018-02-23 0.040005 0.040005 0.040005 0.040005 0.040005 123.000000
2018-03-23 0.041025 0.041000 0.040872 0.040623 0.040398 122.609375
And if you want 'F' in front:
pd.merge_asof(
    piv, df2[['F']],
    left_index=True,
    right_index=True,
    direction='forward'
).pipe(lambda d: d[['F']].join(d.drop(columns='F')))
F 118.5 119.0 119.5 120.0 120.5
Expiry
2018-01-12 123.000000 0.050842 0.050842 0.050842 0.050842 0.050842
2018-01-19 123.000000 0.039526 0.039526 0.039526 0.039526 0.039526
2018-01-26 123.000000 0.039196 0.039196 0.039196 0.039196 0.039196
2018-02-02 123.000000 0.039991 0.039991 0.039991 0.039991 0.039991
2018-02-23 123.000000 0.040005 0.040005 0.040005 0.040005 0.040005
2018-03-23 122.609375 0.041025 0.041000 0.040872 0.040623 0.040398

Select a subset by a conditional expression from a pandas DataFrame, but I get an error

A sample like this:
In [39]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [40]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [42]: t['shift_one'] = t.base - t.base.shift(1)
In [43]: t['shift_two'] = t.shift_one.shift(1)
In [44]: t
Out[44]:
base shift_one shift_two
2000-01-01 -1.239924 NaN NaN
2000-01-02 1.116260 2.356184 NaN
2000-01-03 0.401853 -0.714407 2.356184
2000-01-04 -0.823275 -1.225128 -0.714407
2000-01-05 -0.562103 0.261171 -1.225128
2000-01-06 0.347143 0.909246 0.261171
.............
2000-01-20 -0.062557 -0.467466 0.512293
Now, if we use t[t.shift_one > 0], it works OK, but when we use:
In [48]: t[t.shift_one > 0 and t.shift_two < 0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
----> 1 t[t.shift_one > 0 and t.shift_two < 0]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Suppose we want to get a subset that satisfies both conditions; how do we do that? Thanks a lot.
You need parentheses, and use &, not and.
see docs here:
http://pandas.pydata.org/pandas-docs/dev/indexing.html#boolean-indexing
In [3]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [4]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [5]: t['shift_one'] = t.base - t.base.shift(1)
In [6]: t['shift_two'] = t.shift_one.shift(1)
In [7]: t
Out[7]:
base shift_one shift_two
2000-01-01 -1.116040 NaN NaN
2000-01-02 1.592079 2.708118 NaN
2000-01-03 0.958642 -0.633436 2.708118
2000-01-04 0.431970 -0.526672 -0.633436
2000-01-05 1.275624 0.843654 -0.526672
2000-01-06 0.498401 -0.777223 0.843654
2000-01-07 -0.351465 -0.849865 -0.777223
2000-01-08 -0.458742 -0.107277 -0.849865
2000-01-09 -2.100404 -1.641662 -0.107277
2000-01-10 0.601658 2.702062 -1.641662
2000-01-11 -2.026495 -2.628153 2.702062
2000-01-12 0.391426 2.417921 -2.628153
2000-01-13 -1.177292 -1.568718 2.417921
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-15 0.338649 0.713192 0.802749
2000-01-16 -1.124820 -1.463469 0.713192
2000-01-17 0.484175 1.608995 -1.463469
2000-01-18 -1.477772 -1.961947 1.608995
2000-01-19 0.481843 1.959615 -1.961947
2000-01-20 0.760168 0.278325 1.959615
In [8]: t[(t.shift_one>0) & (t.shift_two<0)]
Out[8]:
base shift_one shift_two
2000-01-05 1.275624 0.843654 -0.526672
2000-01-10 0.601658 2.702062 -1.641662
2000-01-12 0.391426 2.417921 -2.628153
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-17 0.484175 1.608995 -1.463469
2000-01-19 0.481843 1.959615 -1.961947
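As a side note, DataFrame.query evaluates its expression with pandas' own parser, where plain and/or are accepted, so the same selection can be written without the parentheses-and-& dance. A small sketch using the frame from above:
# equivalent boolean selection; `and` is allowed inside query strings
t.query('shift_one > 0 and shift_two < 0')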