Plot pandas-data from Alpha Vantage - pandas

I just installed the alpha_vantage module in addition to pandas, and have gotten the API key etc. I now want to plot some data regarding a stock.
According to the module README (see here), this is how you do it:
from alpha_vantage.timeseries import TimeSeries
import matplotlib.pyplot as plt
ts = TimeSeries(key='YOUR_API_KEY', output_format='pandas')
data, meta_data = ts.get_intraday(symbol='MSFT',interval='1min', outputsize='full')
data['4. close'].plot()
plt.title('Intraday Times Series for the MSFT stock (1 min)')
plt.show()
But when I write exactly the same in my project with my own API key, I get this error:
File "C:\Users\augbi\PycharmProjects\DjangoProsjekt\main\views.py", line 159, in <module>
data['4. close'].plot()
TypeError: tuple indices must be integers or slices, not str
I printed the data so you can see the format here (the top line is the result of data.index):
index: <built-in method index of tuple object at 0x0F7CBC28>
( 1. open 2. high 3. low 4. close 5. volume
date
2021-01-04 19:55:00 3179.94 3183.45 3179.94 3183.45 1317.0
2021-01-04 19:50:00 3179.00 3180.00 3178.26 3178.26 851.0
2021-01-04 18:52:00 3178.11 3178.11 3178.11 3178.11 648.0
2021-01-04 18:15:00 3177.00 3177.00 3177.00 3177.00 505.0
2021-01-04 18:09:00 3177.00 3177.00 3177.00 3177.00 224.0
... ... ... ... ... ...
2020-12-22 07:40:00 3212.78 3212.78 3212.78 3212.78 703.0
2020-12-22 07:34:00 3210.00 3210.00 3210.00 3210.00 755.0
2020-12-22 07:27:00 3208.19 3208.19 3208.19 3208.19 510.0
2020-12-22 07:14:00 3204.00 3204.00 3204.00 3204.00 216.0
2020-12-22 07:08:00 3204.00 3204.00 3204.00 3204.00 167.0
My own code is here:
ts2 = TimeSeries(key='ALPA_KEY', output_format='pandas')
data = ts2.get_intraday(symbol='AMZN',interval='1min', outputsize='full')
print("index:", data.index)
print(data)
data['4. close'].plot()
plt.title('Intraday Times Series for the AMZN stock (1 min)')
plt.show()
My wish is to plot the "4. close" column, and if possible the corresponding times. This data represents the stock price for Amazon over one day.
Thank you very much in advance!

I have an Alpha Vantage API key, so I used the following code to check. It did not reproduce the error for me. When I inspected the output, it was not a plain pandas DataFrame but an extended format: get_intraday returns a tuple of the DataFrame and its metadata.
data
( 1. open 2. high 3. low 4. close 5. volume
date
2021-01-04 19:55:00 3179.94 3183.45 3179.94 3183.45 1317.0
2021-01-04 19:50:00 3179.00 3180.00 3178.26 3178.26 851.0
2021-01-04 18:52:00 3178.11 3178.11 3178.11 3178.11 648.0
2021-01-04 18:15:00 3177.00 3177.00 3177.00 3177.00 505.0
2021-01-04 18:09:00 3177.00 3177.00 3177.00 3177.00 224.0
... ... ... ... ... ...
2020-12-22 07:40:00 3212.78 3212.78 3212.78 3212.78 703.0
2020-12-22 07:34:00 3210.00 3210.00 3210.00 3210.00 755.0
2020-12-22 07:27:00 3208.19 3208.19 3208.19 3208.19 510.0
2020-12-22 07:14:00 3204.00 3204.00 3204.00 3204.00 216.0
2020-12-22 07:08:00 3204.00 3204.00 3204.00 3204.00 167.0
[3440 rows x 5 columns],
{'1. Information': 'Intraday (1min) open, high, low, close prices and volume',
'2. Symbol': 'AMZN',
'3. Last Refreshed': '2021-01-04 19:55:00',
'4. Interval': '1min',
'5. Output Size': 'Full size',
'6. Time Zone': 'US/Eastern'})
From this tuple, the plain DataFrame can be obtained with data[0], so a graph can be created. The following is the code to get the graph.
data[0]
1. open 2. high 3. low 4. close 5. volume
date
2021-01-04 19:55:00 3179.94 3183.45 3179.94 3183.45 1317.0
2021-01-04 19:50:00 3179.00 3180.00 3178.26 3178.26 851.0
2021-01-04 18:52:00 3178.11 3178.11 3178.11 3178.11 648.0
2021-01-04 18:15:00 3177.00 3177.00 3177.00 3177.00 505.0
2021-01-04 18:09:00 3177.00 3177.00 3177.00 3177.00 224.0
ts2 = TimeSeries(key=api_key, output_format='pandas')
# get_intraday returns a (DataFrame, metadata) tuple
data = ts2.get_intraday(symbol='AMZN', interval='1min', outputsize='full')
print("index:", data[0].index)
print(data[0])
# select the DataFrame with data[0] before indexing the column
data[0]['4. close'].plot()
plt.title('Intraday Times Series for the AMZN stock (1 min)')
plt.show()
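Equivalently, unpacking the returned tuple into two names, as the README example at the top does, avoids the data[0] indexing entirely:

from alpha_vantage.timeseries import TimeSeries
import matplotlib.pyplot as plt

ts2 = TimeSeries(key=api_key, output_format='pandas')
# unpack the (DataFrame, metadata) tuple into two variables
data, meta_data = ts2.get_intraday(symbol='AMZN', interval='1min', outputsize='full')
data['4. close'].plot()
plt.title('Intraday Times Series for the AMZN stock (1 min)')
plt.show()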

Related

Extract a time and space variable from a moving ship from the ERA5 reanalysis

I want to extract the measured wind at a station on board a moving ship, for which I have the latitude, longitude, and time values, along with the wind value at each time step. I can extract a fixed point in space for all time steps, but I would like to extract, for example, the wind at time step x at a given longitude and latitude as the ship moves. How can I do this with the code below?
data = xr.open_dataset('C:/Users/William Jacondino/Desktop/Dados/ERA5\\ERA5_2017.nc', decode_times=False)
dir_out = 'C:/Users/William Jacondino/Desktop/MovingShip'
if not os.path.exists(dir_out):
    os.makedirs(dir_out)
print("\nReading the observation station names:\n")
stations = pd.read_csv(r"C:/Users/William Jacondino/Desktop/MovingShip/Date-TIME.csv", index_col=0, sep=';')
print(stations)
Reading the observation station names:
Latitude Longitude
Date-Time
16/11/2017 00:00 0.219547 -38.247914
16/11/2017 06:00 0.861717 -38.188858
16/11/2017 12:00 1.529534 -38.131039
16/11/2017 18:00 2.243760 -38.067467
17/11/2017 00:00 2.961202 -38.009050
... ... ...
10/12/2017 00:00 -5.775127 -35.206581
10/12/2017 06:00 -5.775120 -35.206598
10/12/2017 12:00 -5.775119 -35.206583
10/12/2017 18:00 -5.775122 -35.206584
11/12/2017 00:00 -5.775115 -35.206590
# time variable and its unit
times = data.variables['time'][:]
unit = data.time.units
# latitude (lat) and longitude (lon) variables
lon = data.variables['longitude'][:]
lat = data.variables['latitude'][:]
# 2 m temperature in Celsius
temp = data.variables['t2m'][:]-273.15
# 2 m dew point temperature in Celsius
tempdw = data.variables['d2m'][:]-273.15
# sea surface temperature (sst) in Celsius
sst = data.variables['sst'][:]-273.15
# surface sensible heat flux (sshf)
sshf = data.variables['sshf'][:]
unitsshf = data.sshf.units
# surface latent heat flux
slhf = data.variables['slhf'][:]
unitslhf = data.slhf.units
# mean sea level pressure
msl = data.variables['msl'][:]/100
unitmsl = data.msl.units
# total precipitation in mm/h
tp = data.variables['tp'][:]*1000
# zonal wind component at 100 metres
uten100 = data.variables['u100'][:]
unitu100 = data.u100.units
# meridional wind component at 100 metres
vten100 = data.variables['v100'][:]
unitv100 = data.v100.units
# zonal wind component at 10 metres
uten = data.variables['u10'][:]
unitu = data.u10.units
# meridional wind component at 10 metres
vten = data.variables['v10'][:]
unitv = data.v10.units
# computing the wind speed at 10 metres
ws = (uten**2 + vten**2)**(0.5)
# computing the wind speed at 100 metres
ws100 = (uten100**2 + vten100**2)**(0.5)
# computing the angles of U and V to obtain the wind direction at 10 metres
wdir = (180 + (np.degrees(np.arctan2(uten, vten)))) % 360
# computing the angles of U and V to obtain the wind direction at 100 metres
wdir100 = (180 + (np.degrees(np.arctan2(uten100, vten100)))) % 360
for key, value in stations.iterrows():
    #print(key, value[0], value[1], value[2])
    station = value[0]
    file_name = "{}{}".format(station+'_1991', ".csv")
    #print(file_name)
    lon_point = value[1]
    lat_point = value[2]
    ########################################
    # Finding the latitude/longitude grid point closest to the station
    # Squared difference of lat and lon
    sq_diff_lat = (lat - lat_point)**2
    sq_diff_lon = (lon - lon_point)**2
    # Identifying the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    print("Generating time series for station {}".format(station))
    ref_date = datetime.datetime(int(unit[12:16]), int(unit[17:19]), int(unit[20:22]))
    date_range = list()
    temp_data = list()
    tempdw_data = list()
    sst_data = list()
    sshf_data = list()
    slhf_data = list()
    msl_data = list()
    tp_data = list()
    uten100_data = list()
    vten100_data = list()
    uten_data = list()
    vten_data = list()
    ws_data = list()
    ws100_data = list()
    wdir_data = list()
    wdir100_data = list()
    for index, time in enumerate(times):
        date_time = ref_date + datetime.timedelta(hours=int(time))
        date_range.append(date_time)
        temp_data.append(temp[index, min_index_lat, min_index_lon].values)
        tempdw_data.append(tempdw[index, min_index_lat, min_index_lon].values)
        sst_data.append(sst[index, min_index_lat, min_index_lon].values)
        sshf_data.append(sshf[index, min_index_lat, min_index_lon].values)
        slhf_data.append(slhf[index, min_index_lat, min_index_lon].values)
        msl_data.append(msl[index, min_index_lat, min_index_lon].values)
        tp_data.append(tp[index, min_index_lat, min_index_lon].values)
        uten100_data.append(uten100[index, min_index_lat, min_index_lon].values)
        vten100_data.append(vten100[index, min_index_lat, min_index_lon].values)
        uten_data.append(uten[index, min_index_lat, min_index_lon].values)
        vten_data.append(vten[index, min_index_lat, min_index_lon].values)
        ws_data.append(ws[index, min_index_lat, min_index_lon].values)
        ws100_data.append(ws100[index, min_index_lat, min_index_lon].values)
        wdir_data.append(wdir[index, min_index_lat, min_index_lon].values)
        wdir100_data.append(wdir100[index, min_index_lat, min_index_lon].values)
    ################################################################################
    #print(date_range)
    df = pd.DataFrame(date_range, columns=["Date-Time"])
    df["Date-Time"] = date_range
    df = df.set_index(["Date-Time"])
    df["WS10 ({})".format(unitu)] = ws_data
    df["WDIR10 ({})".format(units.deg)] = wdir_data
    df["WS100 ({})".format(unitu)] = ws100_data
    df["WDIR100 ({})".format(units.deg)] = wdir100_data
    df["Chuva ({})".format(units.mm)] = tp_data
    df["MSLP ({})".format(units.hPa)] = msl_data
    df["T2M ({})".format(units.degC)] = temp_data
    df["Td2M ({})".format(units.degC)] = tempdw_data
    df["Surface Sensible Heat Flux ({})".format(unitsshf)] = sshf_data
    df["Surface Latent Heat Flux ({})".format(unitslhf)] = slhf_data
    df["U10 ({})".format(unitu)] = uten_data
    df["V10 ({})".format(unitv)] = vten_data
    df["U100 ({})".format(unitu100)] = uten100_data
    df["V100 ({})".format(unitv100)] = vten100_data
    df["TSM ({})".format(units.degC)] = sst_data
    print("The following time series is being saved as .csv files")
    df.to_csv(os.path.join(dir_out, file_name), sep=';', encoding="utf-8", index=True)
print("\n!!Successfully saved all the time series to the output directory!!\n{}".format(dir_out))
My code extracts a fixed variable at a given point in space, but I would like to extract along the ship's movement: for example, at time 11/12/2017 00:00, latitude -5.775115 and longitude -35.206590 I have one wind value, and at the next time step, at another latitude and longitude, I have another value. How can I adapt my code for this?
This is another perfect use case for xarray's advanced indexing! I feel like this part of the user guide needs a cape and a theme song :)
I'll use a made-up dataset and set of stations which are (I think) similar to yours. The first step is to reset your Date-Time index so you can use it to pull the nearest time values from the xarray.Dataset, since you want a common index for the time, lat, and lon:
In [14]: stations = stations.reset_index(drop=False)
...: stations
Out[14]:
Date-Time Latitude Longitude
0 2017-11-16 00:00:00 0.219547 -38.247914
1 2017-11-16 06:00:00 0.861717 -38.188858
2 2017-11-16 12:00:00 1.529534 -38.131039
3 2017-11-16 18:00:00 2.243760 -38.067467
4 2017-11-17 00:00:00 2.961202 -38.009050
5 2017-12-10 00:00:00 -5.775127 -35.206581
6 2017-12-10 06:00:00 -5.775120 -35.206598
7 2017-12-10 12:00:00 -5.775119 -35.206583
8 2017-12-10 18:00:00 -5.775122 -35.206584
9 2017-12-11 00:00:00 -5.775115 -35.206590
In [15]: ds
Out[15]:
<xarray.Dataset>
Dimensions: (lat: 40, lon: 40, time: 241)
Coordinates:
* lat (lat) float64 -9.75 -9.25 -8.75 -8.25 -7.75 ... 8.25 8.75 9.25 9.75
* lon (lon) float64 -44.75 -44.25 -43.75 -43.25 ... -26.25 -25.75 -25.25
* time (time) datetime64[ns] 2017-11-01 2017-11-01T06:00:00 ... 2017-12-31
Data variables:
temp (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
tempdw (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
sst (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
ws100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
wdir100 (lat, lon, time) float64 0.07366 0.3448 0.2456 ... 0.3081 0.4236
Using the advanced indexing rules, if we select from the dataset using DataArrays as indexers, the result will be reshaped to match the indexer. What this means is that we can take your stations dataframe, which has the values time, lat, and lon, and pull the nearest indices from the xarray dataset:
In [16]: ds_over_observations = ds.sel(
...: time=stations["Date-Time"].to_xarray(),
...: lat=stations["Latitude"].to_xarray(),
...: lon=stations["Longitude"].to_xarray(),
...: method="nearest",
...: )
Now, our data has the same index as your dataframe!
In [17]: ds_over_observations
Out[17]:
<xarray.Dataset>
Dimensions: (index: 10)
Coordinates:
lat (index) float64 0.25 0.75 1.75 2.25 ... -5.75 -5.75 -5.75 -5.75
lon (index) float64 -38.25 -38.25 -38.25 ... -35.25 -35.25 -35.25
time (index) datetime64[ns] 2017-11-16 ... 2017-12-11
* index (index) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
temp (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
tempdw (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
sst (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
ws100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
wdir100 (index) float64 0.1887 0.222 0.6754 0.919 ... 0.1134 0.9231 0.6095
You can dump this into pandas with .to_dataframe:
In [18]: df = ds_over_observations.to_dataframe()
In [19]: df
Out[19]:
lat lon time temp tempdw sst ws ws100 wdir wdir100
index
0 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
1 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
3 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
4 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
5 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
6 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
7 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
8 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
9 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
The index in the result is the same one as the stations data. If you'd like, you could merge in the original values using pd.concat([stations, df], axis=1).set_index("Date-Time") to get your original index back, alongside all the weather data:
In [20]: pd.concat([stations, df], axis=1).set_index("Date-Time")
Out[20]:
Latitude Longitude lat lon time temp tempdw sst ws ws100 wdir wdir100
Date-Time
2017-11-16 00:00:00 0.219547 -38.247914 0.25 -38.25 2017-11-16 00:00:00 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724 0.188724
2017-11-16 06:00:00 0.861717 -38.188858 0.75 -38.25 2017-11-16 06:00:00 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025 0.222025
2017-11-16 12:00:00 1.529534 -38.131039 1.75 -38.25 2017-11-16 12:00:00 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417 0.675417
2017-11-16 18:00:00 2.243760 -38.067467 2.25 -38.25 2017-11-16 18:00:00 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019 0.919019
2017-11-17 00:00:00 2.961202 -38.009050 2.75 -38.25 2017-11-17 00:00:00 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266 0.566266
2017-12-10 00:00:00 -5.775127 -35.206581 -5.75 -35.25 2017-12-10 00:00:00 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490 0.652490
2017-12-10 06:00:00 -5.775120 -35.206598 -5.75 -35.25 2017-12-10 06:00:00 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541 0.429541
2017-12-10 12:00:00 -5.775119 -35.206583 -5.75 -35.25 2017-12-10 12:00:00 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352 0.113352
2017-12-10 18:00:00 -5.775122 -35.206584 -5.75 -35.25 2017-12-10 18:00:00 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058 0.923058
2017-12-11 00:00:00 -5.775115 -35.206590 -5.75 -35.25 2017-12-11 00:00:00 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493 0.609493
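Putting this together for the original script, here is a minimal end-to-end sketch. It assumes the question's file layout, lets xarray decode the time axis instead of passing decode_times=False, and parses the Date-Time strings as day-first dates (paths shortened for readability):

import pandas as pd
import xarray as xr

data = xr.open_dataset("ERA5_2017.nc")  # decoded time axis
stations = pd.read_csv("Date-TIME.csv", index_col=0, sep=";")
stations.index = pd.to_datetime(stations.index, dayfirst=True)
stations = stations.reset_index(drop=False)

# One nearest-neighbour lookup per row of the ship track: time,
# latitude, and longitude all share the common "index" dimension.
track = data.sel(
    time=stations["Date-Time"].to_xarray(),
    latitude=stations["Latitude"].to_xarray(),
    longitude=stations["Longitude"].to_xarray(),
    method="nearest",
)
df = pd.concat([stations, track.to_dataframe()], axis=1).set_index("Date-Time")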

Apply customized functions in pandas groupby and panel data

I have panel data as follows:
volume VWAP open close high low n ticker date
time
2021-09-02 09:30:00 597866 110.2781 110.32 110.37 110.4900 110.041 3719.0 AMD 2021-09-02
2021-09-02 09:31:00 512287 109.9928 110.36 109.85 110.4000 109.725 3732.0 AMD 2021-09-02
2021-09-02 09:32:00 359379 109.7271 109.81 109.89 109.9600 109.510 2455.0 AMD 2021-09-02
2021-09-02 09:33:00 368225 109.5740 109.89 109.66 109.8900 109.420 2555.0 AMD 2021-09-02
2021-09-02 09:34:00 320260 109.5616 109.67 109.45 109.8299 109.390 2339.0 AMD 2021-09-02
... ... ... ... ... ... ... ... ... ...
2021-12-31 15:56:00 62680 3334.8825 3332.24 3337.60 3337.8500 3331.890 2334.0 AMZN 2021-12-31
2021-12-31 15:57:00 26046 3336.0700 3337.70 3335.72 3338.6000 3334.990 1292.0 AMZN 2021-12-31
2021-12-31 15:58:00 47989 3336.3885 3334.65 3337.23 3338.0650 3334.650 1651.0 AMZN 2021-12-31
2021-12-31 15:59:00 63865 3335.5288 3336.70 3334.72 3337.3700 3334.180 2172.0 AMZN 2021-12-31
2021-12-31 16:00:00 1974 3334.8869 3334.34 3334.34 3334.3400 3334.340 108.0 AMZN 2021-12-31
153700 rows × 9 columns
I would like to calculate a series of features engineered from the panel data. These functions are pre-written and posted on GitHub: https://github.com/twopirllc/pandas-ta/blob/main/pandas_ta/overlap/ema.py. In Dr. Jansen's example, he used
import pandas_ta as ta
import pandas as pd
df["feature"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
I was able to follow along using Google Cloud's Compute Engine under Python 3.7. However, when I use my school's cluster with Python 3.8, even with the same pandas version, it would not work. I also tried matching the Python version; unfortunately that did not work either.
df.groupby("ticker").apply(lambda x: ta.ema(x.close, 200))
output:
ticker time
AAPL 2021-09-02 09:30:00 NaN
2021-09-02 09:31:00 NaN
2021-09-02 09:32:00 NaN
2021-09-02 09:33:00 NaN
2021-09-02 09:34:00 NaN
...
TSLA 2021-12-31 15:56:00 1064.446659
2021-12-31 15:57:00 1064.358135
2021-12-31 15:58:00 1064.278452
2021-12-31 15:59:00 1064.207621
2021-12-31 16:00:00 1064.135904
Name: EMA_200, Length: 153700, dtype: float64
df["alpha_01"] = df.groupby("ticker").apply(lambda x: ta.ema(x.close))
output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10772, in _reindex_for_setitem(value, index)
10771 try:
> 10772 reindexed_value = value.reindex(index)._values
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
File ~/quant/lib/python3.8/site-packages/pandas/core/series.py:4579, in Series.reindex(self, index, **kwargs)
4571 #doc(
4572 NDFrame.reindex, # type: ignore[has-type]
4573 klass=_shared_doc_kwargs["klass"],
(...)
4577 )
4578 def reindex(self, index=None, **kwargs):
-> 4579 return super().reindex(index=index, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4809, in NDFrame.reindex(self, *args, **kwargs)
4808 # perform the reindex on the axes
-> 4809 return self._reindex_axes(
4810 axes, level, limit, tolerance, method, fill_value, copy
4811 ).__finalize__(self, method="reindex")
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4825, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4824 ax = self._get_axis(a)
-> 4825 new_index, indexer = ax.reindex(
4826 labels, level=level, limit=limit, tolerance=tolerance, method=method
4827 )
4829 axis = self._get_axis_number(a)
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:2533, in MultiIndex.reindex(self, target, method, level, limit, tolerance)
2532 try:
-> 2533 target = MultiIndex.from_tuples(target)
2534 except TypeError:
2535 # not all tuples, see test_constructor_dict_multiindex_reindex_flat
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:202, in names_compat.<locals>.new_meth(self_or_cls, *args, **kwargs)
200 kwargs["names"] = kwargs.pop("name")
--> 202 return meth(self_or_cls, *args, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:553, in MultiIndex.from_tuples(cls, tuples, sortorder, names)
551 tuples = np.asarray(tuples._values)
--> 553 arrays = list(lib.tuples_to_object_array(tuples).T)
554 elif isinstance(tuples, list):
File ~/quant/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2919, in pandas._libs.lib.tuples_to_object_array()
ValueError: cannot include dtype 'M' in a buffer
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 df_features["alpha_01"] = df.groupby("ticker").apply(lambda x: ta.ema(x.close))
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3607, in DataFrame.__setitem__(self, key, value)
3604 self._setitem_array([key], value)
3605 else:
3606 # set column
-> 3607 self._set_item(key, value)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3779, in DataFrame._set_item(self, key, value)
3769 def _set_item(self, key, value) -> None:
3770 """
3771 Add series to DataFrame in specified column.
3772
(...)
3777 ensure homogeneity.
3778 """
-> 3779 value = self._sanitize_column(value)
3781 if (
3782 key in self.columns
3783 and value.ndim == 1
3784 and not is_extension_array_dtype(value)
3785 ):
3786 # broadcast across multiple columns if necessary
3787 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:4501, in DataFrame._sanitize_column(self, value)
4499 # We should never get here with DataFrame value
4500 if isinstance(value, Series):
-> 4501 return _reindex_for_setitem(value, self.index)
4503 if is_list_like(value):
4504 com.require_length_match(value, self.index)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10779, in _reindex_for_setitem(value, index)
10775 if not value.index.is_unique:
10776 # duplicate axis
10777 raise err
> 10779 raise TypeError(
10780 "incompatible index of inserted column with frame index"
10781 ) from err
10782 return reindexed_value
TypeError: incompatible index of inserted column with frame index
df_features["alpha_01"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 df_features["alpha_01"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3607, in DataFrame.__setitem__(self, key, value)
3604 self._setitem_array([key], value)
3605 else:
3606 # set column
-> 3607 self._set_item(key, value)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3779, in DataFrame._set_item(self, key, value)
3769 def _set_item(self, key, value) -> None:
3770 """
3771 Add series to DataFrame in specified column.
3772
(...)
3777 ensure homogeneity.
3778 """
-> 3779 value = self._sanitize_column(value)
3781 if (
3782 key in self.columns
3783 and value.ndim == 1
3784 and not is_extension_array_dtype(value)
3785 ):
3786 # broadcast across multiple columns if necessary
3787 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:4501, in DataFrame._sanitize_column(self, value)
4499 # We should never get here with DataFrame value
4500 if isinstance(value, Series):
-> 4501 return _reindex_for_setitem(value, self.index)
4503 if is_list_like(value):
4504 com.require_length_match(value, self.index)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10777, in _reindex_for_setitem(value, index)
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10775 if not value.index.is_unique:
10776 # duplicate axis
> 10777 raise err
10779 raise TypeError(
10780 "incompatible index of inserted column with frame index"
10781 ) from err
10782 return reindexed_value
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10772, in _reindex_for_setitem(value, index)
10770 # GH#4107
10771 try:
> 10772 reindexed_value = value.reindex(index)._values
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10775 if not value.index.is_unique:
10776 # duplicate axis
File ~/quant/lib/python3.8/site-packages/pandas/core/series.py:4579, in Series.reindex(self, index, **kwargs)
4571 #doc(
4572 NDFrame.reindex, # type: ignore[has-type]
4573 klass=_shared_doc_kwargs["klass"],
(...)
4577 )
4578 def reindex(self, index=None, **kwargs):
-> 4579 return super().reindex(index=index, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4809, in NDFrame.reindex(self, *args, **kwargs)
4806 return self._reindex_multi(axes, copy, fill_value)
4808 # perform the reindex on the axes
-> 4809 return self._reindex_axes(
4810 axes, level, limit, tolerance, method, fill_value, copy
4811 ).__finalize__(self, method="reindex")
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4830, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4825 new_index, indexer = ax.reindex(
4826 labels, level=level, limit=limit, tolerance=tolerance, method=method
4827 )
4829 axis = self._get_axis_number(a)
-> 4830 obj = obj._reindex_with_indexers(
4831 {axis: [new_index, indexer]},
4832 fill_value=fill_value,
4833 copy=copy,
4834 allow_dups=False,
4835 )
4837 return obj
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4874, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
4871 indexer = ensure_platform_int(indexer)
4873 # TODO: speed up on homogeneous DataFrame objects
-> 4874 new_data = new_data.reindex_indexer(
4875 index,
4876 indexer,
4877 axis=baxis,
4878 fill_value=fill_value,
4879 allow_dups=allow_dups,
4880 copy=copy,
4881 )
4882 # If we've made a copy once, no need to make another one
4883 copy = False
File ~/quant/lib/python3.8/site-packages/pandas/core/internals/managers.py:663, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice)
661 # some axes don't allow reindexing with dups
662 if not allow_dups:
--> 663 self.axes[axis]._validate_can_reindex(indexer)
665 if axis >= self.ndim:
666 raise IndexError("Requested axis not found in manager")
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/base.py:3785, in Index._validate_can_reindex(self, indexer)
3783 # trying to reindex on an axis with duplicates
3784 if not self._index_as_unique and len(indexer):
-> 3785 raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
The data and the ipynb are available via this link: https://drive.google.com/drive/folders/1QnIdYnDFs8XNk7L8KFzCHC_YJPDo618t?usp=sharing
Ideal output:
df["new_col"] = df.groupby().apply() # without writing any additional helper function
As shown above, df.groupby("ticker").apply(lambda x: ta.ema(x.close, 200)) returns a Series indexed by a (ticker, time) MultiIndex. So we give the dataframe we append to an identical (ticker, time) MultiIndex:
df_features = df.reset_index().groupby([pd.Grouper(key = "ticker"), "time"]).sum()
df_features
out:
volume VWAP open close high low n
ticker time
AAPL 2021-09-02 09:30:00 1844930 154.0857 153.8700 154.4300 154.4402 153.8600 9899.0
2021-09-02 09:31:00 565141 154.2679 154.4299 154.0600 154.4600 154.0600 5132.0
2021-09-02 09:32:00 524794 154.1198 154.0600 154.2339 154.3700 153.8500 4036.0
2021-09-02 09:33:00 504479 154.3071 154.2305 154.4750 154.4800 154.1600 4171.0
2021-09-02 09:34:00 794989 154.5478 154.4800 154.4906 154.7100 154.4206 5019.0
... ... ... ... ... ... ... ... ...
TSLA 2021-12-31 15:56:00 91296 1055.9030 1055.4900 1055.9400 1056.3200 1055.3200 2360.0
2021-12-31 15:57:00 104648 1056.0563 1055.9850 1055.5500 1056.4300 1055.5500 2988.0
2021-12-31 15:58:00 149130 1055.6994 1055.5500 1056.3500 1056.8000 1054.5900 3603.0
2021-12-31 15:59:00 189018 1056.4131 1056.2900 1057.1600 1057.2400 1056.0700 4214.0
2021-12-31 16:00:00 37983 1056.3289 1057.0100 1057.0000 1057.1000 1056.0000 319.0
153700 rows × 7 columns
Then append the calculated series to this dataframe.
df_features["alpha_01"] = df.groupby("ticker").parallel_apply(lambda x: ta.ema(x.close, 200))
df_features
out:
volume VWAP open close high low n alpha_01
ticker time
AAPL 2021-09-02 09:30:00 1844930 154.0857 153.8700 154.4300 154.4402 153.8600 9899.0 NaN
2021-09-02 09:31:00 565141 154.2679 154.4299 154.0600 154.4600 154.0600 5132.0 NaN
2021-09-02 09:32:00 524794 154.1198 154.0600 154.2339 154.3700 153.8500 4036.0 NaN
2021-09-02 09:33:00 504479 154.3071 154.2305 154.4750 154.4800 154.1600 4171.0 NaN
2021-09-02 09:34:00 794989 154.5478 154.4800 154.4906 154.7100 154.4206 5019.0 NaN
... ... ... ... ... ... ... ... ... ...
TSLA 2021-12-31 15:56:00 91296 1055.9030 1055.4900 1055.9400 1056.3200 1055.3200 2360.0 1064.446659
2021-12-31 15:57:00 104648 1056.0563 1055.9850 1055.5500 1056.4300 1055.5500 2988.0 1064.358135
2021-12-31 15:58:00 149130 1055.6994 1055.5500 1056.3500 1056.8000 1054.5900 3603.0 1064.278452
2021-12-31 15:59:00 189018 1056.4131 1056.2900 1057.1600 1057.2400 1056.0700 4214.0 1064.207621
2021-12-31 16:00:00 37983 1056.3289 1057.0100 1057.0000 1057.1000 1056.0000 319.0 1064.135904
153700 rows × 8 columns
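For reference, the same pattern works with plain .apply; parallel_apply is pandarallel's drop-in replacement and requires pandarallel.initialize() beforehand. A condensed sketch, reusing the question's df and pandas-ta:

import pandas as pd
import pandas_ta as ta

# Build a target frame whose (ticker, time) MultiIndex matches the
# index of the Series that the grouped apply returns.
df_features = df.reset_index().groupby([pd.Grouper(key="ticker"), "time"]).sum()
# The assignment now aligns on the shared MultiIndex instead of raising.
df_features["alpha_01"] = df.groupby("ticker").apply(lambda x: ta.ema(x.close, 200))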

Get stock Low of Day (LOD) price for incomplete daily bar using minute bar data (multiple stocks, multiple sessions in one df) SettingWithCopyWarning

I have a dataframe of minute data for multiple Stocks, each stock has multiple sessions. See sample below
Symbol Time Open High Low Close Volume LOD
2724312 AEHR 2019-09-23 09:31:00 1.42 1.42 1.42 1.42 200 NaN
2724313 AEHR 2019-09-23 09:43:00 1.35 1.35 1.34 1.34 6062 NaN
2724314 AEHR 2019-09-23 09:58:00 1.35 1.35 1.29 1.30 8665 NaN
2724315 AEHR 2019-09-23 09:59:00 1.32 1.32 1.32 1.32 100 NaN
2724316 AEHR 2019-09-23 10:00:00 1.35 1.35 1.35 1.35 400 NaN
... ... ... ... ... ... ... ... ...
4266341 ZI 2021-09-10 15:56:00 63.08 63.16 63.08 63.15 18205 NaN
4266342 ZI 2021-09-10 15:57:00 63.14 63.14 63.07 63.07 19355 NaN
4266343 ZI 2021-09-10 15:58:00 63.07 63.12 63.07 63.10 16650 NaN
4266344 ZI 2021-09-10 15:59:00 63.09 63.12 63.06 63.11 25775 NaN
4266345 ZI 2021-09-10 16:00:00 63.11 63.17 63.11 63.17 28578 NaN
I need the low of day (LOD) for the session (9:30 am to 4:00 pm) up to the time of each row.
The completed df should look like this
Symbol Time Open High Low Close Volume LOD
2724312 AEHR 2019-09-23 09:31:00 1.42 1.42 1.42 1.42 200 1.42
2724313 AEHR 2019-09-23 09:43:00 1.35 1.35 1.34 1.34 6062 1.34
2724314 AEHR 2019-09-23 09:58:00 1.35 1.35 1.29 1.30 8665 1.29
2724315 AEHR 2019-09-23 09:59:00 1.32 1.32 1.32 1.32 100 1.29
2724316 AEHR 2019-09-23 10:00:00 1.35 1.35 1.35 1.35 400 1.29
... ... ... ... ... ... ... ... ...
4266341 ZI 2021-09-10 15:56:00 63.08 63.16 63.08 63.15 18205 63.08
4266342 ZI 2021-09-10 15:57:00 63.14 63.14 63.07 63.07 19355 63.07
4266343 ZI 2021-09-10 15:58:00 63.07 63.12 63.07 63.10 16650 63.07
4266344 ZI 2021-09-10 15:59:00 63.09 63.12 63.06 63.11 25775 63.06
4266345 ZI 2021-09-10 16:00:00 63.11 63.17 63.11 63.17 28578 63.06
My current solution
prev_symbol = "WXYZ"
prev_low = 10000000
prev_session = datetime.date(1920, 1, 1)
session_start = 1
for i, row in df.iterrows():
current_session = (df['Time'].iloc[i]).time()
current_symbol = df['Symbol'].iloc[i]
if current_symbol == prev_symbol:
if current_session == prev_session:
sesh_low = df.iloc[session_start:i, 'Low'].min()
df.at[i, 'LOD'] = sesh_low
else:
df.at[i, 'LOD'] = df.at[i, 'Low']
prev_session = current_session
session_start = i
else:
df.at[i, 'LOD'] = df.at[i, 'Low']
prev_symbol = current_symbol
prev_session = current_session
session_start = i
This raises a SettingWithCopyWarning. Please help.
You can try .groupby() + .expanding():
# if you have values already converted/sorted, skip:
# df["Time"] = pd.to_datetime(df["Time"])
# df = df.sort_values(by=["Symbol", "Time"])
df["LOD"] = df.groupby("Symbol")["Low"].expanding().min().values
print(df)
Prints:
Symbol Time Open High Low Close Volume LOD
2724312 AEHR 2019-09-23 09:31:00 1.42 1.42 1.42 1.42 200 1.42
2724313 AEHR 2019-09-23 09:43:00 1.35 1.35 1.34 1.34 6062 1.34
2724314 AEHR 2019-09-23 09:58:00 1.35 1.35 1.29 1.30 8665 1.29
2724315 AEHR 2019-09-23 09:59:00 1.32 1.32 1.32 1.32 100 1.29
2724316 AEHR 2019-09-23 10:00:00 1.35 1.35 1.35 1.35 400 1.29
4266341 ZI 2021-09-10 15:56:00 63.08 63.16 63.08 63.15 18205 63.08
4266342 ZI 2021-09-10 15:57:00 63.14 63.14 63.07 63.07 19355 63.07
4266343 ZI 2021-09-10 15:58:00 63.07 63.12 63.07 63.10 16650 63.07
4266344 ZI 2021-09-10 15:59:00 63.09 63.12 63.06 63.11 25775 63.06
4266345 ZI 2021-09-10 16:00:00 63.11 63.17 63.11 63.17 28578 63.06
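If a symbol spans several sessions, the running minimum needs to restart at each session date. A small sketch of that variant (not part of the answer above; it assumes Time is a datetime column and the rows are sorted):

import pandas as pd

df["Time"] = pd.to_datetime(df["Time"])
df = df.sort_values(["Symbol", "Time"])
# running minimum of Low within each (symbol, session date) group
df["LOD"] = df.groupby(["Symbol", df["Time"].dt.date])["Low"].cummin()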

missing observation panel data, bring forward value 20 periods

Here's code to read in a DataFrame like the one I'm looking at:
pd.DataFrame({
'period' : [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 21, 22,
23, 25, 26],
'id' : [1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285,
1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285],
'pred': [-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775],
'ret' : [ None, -0.02222222, -0.01363636, 0. , -0.02764977,
None, -0.00909091, -0.01376147, 0.00465116, None,
0.01869159, 0. , 0. , None , -0.00460829,
0.00462963, 0.02304147, 0. , None, -0.00050756]})
Which will look like this when read in.
period id pred ret
0 1 1285 -1.653477 NaN
1 2 1285 -1.653477 -0.022222
2 3 1285 -1.653477 -0.013636
3 4 1285 -1.653477 0.000000
4 5 1285 -1.653477 -0.027650
5 8 1285 -1.653477 NaN
6 9 1285 -1.653477 -0.009091
7 10 1285 -1.653477 -0.013761
8 11 1285 -1.653477 0.004651
9 13 1285 -1.653477 NaN
10 14 1285 -1.653477 0.018692
11 15 1285 -1.653477 0.000000
12 16 1285 -1.653477 0.000000
13 19 1285 -1.653477 NaN
14 20 1285 -1.653477 -0.004608
15 21 1285 -1.653477 0.004630
16 22 1285 -1.653477 0.023041
17 23 1285 -1.653477 0.000000
18 25 1285 -1.653477 NaN
19 26 1285 -1.653477 -0.000508
pred is a 20-period prediction, so consequently what I want to do is bring the returns back 20 periods (but do it in a flexible way).
Here's the lag function I have presently
def lag(df, col, lag_dist=1, ref='period', group='id'):
    df = df.copy()
    new_col = 'lag' + str(lag_dist) + '_' + col
    df[new_col] = df.groupby(group)[col].shift(lag_dist)
    # set NaN values that differ from specified
    df[new_col] = (df.groupby(group)[ref]
                   .shift(lag_dist)
                   .sub(df[ref])
                   .eq(-lag_dist)
                   .mul(1)
                   .replace(0, np.nan) * df[new_col])
    return df[new_col]
but when I run
df['fut20_ret'] = lag(df, 'ret', -20, 'period')
df.head(20)
I get
period id pred gain fee prc ret fut20_ret
0 1 1285 -1.653478 0.000000 0.87 1.000000 NaN NaN
1 2 1285 -1.653478 -0.022222 0.87 0.977778 -0.022222 NaN
2 3 1285 -1.653478 -0.035556 0.87 0.964444 -0.013636 NaN
3 4 1285 -1.653478 -0.035556 0.87 0.964444 0.000000 NaN
4 5 1285 -1.653478 -0.062222 0.87 0.937778 -0.027650 NaN
6 8 1285 -1.653478 -0.022222 0.87 0.977778 NaN NaN
7 9 1285 -1.653478 -0.031111 0.87 0.968889 -0.009091 NaN
8 10 1285 -1.653478 -0.044444 0.87 0.955556 -0.013761 NaN
9 11 1285 -1.653478 -0.040000 0.87 0.960000 0.004651 NaN
10 13 1285 -1.653478 -0.048889 0.87 0.951111 NaN NaN
11 14 1285 -1.653478 -0.031111 0.87 0.968889 0.018692 NaN
12 15 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
13 16 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
15 19 1285 -1.653478 -0.035556 0.87 0.964444 NaN NaN
16 20 1285 -1.653478 -0.040000 0.87 0.960000 -0.004608 NaN
17 21 1285 -1.653478 -0.035556 0.87 0.964444 0.004630 NaN
18 22 1285 -1.653478 -0.013333 0.87 0.986667 0.023041 NaN
19 23 1285 -1.653478 -0.013333 0.87 0.986667 0.000000 NaN
How can I modify my lag function so that it works properly? It's close but I'm struggling on the last little bit.
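Not from the thread, but one way to make the lead robust to gaps in period is to look the target period up with a merge rather than a positional shift (shift(-20) grabs the row 20 positions ahead, which rarely sits exactly 20 periods ahead when periods are missing). A hypothetical sketch; lead and the fut-prefixed column name are illustrative:

import pandas as pd

def lead(df, col, dist=20, ref='period', group='id'):
    # Return `col` at ref == current ref + dist within each group.
    lookup = df[[group, ref, col]].copy()
    lookup[ref] = lookup[ref] - dist   # row at period p now serves period p - dist
    new_col = 'fut' + str(dist) + '_' + col
    lookup = lookup.rename(columns={col: new_col})
    out = df[[group, ref]].merge(lookup, on=[group, ref], how='left')[new_col]
    out.index = df.index               # merge resets the index; restore it
    return out

df['fut20_ret'] = lead(df, 'ret', 20)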

VerifyError: Bad type on operand stack dropwizard

We upgraded the Java version to 11 in a microservice.
When we tried to run the app, we got the following message:
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
Location:
com/template/main/App.initialize(Lio/dropwizard/setup/Bootstrap;)V #153: invokespecial
Reason:
Type 'io/dropwizard/configuration/EnvironmentVariableSubstitutor' (current frame, stack[4]) is not assignable to 'org/apache/commons/text/StrSubstitutor'
Current Frame:
bci: #153
flags: { }
locals: { 'com/template/main/App', 'io/dropwizard/setup/Bootstrap', 'com/bendb/dropwizard/jooq/JooqBundle' }
stack: { 'io/dropwizard/setup/Bootstrap', uninitialized 137, uninitialized 137, 'io/dropwizard/configuration/ConfigurationSourceProvider', 'io/dropwizard/configuration/EnvironmentVariableSubstitutor' }
Bytecode:
0000000: 2a2b b700 052a b600 064d 2b2c b600 072b
0000010: bb00 0859 b700 09b6 0007 2bbb 000a 59b7
0000020: 000b b600 0c2a b800 0d12 0eb6 000f bb00
0000030: 1059 b700 11b6 0012 bb00 1359 2cb7 0014
0000040: b600 12bb 0015 59b7 0016 b600 12bb 0017
0000050: 59b7 0018 b600 12bb 0019 59b7 001a b600
0000060: 12bb 001b 59b7 001c b600 1204 bd00 1d59
0000070: 0312 1e53 b600 1fb2 0020 b600 21b5 0022
0000080: 2b2a b400 22b6 0007 2bbb 0023 592b b600
0000090: 24bb 0025 5903 b700 26b7 0027 b600 282b
00000a0: bb00 2959 2ab7 002a b600 0c2b bb00 2b59
00000b0: 2ab7 002c b600 07b1
Any idea how we can fix it?