Slicing pandas dataframe using index values - pandas

I'm trying to select the rows whose index values are congruent to 1 mod 24. How can I best do this?
This is my dataframe:
ticker date open high low close volume momo nextDayLogReturn
335582 ETH/USD 2021-11-05 00:00:00+00:00 4535.3 4539.3 4495.8 4507.1 9.938260e+06 9.094134 -9.160928
186854 BTC/USD 2021-11-05 00:00:00+00:00 61437.0 61528.0 61111.0 61170.0 1.191233e+07 10.640513 -10.825763
186853 BTC/USD 2021-11-04 23:00:00+00:00 61190.0 61541.0 61130.0 61437.0 1.395133e+07 10.645757 -10.842114
335581 ETH/USD 2021-11-04 23:00:00+00:00 4518.8 4539.4 4513.6 4535.3 1.296507e+07 9.087243 -9.139240
186852 BTC/USD 2021-11-04 22:00:00+00:00 61393.0 61426.0 61044.0 61190.0 1.360557e+07 10.639201 -10.812127
This was my attempt:
newindex = []
for i in range(0,df2.shape[0]+1):
    if(i%24 ==1):
        newindex.append(i)
df2.iloc[[newindex]]
Essentially, I need to select the rows using a boolean mask, but I'm not sure how to do it.
Many thanks
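One possible approach, as a minimal sketch: build the boolean mask directly from the index labels. The small df2 below is a hypothetical stand-in with integer labels like those in the question. If what is wanted is every 24th row by position rather than by label, a positional slice does that instead.

```python
import pandas as pd

# Hypothetical stand-in for df2: integer index labels like those in the question.
df2 = pd.DataFrame({"close": range(100)}, index=range(186800, 186900))

# Boolean mask computed on the index labels themselves.
mask = df2.index % 24 == 1
by_label = df2[mask]

# Alternative reading: rows at positions 1, 25, 49, ... regardless of label.
by_position = df2.iloc[1::24]
```

In the attempt above, df2.iloc[[newindex]] passes a nested list, which is probably what fails; df2.iloc[newindex] takes the flat list, but that selects by position, not by index value.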

Related

Pandas dataframe.resample multiple columns: max on one column, select corresponding values on another, and mean on others

I have a dataframe with several variables:
tagdata.head()
Out[128]:
Depth Temperature ... Ay Az
Time ...
2017-09-25 21:46:05 23.0 7.70 ... 0.054688 -0.691406
2017-09-25 21:46:10 24.5 6.15 ... 0.148438 -0.742188
2017-09-25 21:46:15 27.5 4.10 ... -0.078125 -0.875000
2017-09-25 21:46:20 29.0 2.55 ... 0.144531 -0.664062
2017-09-25 21:46:25 30.0 2.45 ... 0.343750 -0.886719
[5 rows x 6 columns]
I want to resample every 24H, select 1) the maximum Depth within 24H, 2) the value of temperature that corresponds to that maximum depth 3) the 24H mean for the last two columns, Ay and Az.
So far I have used the code below and it works, but I would like to make the last two lines cleaner, in one line if possible.
Thanks!
tagdata_dailydepthmax = tagdata.resample('24H').apply(lambda tagdata: tagdata.loc[tagdata.Depth.idxmax()])
tagdata_dailydepthmax.Ay = tagdata['Ay'].resample('24H').mean()
tagdata_dailydepthmax.Az = tagdata['Az'].resample('24H').mean()
You can try this. It calculates the mean for multiple columns at once:
tagdata_dailydepthmax[['Ay','Az']] = tagdata[['Ay','Az']].resample('24H').mean()
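For reference, here is a self-contained sketch of the whole pipeline on made-up data (the values and the 12-hour spacing are assumptions, chosen so each day has two rows). It uses groupby with pd.Grouper and idxmax in place of the resample/apply from the question, which is a swapped-in but equivalent way to find the row of maximum Depth per 24-hour bin.

```python
import pandas as pd

# Made-up sample: two readings per day over two days.
idx = pd.date_range("2017-09-25 06:00", periods=4, freq="12h")
tagdata = pd.DataFrame(
    {
        "Depth": [23.0, 30.0, 27.5, 24.5],
        "Temperature": [7.70, 2.45, 4.10, 6.15],
        "Ay": [0.1, 0.3, -0.1, 0.1],
        "Az": [-0.7, -0.9, -0.8, -0.6],
    },
    index=idx,
)

daily = tagdata.groupby(pd.Grouper(freq="24h"))

# Timestamp of the deepest reading within each 24-hour bin.
deepest = daily["Depth"].idxmax()

# Depth and the matching Temperature at those timestamps, relabelled by bin.
result = tagdata.loc[deepest.values, ["Depth", "Temperature"]]
result.index = deepest.index

# Daily means for the remaining columns, assigned in one line (aligned on the bin index).
result[["Ay", "Az"]] = daily[["Ay", "Az"]].mean()
```

Note that idxmax may not behave well on empty bins, so gaps longer than 24 hours would need handling first.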

Add a title to a dataframe

I originally had a dataframe df1,
Close
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
2011-06-02 12.360714 8.240000 138.490005 264.294281 2773.310059
2011-06-03 12.265714 7.970000 133.210007 261.801788 2732.780029
2011-06-06 12.072857 7.800000 126.970001 260.790802 2702.560059
2011-06-07 11.858571 7.710000 124.820000 259.774780 2701.560059
... ... ... ... ... ...
2021-05-24 127.099998 77.440002 188.960007 2361.040039 13661.169922
2021-05-25 126.900002 77.860001 192.770004 2362.870117 13657.169922
2021-05-26 126.849998 78.339996 194.880005 2380.310059 13738.000000
2021-05-27 125.279999 78.419998 194.809998 2362.679932 13736.280273
2021-05-28 124.610001 80.080002 196.270004 2356.850098 13748.740234
Due to the need for calculation, I changed the columns and created df2, which contains no Close,
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-08-25 0.760119 0.028203 0.621415 0.036067 0.993046
2011-09-23 0.648490 0.216017 0.267167 0.699657 0.562897
2011-10-21 0.442864 0.326310 0.197121 0.399332 0.048258
2011-11-18 0.333015 0.062089 0.164588 0.373293 0.015258
2011-12-19 0.101208 0.389120 0.218844 0.094759 0.116979
... ... ... ... ... ...
2021-01-12 0.437177 0.012871 0.997870 0.075802 0.137392
2021-02-10 0.064343 0.178901 0.522356 0.625447 0.320007
2021-03-11 0.135033 0.300345 0.630085 0.253857 0.466884
2021-04-09 0.358583 0.484004 0.295894 0.215424 0.454395
2021-05-07 0.124987 0.311816 0.999940 0.232552 0.281189
And now I am struggling with how to add a name to the dataframe again, say ret, because I would like to plot the histogram of each column, and would like the titles to be something like ('ret', 'AAPL')...
This may be a bit stupid and confusing, hopefully I have explained the question clearly. Thanks for any help.
You can use the pd.MultiIndex.from_product() method:
#If 'Date' is a regular column rather than the index, make it the index first
df2=df2.set_index('Date')
df2.columns=pd.MultiIndex.from_product([['ret'],df2.columns])
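As a quick check of what this produces, here is a small sketch on a hypothetical two-column df2 whose Date is already the index (so set_index is skipped). Each column label becomes a tuple such as ('ret', 'AAPL'), which can then be used as a plot title.

```python
import pandas as pd

# Hypothetical miniature of df2, with Date already set as the index.
df2 = pd.DataFrame(
    {"AAPL": [0.760119, 0.648490], "AMD": [0.028203, 0.216017]},
    index=pd.to_datetime(["2011-08-25", "2011-09-23"]),
)
df2.index.name = "Date"
df2.columns.name = "ticker"

# Add 'ret' as a new top level above the existing ticker labels.
df2.columns = pd.MultiIndex.from_product([["ret"], df2.columns], names=[None, "ticker"])

# Column labels are now tuples, usable as histogram titles.
titles = list(df2.columns)
```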

How to use aggregate on a pandas dataframe when the formula requires multiple different columns

data is in the following format.
open high low close volume vwap
timestamp
2015-02-25 11:05:00+05:30 1410.80 1410.80 1410.10 1410.10 75 1408.23
2015-02-25 11:06:00+05:30 1410.10 1410.95 1410.10 1410.95 44 1408.23
2015-02-25 11:07:00+05:30 1410.95 1410.95 1410.05 1410.05 57 1408.24
2015-02-25 11:08:00+05:30 1410.05 1411.00 1409.10 1410.00 511 1408.26
2015-02-25 11:09:00+05:30 1410.00 1410.05 1410.00 1410.05 176 1408.27
I want to convert the timeframe.
t=data.groupby(pd.Grouper(freq='30min',origin='start')).agg({"open":"first",\
"close":"last",\
"low":"min",\
"high":"max",\
"volume":"sum",\
"vwap":lambda x: round((x['vwap']*x['volume']).sum()/x['volume'].sum())
})
Of course, the vwap part is wrong. What is the right way to do it?
GroupBy.agg processes each column separately (for performance reasons), so a function that needs several columns at once has to go through GroupBy.apply instead. The result can then be joined to the aggregated columns with DataFrame.join:
g = data.groupby(pd.Grouper(freq='30min',origin='start'))
t=g.agg({"open":"first", "close":"last", "low":"min", "high":"max", "volume":"sum"})
f = lambda x: round((x['vwap']*x['volume']).sum()/x['volume'].sum())
t = t.join(g.apply(f).rename('vwap'))
print (t)
open close low high volume vwap
timestamp
2015-02-25 11:05:00+05:30 1410.8 1410.05 1409.1 1411.0 863 1408
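An alternative that avoids apply altogether, sketched here under the assumption that a temporary helper column is acceptable: precompute price times volume (the column name pv is made up for this example), sum it per group with named aggregation, and divide by the summed volume afterwards. The sample reuses the five rows from the question, with the timezone dropped for brevity.

```python
import pandas as pd

# The five rows from the question, timezone omitted for brevity.
idx = pd.date_range("2015-02-25 11:05", periods=5, freq="min")
data = pd.DataFrame(
    {
        "open": [1410.80, 1410.10, 1410.95, 1410.05, 1410.00],
        "high": [1410.80, 1410.95, 1410.95, 1411.00, 1410.05],
        "low": [1410.10, 1410.10, 1410.05, 1409.10, 1410.00],
        "close": [1410.10, 1410.95, 1410.05, 1410.00, 1410.05],
        "volume": [75, 44, 57, 511, 176],
        "vwap": [1408.23, 1408.23, 1408.24, 1408.26, 1408.27],
    },
    index=idx,
)

# 'pv' is a hypothetical helper column: per-row price * volume.
g = data.assign(pv=data["vwap"] * data["volume"]).groupby(
    pd.Grouper(freq="30min", origin="start")
)

t = g.agg(
    open=("open", "first"),
    close=("close", "last"),
    low=("low", "min"),
    high=("high", "max"),
    volume=("volume", "sum"),
    pv=("pv", "sum"),
)

# Volume-weighted average per bin, rounded as in the question, then drop the helper.
t["vwap"] = (t["pv"] / t["volume"]).round()
t = t.drop(columns="pv")
```

Because every step is a built-in aggregation, this tends to scale better than apply with a Python lambda on large frames.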

pandas nested loops out of range of list

I am starting with a list called returnlist:
len(returnlist)
9
returnlist[0]
AAPL AMZN BAC GE GM GOOG GS SNP XOM
Date
2012-01-09 60.247143 178.559998 6.27 18.860001 22.840000 309.218842 94.690002 89.053848 85.500000
2012-01-10 60.462856 179.339996 6.63 18.719999 23.240000 309.556641 98.330002 89.430771 85.720001
2012-01-11 60.364285 178.899994 6.87 18.879999 24.469999 310.957520 99.760002 88.984619 85.080002
2012-01-12 60.198570 175.929993 6.79 18.930000 24.670000 312.785645 101.209999 87.838463 84.739998
2012-01-13 59.972858 178.419998 6.61 18.840000 24.290001 310.475647 98.959999 87.792313 84.879997
I want to get the daily log returns and then use cumsum to get the accumulated returns.
weeklyreturns=[]
for i in range (1,10):
    returns=pd.DataFrame()
    for stock in returnlist[i]:
        if stock not in returnlist[i]:
            returns[stock]=np.log(returnlist[i][stock]).diff()
    weeklyreturns.append(returns)
the error that I am getting is :
----> 4 for stock in returnlist[i]:
5 if stock not in returnlist[i]:
6 returns[stock]=np.log(returnlist[i][stock]).diff()
IndexError: list index out of range
Since len(returnlist) == 9, that means the last item of returnlist is returnlist[8].
When you iterate over range(1,10), you will start at returnlist[1] and eventually get to returnlist[9], which doesn't exist.
It seems that what you actually need is to iterate over range(0,9).
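Iterating over the list itself sidesteps the index arithmetic entirely. The sketch below assumes each element of returnlist is a price DataFrame like the one shown (the sample values are made up), and it drops the `if stock not in returnlist[i]` check, which can never be true for a column taken from that same frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: nine small price frames.
returnlist = [
    pd.DataFrame({"AAPL": [60.0, 61.0, 62.0], "AMZN": [178.0, 179.0, 177.0]})
    for _ in range(9)
]

weeklyreturns = []
for prices in returnlist:
    # Log returns for all columns at once; the first row is NaN by construction.
    log_returns = np.log(prices).diff()
    # Accumulated log returns (cumsum skips the leading NaN).
    weeklyreturns.append(log_returns.cumsum())
```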

How to pull EOD stock data from yahoo finance for exactly last 20 WORKING Days using Pandas in Python 2.7

Right now what I am doing is to pull data for the last 30 days, store this in a dataframe and then pick the data for the last 20 days to use. However, if one of the days in the last 20 days is a holiday, then Yahoo shows the Volume for that day as 0 and fills the OHLC (Open, High, Low, Close, Adj Close) with the Adj Close of the previous day. In the example shown below, the data for 2016-01-26 is invalid and I don't want to retrieve this data.
So how do I pull data from Yahoo for exactly the last 20 working days?
My present code is below:
from datetime import date, datetime, timedelta
import pandas_datareader.data as web
todays_date = date.today()
n = 30
date_n_days_ago = date.today() - timedelta(days=n)
yahoo_data = web.DataReader('ACC.NS', 'yahoo', date_n_days_ago, todays_date)
yahoo_data_20_day = yahoo_data.tail(20)
IIUC, you can add a filter that keeps only the rows where column Volume is not 0:
from datetime import date, datetime, timedelta
import pandas_datareader.data as web
todays_date = date.today()
n = 30
date_n_days_ago = date.today() - timedelta(days=n)
yahoo_data = web.DataReader('ACC.NS', 'yahoo', date_n_days_ago, todays_date)
#add filter - get data, where column Volume is not 0
yahoo_data = yahoo_data[yahoo_data.Volume != 0]
yahoo_data_20_day = yahoo_data.tail(20)
print yahoo_data_20_day
Open High Low Close Volume Adj Close
Date
2016-01-20 1218.90 1229.00 1205.00 1212.25 156300 1206.32
2016-01-21 1225.00 1236.95 1211.25 1228.45 209200 1222.44
2016-01-22 1239.95 1256.65 1230.05 1241.00 123200 1234.93
2016-01-25 1250.00 1263.50 1241.05 1245.00 124500 1238.91
2016-01-27 1249.00 1250.00 1228.00 1230.35 112800 1224.33
2016-01-28 1232.40 1234.90 1208.00 1214.95 134500 1209.00
2016-01-29 1220.10 1253.50 1216.05 1240.05 254400 1233.98
2016-02-01 1245.00 1278.90 1240.30 1271.85 210900 1265.63
2016-02-02 1266.80 1283.00 1253.05 1261.35 204600 1255.18
2016-02-03 1244.00 1279.00 1241.45 1248.95 191000 1242.84
2016-02-04 1255.25 1277.40 1253.20 1270.40 205900 1264.18
2016-02-05 1267.05 1286.00 1259.05 1271.40 231300 1265.18
2016-02-08 1271.00 1309.75 1270.15 1280.60 218500 1274.33
2016-02-09 1271.00 1292.85 1270.00 1279.10 148600 1272.84
2016-02-10 1270.00 1278.25 1250.05 1265.85 256800 1259.66
2016-02-11 1250.00 1264.70 1225.50 1234.00 231500 1227.96
2016-02-12 1234.20 1242.65 1199.10 1221.05 212000 1215.07
2016-02-15 1230.00 1268.70 1228.35 1256.55 130800 1250.40
2016-02-16 1265.00 1273.10 1225.00 1227.80 144700 1221.79
2016-02-17 1222.80 1233.50 1204.00 1226.05 165000 1220.05
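The filter can also be checked without a network call. The sketch below uses a made-up frame in which 2016-01-26 is a zero-volume holiday row, and wraps the two steps in a hypothetical helper named last_n_trading_days (not part of pandas or pandas_datareader).

```python
import pandas as pd

def last_n_trading_days(frame, n):
    """Drop zero-volume rows (holiday placeholders), then keep the last n rows."""
    traded = frame[frame["Volume"] != 0]
    return traded.tail(n)

# Made-up data: six business days, one of them a holiday with Volume 0.
dates = pd.date_range("2016-01-20", periods=6, freq="B")
yahoo_data = pd.DataFrame(
    {
        "Close": [1212.25, 1228.45, 1241.00, 1245.00, 1245.00, 1230.35],
        "Volume": [156300, 209200, 123200, 124500, 0, 112800],
    },
    index=dates,
)

last_3 = last_n_trading_days(yahoo_data, 3)
```

If the 30-day window contains many holidays, fewer than 20 rows may survive the filter, so a wider window (a larger n in the download step) is a reasonable safety margin.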