How to pull EOD stock data from Yahoo Finance for exactly the last 20 working days using Pandas in Python 2.7 - pandas

Right now I pull data for the last 30 days, store it in a dataframe, and then pick the last 20 days to use. However, if one of the last 20 days is a holiday, Yahoo shows the Volume for that day as 0 and fills the OHLC (Open, High, Low, Close, Adj Close) with the previous day's Adj Close. In the example shown below, the data for 2016-01-26 is invalid and I don't want to retrieve it.
So how do I pull data from Yahoo for exactly the last 20 working days?
My present code is below:
from datetime import date, datetime, timedelta
import pandas_datareader.data as web
todays_date = date.today()
n = 30
date_n_days_ago = date.today() - timedelta(days=n)
yahoo_data = web.DataReader('ACC.NS', 'yahoo', date_n_days_ago, todays_date)
yahoo_data_20_day = yahoo_data.tail(20)

IIUC you can add a filter to keep only rows where the Volume column is not 0:
from datetime import date, datetime, timedelta
import pandas_datareader.data as web
todays_date = date.today()
n = 30
date_n_days_ago = date.today() - timedelta(days=n)
yahoo_data = web.DataReader('ACC.NS', 'yahoo', date_n_days_ago, todays_date)
#add filter - get data, where column Volume is not 0
yahoo_data = yahoo_data[yahoo_data.Volume != 0]
yahoo_data_20_day = yahoo_data.tail(20)
print yahoo_data_20_day
Open High Low Close Volume Adj Close
Date
2016-01-20 1218.90 1229.00 1205.00 1212.25 156300 1206.32
2016-01-21 1225.00 1236.95 1211.25 1228.45 209200 1222.44
2016-01-22 1239.95 1256.65 1230.05 1241.00 123200 1234.93
2016-01-25 1250.00 1263.50 1241.05 1245.00 124500 1238.91
2016-01-27 1249.00 1250.00 1228.00 1230.35 112800 1224.33
2016-01-28 1232.40 1234.90 1208.00 1214.95 134500 1209.00
2016-01-29 1220.10 1253.50 1216.05 1240.05 254400 1233.98
2016-02-01 1245.00 1278.90 1240.30 1271.85 210900 1265.63
2016-02-02 1266.80 1283.00 1253.05 1261.35 204600 1255.18
2016-02-03 1244.00 1279.00 1241.45 1248.95 191000 1242.84
2016-02-04 1255.25 1277.40 1253.20 1270.40 205900 1264.18
2016-02-05 1267.05 1286.00 1259.05 1271.40 231300 1265.18
2016-02-08 1271.00 1309.75 1270.15 1280.60 218500 1274.33
2016-02-09 1271.00 1292.85 1270.00 1279.10 148600 1272.84
2016-02-10 1270.00 1278.25 1250.05 1265.85 256800 1259.66
2016-02-11 1250.00 1264.70 1225.50 1234.00 231500 1227.96
2016-02-12 1234.20 1242.65 1199.10 1221.05 212000 1215.07
2016-02-15 1230.00 1268.70 1228.35 1256.55 130800 1250.40
2016-02-16 1265.00 1273.10 1225.00 1227.80 144700 1221.79
2016-02-17 1222.80 1233.50 1204.00 1226.05 165000 1220.05
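Filtering out zero-volume rows can leave fewer than 20 rows if many holidays fall inside the fixed 30-day window. One way to guarantee exactly 20 working days is to widen the lookback until enough valid rows remain. A minimal sketch, assuming a hypothetical helper `last_n_trading_days` and any `fetch(start, end)` callable (for example a wrapper around the `web.DataReader` call from the question):

```python
from datetime import date, timedelta

import pandas as pd

def last_n_trading_days(fetch, n=20, start_days=30, step=10, max_days=90):
    """Widen the lookback window until at least n non-holiday rows remain.

    `fetch(start, end)` is any callable returning an OHLCV DataFrame for the
    given date range, e.g.:
        lambda s, e: web.DataReader('ACC.NS', 'yahoo', s, e)
    """
    end = date.today()
    days = start_days
    while True:
        data = fetch(end - timedelta(days=days), end)
        valid = data[data.Volume != 0]   # holiday rows have Volume == 0
        if len(valid) >= n or days >= max_days:
            return valid.tail(n)
        days += step                     # not enough valid rows: look further back
```

The helper name and the widening step are assumptions for illustration; the filtering logic is the same `Volume != 0` test as in the answer above.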

Related

Slicing pandas dataframe using index values

I'm trying to select the rows whose index values are congruent to 1 mod 24. How can I best do this?
This is my dataframe:
ticker date open high low close volume momo nextDayLogReturn
335582 ETH/USD 2021-11-05 00:00:00+00:00 4535.3 4539.3 4495.8 4507.1 9.938260e+06 9.094134 -9.160928
186854 BTC/USD 2021-11-05 00:00:00+00:00 61437.0 61528.0 61111.0 61170.0 1.191233e+07 10.640513 -10.825763
186853 BTC/USD 2021-11-04 23:00:00+00:00 61190.0 61541.0 61130.0 61437.0 1.395133e+07 10.645757 -10.842114
335581 ETH/USD 2021-11-04 23:00:00+00:00 4518.8 4539.4 4513.6 4535.3 1.296507e+07 9.087243 -9.139240
186852 BTC/USD 2021-11-04 22:00:00+00:00 61393.0 61426.0 61044.0 61190.0 1.360557e+07 10.639201 -10.812127
This was my attempt:
newindex = []
for i in range(0, df2.shape[0] + 1):
    if i % 24 == 1:
        newindex.append(i)
df2.iloc[[newindex]]
Essentially, I need to select the rows using a boolean mask, but I'm not sure how to do it.
Many thanks
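A minimal sketch of one way to do this with a positional boolean mask (assuming, as in the `iloc` attempt above, that row positions rather than index labels are meant; the toy `df2` below is hypothetical stand-in data):

```python
import numpy as np
import pandas as pd

# toy frame standing in for df2, with a non-positional index
df2 = pd.DataFrame({'close': range(50)}, index=range(1000, 1050))

# True at positions 1, 25, 49, ... i.e. row position % 24 == 1
mask = np.arange(len(df2)) % 24 == 1
selected = df2.iloc[mask]
```

`iloc` accepts a boolean array directly, so there is no need to build the list of positions in a loop.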

Pandas - Take value n month before

I am working with datetimes. Is there any way to get the value from n months before?
For example, the data look like:
dft = pd.DataFrame(
    np.random.randn(100, 1),
    columns=["A"],
    index=pd.date_range("20130101", periods=100, freq="M"),
)
dft
Then:
For every July of each year, we take the value of December of the previous year and apply it through June of the next year.
For the other months (August this year through June next year), we take the value of the previous month.
For example: the value from Jul-2000 to Jun-2001 will be the same, equal to the value of Dec-1999.
What I've been trying to do is:
dft['B'] = np.where(dft.index.month == 7,
                    dft['A'].shift(7, freq='M'),
                    dft['A'].shift(1, freq='M'))
However, the result is simply a copy of column A and I don't know why. But when I tried the single line of code:
dft['C'] = dft['A'].shift(7, freq='M')
then everything is shifted as expected. I don't know what the issue is here.
The issue is index alignment. The shift you performed acts on the index, but using numpy.where converts the Series to arrays and loses the index.
Use pandas' where or mask instead; everything remains a Series and the index is preserved:
dft['B'] = (dft['A'].shift(1, freq='M')
            .mask(dft.index.month == 7, dft['A'].shift(7, freq='M'))
            )
output:
A B
2013-01-31 -2.202668 NaN
2013-02-28 0.878792 -2.202668
2013-03-31 -0.982540 0.878792
2013-04-30 0.119029 -0.982540
2013-05-31 -0.119644 0.119029
2013-06-30 -1.038124 -0.119644
2013-07-31 0.177794 -1.038124
2013-08-31 0.206593 -2.202668 <- correct
2013-09-30 0.188426 0.206593
2013-10-31 0.764086 0.188426
... ... ...
2020-12-31 1.382249 -1.413214
2021-01-31 -0.303696 1.382249
2021-02-28 -1.622287 -0.303696
2021-03-31 -0.763898 -1.622287
2021-04-30 0.420844 -0.763898
[100 rows x 2 columns]

Pandas group by date and get count while removing duplicates

I have a data frame that looks like this:
maid date hour count
0 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 13 2
1 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 15 1
2 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13 23 14
3 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-14 0 1
4 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11 14 2
5 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13 7 1
I am trying to get a count of maids for each date in such a way that once a maid has been counted on one day, it is not counted on any subsequent day. For example, 0589b8a3-9d33-4db4-b94a-834cc8f46106 is present on both the 13th and the 14th. I want to include that maid in the count for the 13th but not the 14th, as it was already counted on the 13th.
I have written the following code and it works for small data frames:
import pandas as pd

df = pd.read_csv('/home/ubuntu/uniqueSiteId.csv')
umaids = []
tdf = []
df['date'] = pd.to_datetime(df.date)
df = df.sort_values('date')
df = df[['maid', 'date']]
df = df.drop_duplicates(['maid', 'date'])
dts = df['date'].unique()
for dt in dts:
    if not umaids:
        df1 = df[df['date'] == dt]
        k = df1['maid'].unique()
        umaids.extend(k)
        dff = df1
        fdf = df1.values.tolist()
    elif umaids:
        dfs = df[df['date'] == dt]
        df2 = dfs[~dfs['maid'].isin(umaids)]
        umaids.extend(df2['maid'].unique())
        sdf = df2.values.tolist()
        tdf.append(sdf)
ftdf = [item for t in tdf for item in t]
ndf = fdf + ftdf
ndf = pd.DataFrame(ndf, columns=['maid', 'date'])
print(ndf)
Since I have thousands of data frames, and they are often more than a million rows each, the above takes a long time to run. Is there a better way to do this?
The expected output is this:
maid date
0 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11
1 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13
2 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13
3 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14
As per discussion in the comments, the solution is quite simple: sort the dataframe by date and then drop duplicates only by maid. This keeps the first occurrence of each maid, which also happens to be the first occurrence in time since we sorted by date. Then do the groupby as usual.
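A sketch of that approach on hypothetical data shaped like the question's frame (only the `maid` and `date` columns are needed):

```python
import pandas as pd

df = pd.DataFrame({
    'maid': ['a', 'a', 'b', 'b', 'c', 'd'],
    'date': ['2021-08-14', '2021-08-14', '2021-08-13',
             '2021-08-14', '2021-08-11', '2021-08-13'],
})
df['date'] = pd.to_datetime(df['date'])

# keep each maid's first (earliest) date only
first_seen = (df.sort_values('date')
                .drop_duplicates('maid')[['maid', 'date']])

# count of newly seen maids per date
counts = first_seen.groupby('date')['maid'].count()
```

This is vectorized, so it should scale to millions of rows far better than the per-date loop in the question.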

pandas nested loops out of range of list

I am starting with a list called returnlist:
len(returnlist)
9
returnlist[0]
AAPL AMZN BAC GE GM GOOG GS SNP XOM
Date
2012-01-09 60.247143 178.559998 6.27 18.860001 22.840000 309.218842 94.690002 89.053848 85.500000
2012-01-10 60.462856 179.339996 6.63 18.719999 23.240000 309.556641 98.330002 89.430771 85.720001
2012-01-11 60.364285 178.899994 6.87 18.879999 24.469999 310.957520 99.760002 88.984619 85.080002
2012-01-12 60.198570 175.929993 6.79 18.930000 24.670000 312.785645 101.209999 87.838463 84.739998
2012-01-13 59.972858 178.419998 6.61 18.840000 24.290001 310.475647 98.959999 87.792313 84.879997
I want to get the daily log returns and then use cumsum to get the accumulated returns.
weeklyreturns = []
for i in range(1, 10):
    returns = pd.DataFrame()
    for stock in returnlist[i]:
        if stock not in returnlist[i]:
            returns[stock] = np.log(returnlist[i][stock]).diff()
    weeklyreturns.append(returns)
The error that I am getting is:
----> 4 for stock in returnlist[i]:
5 if stock not in returnlist[i]:
6 returns[stock]=np.log(returnlist[i][stock]).diff()
IndexError: list index out of range
Since len(returnlist) == 9, that means the last item of returnlist is returnlist[8].
When you iterate over range(1,10), you will start at returnlist[1] and eventually get to returnlist[9], which doesn't exist.
It seems that what you actually need is to iterate over range(0,9).
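Putting that fix together, a sketch of the corrected loop on synthetic stand-in data (the prices below are hypothetical; note also that the `if stock not in returnlist[i]` guard in the question is always False for that frame's own columns, so it is dropped here):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for returnlist: a list of 9 price DataFrames
dates = pd.date_range('2012-01-09', periods=5, freq='B')
returnlist = [pd.DataFrame({'AAPL': [60.2, 60.5, 60.4, 60.2, 60.0],
                            'XOM':  [85.5, 85.7, 85.1, 84.7, 84.9]},
                           index=dates)
              for _ in range(9)]

weeklyreturns = []
for frame in returnlist:                 # iterate the frames directly, no index arithmetic
    returns = pd.DataFrame()
    for stock in frame:                  # iterating a DataFrame yields column names
        returns[stock] = np.log(frame[stock]).diff()
    weeklyreturns.append(returns)

cumulative = [r.cumsum() for r in weeklyreturns]   # accumulated log returns
```

Iterating over the list directly sidesteps the off-by-one range entirely.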

Pandas DateTimeIndex - Shifting over index

So I'm working on some technical analysis using Pandas, however I'm struggling with the DateTimeIndex, since a lot of financial data doesn't have a consistent frequency.
I use pandas_datareader to get yahoo finance data containing DateTimeIndex, Open, Close, High, Low and Volume prices. Next I'm calculating some Dates that I want to start analysing. My problem is that once I have those dates, it's really hard for me to 'access' the values corresponding to the previous and next trading day. Shift on the dataframe only works on the dataframe itself, and won't shift the indices. Shift on a DateTimeIndex would only work with a consistent frequency.
Open High Low Close Adj Close Volume
Date
2017-05-11 160.330002 160.520004 157.550003 158.539993 158.539993 5677400
2017-05-12 159.110001 160.839996 158.509995 160.809998 160.809998 5092900
2017-05-15 160.250000 161.779999 159.759995 160.020004 160.020004 4972000
2017-05-16 160.500000 161.179993 159.330002 159.410004 159.410004 3464900
2017-05-17 158.089996 158.779999 153.000000 153.199997 153.199997 8184500
2017-05-18 153.610001 156.889999 153.240005 155.699997 155.699997 6802700
2017-05-19 156.149994 158.050003 155.910004 157.020004 157.020004 4091500
2017-05-22 157.860001 158.600006 156.429993 157.160004 157.160004 3744100
2017-05-23 157.750000 158.309998 156.800003 157.949997 157.949997 3370900
2017-05-24 158.350006 158.479996 157.169998 157.750000 157.750000 2970800
So for example, given the date 2017-05-19, I would like to be able to access the row for 2017-05-18 as well as 2017-05-22. Not just the values (those are still easily found using shift on the original df); I also want the DatetimeIndex entry that is 'next in line'.
Any help on this problem would be greatly appreciated.
--- EDIT
I had an index 'series' that contained multiple dates, and I wanted to find the 'next rows' for each date in that series.
tmp = data.iloc[8:15, :1]
print(tmp)
h, l = momentum_gaps(data)
print(h)
print( tmp.iloc[ tmp.index.get_loc[h] ] )
This code produces the output
Open
Date
2017-05-23 157.750000
2017-05-24 158.350006
2017-05-25 161.000000
2017-05-26 162.839996
2017-05-30 163.600006
2017-05-31 163.610001
2017-06-01 163.520004
DatetimeIndex(['2017-05-25', '2017-07-12', '2017-07-18'], dtype='datetime64[ns]', name=u'Date', freq=None)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-159-a3f58efdc9d2> in <module>()
5 print(h)
6
----> 7 print( tmp.iloc[ tmp.index.get_loc[h] ] )
TypeError: 'instancemethod' object has no attribute '__getitem__'
You can use get_loc and iloc
t = '2017-05-19'
req_row = df.index.get_loc(t)
Now get the slice of the dataframe
df.iloc[[req_row-1, req_row,req_row+1]]
You get
Open High Low Close Adj_Close Volume
Date
2017-05-18 153.610001 156.889999 153.240005 155.699997 155.699997 6802700
2017-05-19 156.149994 158.050003 155.910004 157.020004 157.020004 4091500
2017-05-22 157.860001 158.600006 156.429993 157.160004 157.160004 3744100
Edit:
Say you have a series, get the indices in a list tmp.
tmp = df.iloc[4:8].index.tolist()
Now to get the next row for each date,
req_rows = [df.index.get_loc(t)+1 for t in tmp]
df.iloc[req_rows]
You get
Open High Low Close Adj_Close Volume
Date
2017-05-18 153.610001 156.889999 153.240005 155.699997 155.699997 6802700
2017-05-19 156.149994 158.050003 155.910004 157.020004 157.020004 4091500
2017-05-22 157.860001 158.600006 156.429993 157.160004 157.160004 3744100
2017-05-23 157.750000 158.309998 156.800003 157.949997 157.949997 3370900
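For a whole DatetimeIndex of dates at once (like `h` in the edit), `Index.get_indexer` returns all the positions in a single call, avoiding the per-date list comprehension. A sketch on hypothetical data shaped like the frame above:

```python
import pandas as pd

df = pd.DataFrame({'Open': [153.61, 156.15, 157.86, 157.75]},
                  index=pd.to_datetime(['2017-05-18', '2017-05-19',
                                        '2017-05-22', '2017-05-23']))

h = pd.DatetimeIndex(['2017-05-18', '2017-05-22'])
pos = df.index.get_indexer(h)   # position of each date in h
next_rows = df.iloc[pos + 1]    # the next trading day for each date
```

As with the loop version, `pos + 1` will run off the end of the frame if a date in `h` is the last row, so a bounds check is still needed in practice.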