Missing data in pandas web.DataReader yahoo - stock

I have been collecting data from Yahoo, and it worked fine for more than 3 months. Then last week I noticed that data from 22.4.2021 until today is missing.
When I print the stockData, these are the last records:
...
...
2021-04-20 7.52 7.07 7.35 7.13 4119197.0 7.13
2021-04-21 7.36 7.13 7.16 7.25 3110870.0 7.25
2021-04-22 7.73 7.10 7.22 7.59 13178439.0 7.59
2021-05-06 10.08 9.48 9.52 9.93 2753885.0 9.93
As you can see, data is missing. Is this caused by Yahoo or by the pandas function? Could you help me fix it?
Alper
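For reference, a minimal sketch of the kind of call involved; the ticker and date range are placeholders, since the question does not show the actual code:
import pandas_datareader.data as web

# Hypothetical ticker and date range -- not from the question.
stockData = web.DataReader('AAPL', 'yahoo', start='2021-01-01', end='2021-05-06')
print(stockData.tail())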

Related

pandas (multi) index wrong, need to change it

I have a DataFrame multiData that looks like this:
print(multiData)
Date Open High Low Close Adj Close Volume
Ticker Date
AAPL 0 2010-01-04 7.62 7.66 7.59 7.64 6.51 493729600
1 2010-01-05 7.66 7.70 7.62 7.66 6.52 601904800
2 2010-01-06 7.66 7.69 7.53 7.53 6.41 552160000
3 2010-01-07 7.56 7.57 7.47 7.52 6.40 477131200
4 2010-01-08 7.51 7.57 7.47 7.57 6.44 447610800
... ... ... ... ... ... ... ...
META 2668 2022-12-23 116.03 118.18 115.54 118.04 118.04 17796600
2669 2022-12-27 117.93 118.60 116.05 116.88 116.88 21392300
2670 2022-12-28 116.25 118.15 115.51 115.62 115.62 19612500
2671 2022-12-29 116.40 121.03 115.77 120.26 120.26 22366200
2672 2022-12-30 118.16 120.42 117.74 120.34 120.34 19492100
I need to get rid of the index level named "Date" that just holds 0, 1, 2, ... and make the actual "Date" column part of the (multi) index.
How do I do this?
Use df.droplevel to delete level 1, then chain df.set_index to add the Date column to the index by setting the append parameter to True.
df = df.droplevel(1).set_index('Date', append=True)
df
Open High Low Close Adj Close Volume
Ticker Date
AAPL 2010-01-04 7.62 7.66 7.59 7.64 6.51 493729600
2010-01-05 7.66 7.70 7.62 7.66 6.52 601904800
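For a fully self-contained illustration, here is a sketch on a tiny made-up frame (the values are placeholders, not taken from the question):
import pandas as pd

# Hypothetical miniature of multiData: the index is (Ticker, 0..n) and Date
# is an ordinary column.
df = pd.DataFrame(
    {'Date': pd.to_datetime(['2010-01-04', '2010-01-05']),
     'Close': [7.64, 7.66]},
    index=pd.MultiIndex.from_tuples([('AAPL', 0), ('AAPL', 1)],
                                    names=['Ticker', 'Date']),
)

# Drop the integer level, then append the Date column to the Ticker level.
df = df.droplevel(1).set_index('Date', append=True)
print(df)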

How to merge records with aggregate historical data?

I have a table with individual records and another which holds historical information about the individuals in the former.
I want to extract information about the individuals from the second table. Both tables have timestamps. It is very important that the historical information happened before the record in the first table.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name is the variable by which you connect the two tables:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first, such as the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
To this line I would add 5 for the cc count and 3.126 for the cc mean. Remember, the historical records need to be before the date time of the individual record.
I am a bit confused about how to do this efficiently. I know I need to group the historical data.
Also, the individual records usually come in groups with the same Date_Time, if that makes it any easier.
IIUC:
try:
# merge both DataFrames on name
out = df1.merge(df2, on='name', suffixes=('', '_y'))
# keep only historical rows that occurred strictly before each record
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()
# aggregate cc per record
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126
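To make this reproducible, a sketch that rebuilds the two frames from the rows shown above (df1 reduced to the single record of interest):
import pandas as pd

df1 = pd.DataFrame({
    'Date_Time': pd.to_datetime(['2021-09-06 10:46:00']),
    'name': ['Hollyhill Island'],
})
df2 = pd.DataFrame({
    'Date_Time': pd.to_datetime([
        '2021-09-15 12:16:00', '2021-09-06 10:46:00', '2021-05-30 18:28:00',
        '2021-05-25 10:46:00', '2021-05-18 12:46:00', '2021-04-05 12:31:00',
        '2021-04-28 12:16:00',
    ]),
    'name': ['Hollyhill Island'] * 7,
    'cc': [6.00, 4.50, 3.50, 2.50, 2.38, 3.50, 3.75],
})

out = df1.merge(df2, on='name', suffixes=('', '_y'))
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
print(out)  # one row: count 5, mean 3.126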

Pandas resample only when it makes sense

I have a time series that is very irregular. The time difference between two records can be 1 second or 10 days.
I want to resample the data every 1 hour, but only when the sequential records are less than 1 hour apart.
How do I approach this without making too many loops?
In the example below, I would like to resample only rows 5-6 (the delta is 10 seconds) and rows 6-7 (the delta is 50 seconds).
The others should remain as they are.
tmp = vals[['datumtijd', 'filter data']]
datumtijd filter data
0 1970-11-01 00:00:00 129.0
1 1970-12-01 00:00:00 143.0
2 1971-01-05 00:00:00 151.0
3 1971-02-01 00:00:00 151.0
4 1971-03-01 00:00:00 163.0
5 1971-03-01 00:00:10 163.0
6 1971-03-01 00:00:20 163.0
7 1971-03-01 00:01:10 163.0
8 1971-03-01 00:04:10 163.0
.. ... ...
244 1981-08-19 00:00:00 102.0
245 1981-09-02 00:00:00 98.0
246 1981-09-17 00:00:00 92.0
247 1981-10-01 00:00:00 89.0
248 1981-10-19 00:00:00 92.0
You can be a little explicit about this by using groupby on the hour-floor of the time stamps:
grouped = df.groupby(df['datumtijd'].dt.floor('1H')).mean()
This is explicitly looking for the hour of each existing data point and grouping the matching ones.
But you can also just do the resample and then filter out the empty data, as pandas can still do this pretty quickly:
resampled = df.resample('1H', on='datumtijd').mean().dropna()
In either case, you get the following (note that I changed the last time stamp just so that the console would show the hours):
filter data
datumtijd
1970-11-01 00:00:00 129.0
1970-12-01 00:00:00 143.0
1971-01-05 00:00:00 151.0
1971-02-01 00:00:00 151.0
1971-03-01 00:00:00 163.0
1981-08-19 00:00:00 102.0
1981-09-02 00:00:00 98.0
1981-09-17 00:00:00 92.0
1981-10-01 00:00:00 89.0
1981-10-19 03:00:00 92.0
One quick clarification: in your example, rows 5-8 all occur within the same hour, so they all get grouped into a single hourly bucket.
Also, see this related post.
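For completeness, a minimal runnable sketch of the groupby-floor variant, on a hypothetical reconstruction of rows 4-8 above:
import pandas as pd

# Hypothetical reconstruction of rows 4-8 from the question.
df = pd.DataFrame({
    'datumtijd': pd.to_datetime([
        '1971-03-01 00:00:00', '1971-03-01 00:00:10', '1971-03-01 00:00:20',
        '1971-03-01 00:01:10', '1971-03-01 00:04:10',
    ]),
    'filter data': [163.0, 163.0, 163.0, 163.0, 163.0],
})

# All five stamps floor to the same hour, so they collapse to one row.
grouped = df.groupby(df['datumtijd'].dt.floor('1H'))['filter data'].mean()
print(grouped)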

Future dates calculating incorrectly in FBProphet - make_future_dataframe method

I'm trying to do a weekly forecast in FBProphet for just 5 weeks ahead. The make_future_dataframe method doesn't seem to be working right: it makes the correct one-week intervals except for the gap between Jul 3 and Jul 5; every other interval is a correct 7 days. Code and output below:
INPUT DATAFRAME
ds y
548 2010-01-01 3117
547 2010-01-08 2850
546 2010-01-15 2607
545 2010-01-22 2521
544 2010-01-29 2406
... ... ...
4 2020-06-05 2807
3 2020-06-12 2892
2 2020-06-19 3012
1 2020-06-26 3077
0 2020-07-03 3133
CODE
future = m.make_future_dataframe(periods=5, freq='W')
future.tail(9)
OUTPUT
ds
545 2020-06-12
546 2020-06-19
547 2020-06-26
548 2020-07-03
549 2020-07-05
550 2020-07-12
551 2020-07-19
552 2020-07-26
553 2020-08-02
All you need to do is create a dataframe with the dates you need and pass it to the predict method; using make_future_dataframe is not necessary.
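A minimal sketch, assuming m is the fitted Prophet model and df is the input frame shown above. The odd gap comes from pandas itself: freq='W' is shorthand for weeks anchored on Sunday (W-SUN), so the generated dates snap to Sundays, and Jul 3, 2020 (a Friday) is followed by Sunday Jul 5. Spacing the dates by an explicit 7 days keeps the Friday cadence of the history:
import pandas as pd

# Build the 5 future dates by hand; freq='7D' preserves the weekday of the
# last history date instead of snapping to Sundays the way freq='W' does.
future = pd.DataFrame({
    'ds': pd.date_range(start=df['ds'].max(), periods=6, freq='7D')[1:]
})
forecast = m.predict(future)
print(forecast[['ds', 'yhat']])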

seasonal_decompose: operands could not be broadcast together with shapes on a series

I know there are many questions on this topic, but none of them helped me to solve this problem. I'm really stuck on this.
With a simple series:
0
2016-01-31 266
2016-02-29 235
2016-03-31 347
2016-04-30 514
2016-05-31 374
2016-06-30 250
2016-07-31 441
2016-08-31 422
2016-09-30 323
2016-10-31 168
2016-11-30 496
2016-12-31 303
import numpy as np
import statsmodels.api as sm

logdf = np.log(df[0])
decompose = sm.tsa.seasonal_decompose(logdf, freq=12, model='additive')
decomplot = decompose.plot()
I keep getting: ValueError: operands could not be broadcast together with shapes (12,) (14,)
I've tried pretty much everything: passing only logdf.values, passing a non-log series. Nothing works.
statsmodels, pandas, and numpy versions:
print(statsmodels.__version__)
print(pd.__version__)
print(np.__version__)
0.6.1
0.18.1
1.11.3
As @yoonforh pointed out, in my case this was fixed by setting the freq parameter to less than the time series length. E.g. if your time series ts looks like this:
2014-01-01 0.0
2014-02-01 0.0
2014-03-01 1.0
2014-04-01 1.0
2014-05-01 0.0
2014-06-01 1.0
2014-07-01 1.0
2014-08-01 0.0
2014-09-01 0.0
2014-10-01 1.0
2014-11-01 0.0
2014-12-01 0.0
the shape is
(12,)
so this will give the error as per above:
seasonal_decompose(ts, freq=12, model='additive')
but if I try freq=11 or any other int less than 12, e.g.
seasonal_decompose(ts, freq=11, model='additive')
this works.
I noticed that with newer pandas and statsmodels versions it seems to work.
Given a series:
2016-01-03 8.326275
2016-01-10 8.898229
2016-01-17 8.754792
2016-01-24 8.658172
2016-01-31 8.731659
2016-02-07 9.047233
2016-02-14 8.799662
2016-02-21 8.783549
2016-02-28 8.782783
2016-03-06 9.081825
2016-03-13 8.737934
2016-03-20 8.658693
2016-03-27 8.666475
2016-04-03 9.029178
2016-04-10 8.781555
2016-04-17 8.720787
2016-04-24 8.633909
2016-05-01 8.937744
2016-05-08 8.804925
2016-05-15 8.766862
2016-05-22 8.651899
2016-05-29 8.653645
...
And pandas/statsmodels versions:
statsmodels.__version__ 0.8.0
pandas.__version__ 0.20.1
This is the result:
import numpy as np
import statsmodels.api as sm

logdf = np.log(df_series)
decompose = sm.tsa.seasonal_decompose(logdf, model='additive', filt=None, freq=1, two_sided=True)
decompose.plot()
I hope this solves your problem too.
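For anyone on a current stack, a hypothetical sketch of the same decomposition with the modern API: newer statsmodels renamed freq to period and requires the series to cover at least two full cycles, so monthly seasonality needs 24+ observations:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical two-year monthly series: modern seasonal_decompose requires
# at least 2 * period observations (here 24 >= 2 * 12).
idx = pd.date_range('2016-01-01', periods=24, freq='MS')
df_series = pd.Series(np.linspace(200, 500, 24), index=idx)

logdf = np.log(df_series)
decompose = sm.tsa.seasonal_decompose(logdf, model='additive', period=12)
decompose.plot()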