Iterating over timeseries data in pandas

I am working with historical stock data stored in a dataframe, h2, that looks like this:
print (h2)
Open High Low Close Adj Close Volume
Date
2021-10-14 439.079987 442.660004 438.579987 442.500000 442.500000 70236800
2021-10-15 444.750000 446.260010 444.089996 445.869995 445.869995 66226800
2021-10-18 443.970001 447.549988 443.269989 447.190002 447.190002 62213200
2021-10-19 448.920013 450.709991 448.269989 450.640015 450.640015 46881100
2021-10-20 451.130005 452.730011 451.010010 451.875000 451.875000 21651910
I am trying to iterate over this data with a for loop, and I get an error relating to pandas Timestamps:
for d, r in h2.iterrows():
    print(d, h2[d])
KeyError: Timestamp('2021-10-14 00:00:00')
It seems that iterrows() is changing the type of the index value, so that it becomes inaccessible.
Is there a better way to iterate over a pandas timeseries?
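A likely fix (a sketch, assuming h2 is the frame shown above): iterrows() is not changing the index type; indexing with h2[d] looks the Timestamp up among the columns, which is why it raises the KeyError. Use the row r that iterrows() already yields, or .loc for label-based row access:
for d, r in h2.iterrows():
    # r is the row as a Series; no second lookup into h2 is needed
    print(d, r['Close'])
    # equivalent label-based lookup by Timestamp:
    # print(d, h2.loc[d, 'Close'])
If only a couple of columns are needed, itertuples() (or vectorised operations that avoid the Python loop entirely) is usually faster than iterrows().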

Related

Seasonal Decomposition plots won't show despite pandas recognizing DateTime Index

I loaded the data as follows:
og_data=pd.read_excel(r'C:\Users\hp\Downloads\dataframe.xlsx',index_col='DateTime')
And it looks like this:
DateTime A
2019-02-04 10:37:54 0
2019-02-04 10:47:54 1
2019-02-04 10:57:54 2
2019-02-04 11:07:54 3
2019-02-04 11:17:54 4
The problem is that I'm trying to set the data to a frequency, but there are NaN values that I have to drop, and even if I don't, the frequency seems to be irregular. I've got pandas to recognize the DateTime column as the index:
og_data.index
DatetimeIndex([...], dtype='datetime64[ns]', name='DateTime', length=15536, freq=None)
but when I try doing this:
og_data.index.freq = '10T'
That should mean 10min, right?
But I get the following error instead:
ValueError: Inferred frequency None from passed values does not conform to passed frequency 10T
Even if I set the frequency to days:
og_data.index.freq = 'D'
I get a similar error.
The goal is to get seasonal decomposition plots because I want to forecast the data. But I get the following error when I try to do so:
result=seasonal_decompose(og_data['A'],model='add')
result.plot();
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
Which makes sense: I can't set the datetime index to a specified frequency. I need it to be every 10 minutes, please advise.
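One way through this (a sketch, assuming og_data is the frame loaded above): setting .freq directly only validates the index that is already there, so an irregular index first has to be coerced onto a regular 10-minute grid, e.g. with asfreq() (or resample('10T').mean() if several readings can fall into one bin), and the resulting NaNs filled; the interpolation and the daily period below are assumptions, not something the data dictates:
from statsmodels.tsa.seasonal import seasonal_decompose

# put the data on a regular 10-minute grid anchored at the first timestamp;
# rows that are missing or misaligned become NaN
regular = og_data.asfreq('10T')
regular['A'] = regular['A'].interpolate()   # assumed: linear fill of the gaps

# the index now has freq='10T'; pass the period explicitly, since statsmodels
# cannot always infer one from a minute-level frequency (144 = one day of 10-min steps)
result = seasonal_decompose(regular['A'], model='add', period=144)
result.plot()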

Plotly Bar Chart Based on Pandas dataframe grouped by year

I have a pandas dataframe that I've tried to group by year on 'Close Date' and then plot 'ARR (USD)' on the y-axis against the year on the x-axis.
All seems fine after grouping:
sumyr = brandarr.groupby(brandarr['Close Date'].dt.year,as_index=True).sum()
ARR (USD)
Close Date
2017 17121174.33
2018 15383130.32
But when I try to plot:
trace = [go.Bar(
    x=sumyr['Close Date'],
    y=sumyr['ARR (USD)']
)]
I get the error: KeyError: 'Close Date'
I'm sure it's something stupid, I'm a newbie, but I've been messing with it for an hour and well, here I am. Thanks!
In your groupby call you used as_index=True, so Close Date is now the index rather than a column. If you want to access an index, use pandas .loc or .iloc.
To access the index values directly, use:
sumyr.index.tolist()
Check here: Pandas - how to get the data frame index as an array
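Applied to the chart above, a minimal sketch (the graph_objects import and the Figure/show calls are assumptions about the surrounding plotting code):
import plotly.graph_objects as go

trace = [go.Bar(
    x=sumyr.index,               # the grouped years live in the index, not in a column
    y=sumyr['ARR (USD)']
)]
fig = go.Figure(data=trace)
fig.show()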

Resampling/interpolating/extrapolating columns of a pandas dataframe

I am interested in knowing how to interpolate/resample/extrapolate columns of a pandas dataframe for both pure numerical and datetime type indices. I'd like to perform this with either straightforward linear interpolation or spline interpolation.
Consider first a simple pandas data frame that has a numerical index (signifying time) and a couple of columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(0,20,2))
print(df)
0 1
0 0.937961 0.943746
2 1.687854 0.866076
4 0.410656 -0.025926
6 -2.042386 0.956386
8 1.153727 -0.505902
10 -1.546215 0.081702
12 0.922419 0.614947
14 0.865873 -0.014047
16 0.225841 -0.831088
18 -0.048279 0.314828
I would like to resample the columns of this dataframe over some denser grid of time indices which possibly extend beyond the last time index (thus requiring extrapolation).
Denote the denser grid of indices as, for example:
t = np.arange(0,40,.6)
The interpolate method for a pandas dataframe seems to interpolate only NaNs, and thus requires the new indices (which may or may not coincide with the original indices) to already be part of the dataframe. I guess I could append a dataframe of NaNs at the new indices to the original dataframe (excluding any indices appearing in both), call interpolate, and then remove the original time indices. Or I could do everything in scipy and create a new dataframe at the desired time indices.
Is there a more direct way to do this?
In addition, I'd like to know how to do this same thing when the indices are, in fact, datetimes. That is, when, for example:
df.index = np.array('2015-07-04 02:12:40', dtype=np.datetime64) + np.arange(0,20,2)
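One fairly direct route (a sketch; the helper name resample_df and the use of scipy are my own choices, not from the question) is to interpolate and extrapolate each column with scipy on the new grid and rebuild the frame, since pandas' interpolate() only fills NaNs at labels that already exist; a DatetimeIndex can be handled the same way via its integer nanosecond values:
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def resample_df(df, new_index, kind='linear'):
    # interpolate/extrapolate every column of df onto new_index
    x = df.index.values
    if np.issubdtype(x.dtype, np.datetime64):        # datetimes -> int64 nanoseconds
        x = x.astype('int64')
        new_x = pd.DatetimeIndex(new_index).asi8
    else:
        new_x = np.asarray(new_index)
    out = pd.DataFrame(index=new_index)
    for col in df.columns:
        f = interp1d(x, df[col].values, kind=kind, fill_value='extrapolate')
        out[col] = f(new_x)
    return out

dense = resample_df(df, np.arange(0, 40, .6))        # linear; kind='cubic' for a spline
A pandas-only alternative is the union-reindex idea mentioned above, df.reindex(df.index.union(t)).interpolate(method='index').loc[t], though that will not truly extrapolate beyond the last original index.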

building a DataFrame of a portfolio of symbols

I'm new to pandas.
I'd like to read the quotes for a number of symbols (e.g. ['SPY', 'IWM', 'QQQ']) from Yahoo (which I do with no problem) and then I'd like to use only the 'Adj Close' columns to build a portfolio of ETFs over a given period of time.
Say that I'd like to start with an empty DataFrame whose index are the dates where the market is open, taken for example from the first df. Subsequently, I'd like to "append" to the right one single column at a time with the 'Adj Close' of each symbol, renamed with the ticker name.
I'm sure it must be simple, but I can't get it. Can anybody help me? thank you in advance.
If you are just using the Adj Close column, it is easiest to extract it immediately after reading the data.
# pandas.io.data was pandas' built-in remote-data module at the time; in current
# pandas this reader lives in the separate pandas_datareader package
import pandas.io.data as web
df = web.DataReader(['F', 'AAPL', 'IBM'], 'yahoo', '2016-05-02', '2016-05-06')['Adj Close']
>>> df
AAPL F IBM
Date
2016-05-02 93.073328 13.62 143.881476
2016-05-03 94.604009 13.43 142.752373
2016-05-04 93.620002 13.31 142.871221
2016-05-05 93.239998 13.32 145.070003
2016-05-06 92.720001 13.44 147.289993
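If each symbol really is downloaded separately, as described, the column-at-a-time build is also straightforward. A sketch, assuming quotes is a dict of the already downloaded per-symbol frames (that name is an assumption, not from the question), each with an 'Adj Close' column indexed by trading date:
import pandas as pd

symbols = ['SPY', 'IWM', 'QQQ']
portfolio = pd.DataFrame(index=quotes[symbols[0]].index)   # trading dates taken from the first frame
for sym in symbols:
    portfolio[sym] = quotes[sym]['Adj Close']              # one 'Adj Close' column per ticker, renamed to the ticker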

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). An example is that for previous rows, the location is filled out for the row but it is blank in the next row. I know that the location has not changed but it is capturing it as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced here in the documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group.
Here's an easy way to do this: https://github.com/pandas-dev/pandas/issues/11296
According to jreback's answer, ffill() is not optimized when run through a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
# count non-null values seen so far per column within each id; where the count is
# still 0, the forward-filled value leaked in from the previous id, so mask it out
# (note: the multiplication trick assumes numeric columns)
mask = (1 - df.isnull().astype(int)).groupby(df['id']).cumsum().applymap(lambda x: None if x == 0 else 1)
df = df.ffill() * mask