How to reset the index in a resampled DataFrame - pandas

Possible newbie question.
I have a df of daily stock prices;
print(df.head())
which prints the following:
High Low Open Close Volume Adj Close 100ma 250ma
Date
2015-01-02 314.750000 306.959991 312.579987 308.519989 2783200 308.519989 308.519989 308.519989
2015-01-05 308.380005 300.850006 307.010010 302.190002 2774200 302.190002 305.354996 305.354996
2015-01-06 303.000000 292.380005 302.239990 295.290009 3519000 295.290009 302.000000 302.000000
2015-01-07 301.279999 295.329987 297.500000 298.420013 2640300 298.420013 301.105003 301.105003
2015-01-08 303.140015 296.109985 300.320007 300.459991 3088400 300.459991 300.976001 300.976001
Next, I wanted to resample it to weekly data:
df_ohlc = df.resample('W', loffset=pd.offsets.timedelta(days=-6)).apply({
    'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last', 'Volume': 'sum'})
It gives me the right weekly values:
Open High Low Close Volume
Date
2014-12-29 312.579987 314.750000 306.959991 308.519989 2783200
2015-01-05 307.010010 308.380005 292.380005 296.929993 14614300
2015-01-12 297.559998 301.500000 285.250000 290.739990 20993900
2015-01-19 292.589996 316.929993 286.390015 312.390015 22999200
2015-01-26 311.820007 359.500000 299.329987 354.529999 41666500
I now want to move this information to matplotlib, and also convert the dates to the mdates format. Since I'm just going to graph the columns in matplotlib, I actually don't want the date to be an index anymore, so I tried:
df_ohlc.reset_index(inplace=True)
but I get an error:
ValueError Traceback (most recent call last)
<ipython-input-149-6c0c324e68a8> in <module>
5 '''
6
----> 7 df_ohlc.reset_index(inplace=True)
8
9 df_ohlc.head()
~\anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
4602 # to ndarray and maybe infer different dtype
4603 level_values = _maybe_casted_values(lev, lab)
-> 4604 new_obj.insert(0, name, level_values)
4605
4606 new_obj.index = new_index
~\anaconda3\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
3494 self._ensure_valid_index(value)
3495 value = self._sanitize_column(column, value, broadcast=False)
-> 3496 self._data.insert(loc, column, value, allow_duplicates=allow_duplicates)
3497
3498 def assign(self, **kwargs) -> "DataFrame":
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in insert(self, loc, item, value, allow_duplicates)
1171 if not allow_duplicates and item in self.items:
1172 # Should this be a different kind of error??
-> 1173 raise ValueError(f"cannot insert {item}, already exists")
1174
1175 if not isinstance(loc, int):
ValueError: cannot insert ('level_0', ''), already exists
How can I fix it so that Date becomes just another column?
Thanks in advance for any help!

It can be convenient to keep the dates as the index for plotting in matplotlib. Here is an example:
First, import packages and re-create the weekly data frame:
from io import StringIO
import numpy as np
import pandas as pd
data = '''Date Open High Low Close Volume
2014-12-29 312.579987 314.750000 306.959991 308.519989 2783200
2015-01-05 307.010010 308.380005 292.380005 296.929993 14614300
2015-01-12 297.559998 301.500000 285.250000 290.739990 20993900
2015-01-19 292.589996 316.929993 286.390015 312.390015 22999200
2015-01-26 311.820007 359.500000 299.329987 354.529999 41666500
'''
weekly = (pd.read_csv(StringIO(data), sep=' +', engine='python')
            .assign(Date=lambda x: pd.to_datetime(x['Date'],
                                                  format='%Y-%m-%d',
                                                  errors='coerce'))
            .set_index('Date'))
print(weekly)
Open High Low Close Volume
Date
2014-12-29 312.579987 314.750000 306.959991 308.519989 2783200
2015-01-05 307.010010 308.380005 292.380005 296.929993 14614300
2015-01-12 297.559998 301.500000 285.250000 290.739990 20993900
2015-01-19 292.589996 316.929993 286.390015 312.390015 22999200
2015-01-26 311.820007 359.500000 299.329987 354.529999 41666500
Next, create the plot. The index (dates, in weekly steps) becomes the x-axis.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 9))
for field in ['Open', 'High', 'Low', 'Close']:
    ax.plot(weekly[field], label=field)
ax.legend()
plt.show()
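If you want more control over the date ticks while keeping the DatetimeIndex, matplotlib's date locators and formatters can be applied to the axis directly. A small optional sketch, continuing from the fig/ax above:

import matplotlib.dates as mdates

# Put a tick on every Monday and label it with the ISO date
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MO))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
fig.autofmt_xdate()  # rotate the labels so they don't overlap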

Create a column that retains the date info:
df['Date'] = df.index
Then set a generated range, equal in length to the DataFrame, as the new index:
df.index = range(len(df))
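As for the original error: "cannot insert ('level_0', ''), already exists" usually means a reset column is already present in the frame, which typically happens when the notebook cell calling reset_index has been executed more than once. On a freshly resampled frame, a single reset_index followed by mdates.date2num (which is what the question was aiming for) should work. A minimal sketch, assuming df_ohlc still has its DatetimeIndex:

import matplotlib.dates as mdates

df_ohlc = df_ohlc.reset_index()                          # Date becomes a normal column
df_ohlc['Date'] = df_ohlc['Date'].map(mdates.date2num)  # floats for matplotlib plotting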

Related

pandas.groupby --> DatetimeIndex --> groupby year

I come from Javascript and am struggling. I need to sort the data by its DatetimeIndex and then group it by year.
The CSV looks like this (I shortened it; the full file has more than 1300 entries):
date,value
2016-05-09,1201
2017-05-10,2329
2018-05-11,1716
2019-05-12,10539
I wrote my code like this to throw away the first and last 2.5 percent of the dataframe:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
df = pd.read_csv("fcc-forum-pageviews.csv", index_col="date", parse_dates=True).sort_values('value')
df = df.iloc[(int(round((df.count() / 100 * 2.5)[0]))):(int(round(((df.count() / 100 * 97.5)[0]) - 1)))]
df = df.sort_index()
Now I need to group my DatetimeIndex by year so I can plot it properly with matplotlib. This is where I'm stuck:
def draw_bar_plot():
    df_bar = df
    fig, ax = plt.subplots()
    fig.savefig('bar_plot.png')
    return fig
I really don't know how to group by year.
Doing something like:
print(df_bar.groupby(df_bar.index).first())
leads to:
value
date
2016-05-19 19736
2016-05-20 17491
2016-05-26 18060
2016-05-27 19997
2016-05-28 19044
... ...
2019-11-23 146658
2019-11-24 138875
2019-11-30 141161
2019-12-01 142918
2019-12-03 158549
How can I group this by year? It would also help if you could explain how to plot the data accurately as a bar chart with matplotlib.
This will group the data by year:
df_year_wise_sum = df.groupby([df.index.year]).sum()
And these lines will draw and save a bar plot:
df_year_wise_sum.plot(kind='bar')
plt.savefig('bar_plot.png')
plt.show()
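For reference, here is a self-contained sketch of the same idea with invented daily data (the date range and values are made up, not the asker's CSV):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented daily page-view counts spanning several years
idx = pd.date_range('2016-05-09', '2019-12-03', freq='D')
df = pd.DataFrame({'value': np.random.randint(1000, 20000, len(idx))}, index=idx)

# df.index.year extracts the year component of the DatetimeIndex,
# so groupby collapses all rows of the same year into one sum
df_year_wise_sum = df.groupby(df.index.year).sum()
df_year_wise_sum.plot(kind='bar')
plt.savefig('bar_plot.png')
plt.show()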

How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data, and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the pandas Time series/date functionality documentation, which has some information on out-of-bounds spans, but from this information, it still wasn't clear to me what, if anything, I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow about this, but they either just point out the problem (i.e. nanoseconds, max range of 570-something years), or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd
df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
                   'date': ['1671-11-25', '1669-11-22', '1666-05-15', '1673-01-18', '1675-05-07',
                            '1677-02-08', '1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create daily periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print (df)
id date
0 836 1671-11-25
1 655 1669-11-22
2 508 1666-05-15
3 793 1673-01-18
4 970 1675-05-07
5 1075 1677-02-08
6 1119 1678-02-08
7 969 1675-02-15
8 1166 1678-11-28
9 893 1673-12-23
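For context: nanosecond-resolution Timestamps only cover roughly the years 1677 to 2262, which is why to_datetime with errors='coerce' turns these 17th-century dates into NaT. Periods are stored as integer counts of their frequency unit, so daily periods reach much further back. A quick check:

import pandas as pd

print(pd.Timestamp.min)  # 1677-09-21 ..., the nanosecond lower bound
print(pd.Timestamp.max)  # 2262-04-11 ..., the nanosecond upper bound

# A Period has no such restriction at daily frequency
p = pd.Period('1666-05-15', freq='D')
print(p, p.year)  # 1666-05-15 1666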

How to correlate columns from two different dataframes with datetimeindex

I am trying to correlate the same column in two different dataframes (same size). The dfs contain stock data with a DatetimeIndex. Every possible correlation I can come up with only gives NaN for an answer. Is the dtype of the df indexes messing things up? Note: at this point in the program, I don't care what the dates / index actually are.
input:
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like # temp fix
import numpy as np
import fix_yahoo_finance as yf
from pandas_datareader import data, wb
from datetime import date
df1 = yf.download('IBM', start=date(2000, 1, 3), end=date(2000, 1, 5), progress=False)
df2 = yf.download('IBM', start=date(2000, 1, 6), end=date(2000, 1, 10), progress=False)
print (df1)
print (df2)
print (df1['Open'].corr(df2['Open']))
output:
Open High Low Close Adj Close Volume
Date
2000-01-03 112.4375 116.00 111.875 116.0000 81.096031 10347700
2000-01-04 114.0000 114.50 110.875 112.0625 78.343300 8227800
2000-01-05 112.9375 119.75 112.125 116.0000 81.096031 12733200
Open High Low Close Adj Close Volume
Date
2000-01-06 118.00 118.9375 113.500 114.0 79.697784 7971900
2000-01-07 117.25 117.9375 110.625 113.5 79.348267 11856700
2000-01-10 117.25 119.3750 115.375 118.0 82.494217 8540500
nan
The indexes are not matching; that's why you get NaN, I believe. Use numpy.corrcoef on the raw values to get your result:
np.corrcoef(df1['Open'].values,df2['Open'].values)
Output
[[ 1. -0.74615579]
[-0.74615579 1. ]]
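Alternatively, staying in pandas: Series.corr aligns the two series on their index first, and these date ranges do not overlap at all, hence the NaN. Dropping the indexes makes the series align by position, which should reproduce the numpy result. A sketch:

# Positional alignment instead of index alignment
s1 = df1['Open'].reset_index(drop=True)
s2 = df2['Open'].reset_index(drop=True)
print(s1.corr(s2))  # should match the corrcoef value above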

Using set_index in time series to eliminate holiday data rows from DataFrame

I am trying to eliminate holiday data from a time series pandas DataFrame. The instructions I am following build a DatetimeIndex and use the set_index() function to apply it to the DataFrame, which results in a time series without the holidays. This set_index() call is not working for me. Check out the code...
data_day.tail()
Open High Low Close Volume
Date
2018-05-20 NaN NaN NaN NaN 0.0
2018-05-21 2732.50 2739.25 2725.25 2730.50 210297692.0
2018-05-22 2726.00 2741.75 2721.50 2738.25 179224835.0
2018-05-23 2731.75 2732.75 2708.50 2710.50 292305588.0
2018-05-24 2726.00 2730.50 2705.75 2725.00 312575571.0
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
usb = CustomBusinessDay(calendar=USFederalHolidayCalendar())
usb
<CustomBusinessDay>
data_day_No_Holiday = pd.date_range(start='9/7/2005', end='5/21/2018', freq=usb)
data_day_No_Holiday
DatetimeIndex(['2005-09-07', '2005-09-08', '2005-09-09', '2005-09-12',
'2005-09-13', '2005-09-14', '2005-09-15', '2005-09-16',
'2005-09-19', '2005-09-20',
...
'2018-05-08', '2018-05-09', '2018-05-10', '2018-05-11',
'2018-05-14', '2018-05-15', '2018-05-16', '2018-05-17',
'2018-05-18', '2018-05-21'],
dtype='datetime64[ns]', length=3187, freq='C')
data_day.set_index(data_day_No_Holiday, inplace=True)
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-118-cf7521d08f6f> in <module>()
----> 1 data_day.set_index(data_day_No_Holiday, inplace=True)
2 # inplace=True tells python to modify the original df and to NOT create a new one.
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
3923 index._cleanup()
3924
-> 3925 frame.index = index
3926
3927 if not inplace:
~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
4383 try:
4384 object.__getattribute__(self, name)
-> 4385 return object.__setattr__(self, name, value)
4386 except AttributeError:
4387 pass
pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()
~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
643
644 def _set_axis(self, axis, labels):
--> 645 self._data.set_axis(axis, labels)
646 self._clear_item_cache()
647
~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
3321 raise ValueError(
3322 'Length mismatch: Expected axis has {old} elements, new '
-> 3323 'values have {new} elements'.format(old=old_len, new=new_len))
3324
3325 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 4643 elements, new values have 3187 elements
This process seemed to work beautifully for another programmer.
Can anyone suggest a datatype conversion or a function that will apply the DatetimeIndex to the DataFrame, dropping all rows (holidays) that are NOT represented in the data_day_No_Holiday DatetimeIndex?
Thanks! Let me know if I made any formatting errors or if I left out any relevant information.
Use reindex:
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

usb = CustomBusinessDay(calendar=USFederalHolidayCalendar())
data_day_No_Holiday = pd.date_range(start='1/1/2018', end='12/31/2018', freq=usb)
data_day = pd.DataFrame({'Values': np.random.randint(0, 100, 365)},
                        index=pd.date_range('2018-01-01', periods=365, freq='D'))
data_day.reindex(data_day_No_Holiday).dropna()
Output (head):
Values
2018-01-02 38
2018-01-03 1
2018-01-04 16
2018-01-05 43
2018-01-08 95
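A note on why set_index failed: it expects a replacement index with exactly one label per existing row (4643 here), whereas reindex aligns by label. If you would rather filter the original rows directly instead of reindexing and dropping NaNs, boolean selection with Index.isin does the same job. A sketch using the frames above:

# Keep only rows whose date appears in the business-day calendar;
# weekends and holidays are simply dropped
no_holidays = data_day[data_day.index.isin(data_day_No_Holiday)]
print(no_holidays.head())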

ValueError: total size of new array must be unchanged (numpy reshape)

I want to reshape my data vector, but when I run the code
from pandas import read_csv
import numpy as np
#from pandas import Series
#from matplotlib import pyplot
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
A= np.array(series)
B = np.reshape(10,10)
print (B)
I get this error:
result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: total size of new array must be unchanged
My data:
Month xxx
1749-01 58
1749-02 62.6
1749-03 70
1749-04 55.7
1749-05 85
1749-06 83.5
1749-07 94.8
1749-08 66.3
1749-09 75.9
1749-10 75.5
1749-11 158.6
1749-12 85.2
1750-01 73.3
.... ....
.... ....
There seem to be two issues with what you are trying to do. The first relates to how you read the data in pandas:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
print(series)
>>>>Empty DataFrame
Columns: []
Index: [1749-01 58, 1749-02 62.6, 1749-03 70, 1749-04 55.7, 1749-05 85, 1749-06 83.5, 1749-07 94.8, 1749-08 66.3, 1749-09 75.9, 1749-10 75.5, 1749-11 158.6, 1749-12 85.2, 1750-01 73.3]
This isn't giving you a column of floats in a dataframe with the dates as the index; it is putting each whole line (date and value) into the index. I would think that you want to add delimiter=' ' so that it splits the lines properly:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, delimiter=' ', squeeze=True)
>>>> Month
1749-01-01 58.0
1749-02-01 62.6
1749-03-01 70.0
1749-04-01 55.7
1749-05-01 85.0
1749-06-01 83.5
1749-07-01 94.8
1749-08-01 66.3
1749-09-01 75.9
1749-10-01 75.5
1749-11-01 158.6
1749-12-01 85.2
1750-01-01 73.3
Name: xxx, dtype: float64
This gives you the dates as the index with the 'xxx' value in the column.
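As an aside, the squeeze keyword of read_csv was deprecated in pandas 1.4 and later removed; on recent versions the equivalent spelling (assuming the same one-column file) would be:

series = (read_csv('book1.csv', header=0, parse_dates=[0],
                   index_col=0, delimiter=' ')
          .squeeze('columns'))  # replaces squeeze=True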
Secondly, the reshape. The error is quite descriptive in this case. If you want to use numpy.reshape, you can't reshape to a layout that has a different number of elements than the original data. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6]) # size 6 array
a.reshape(2, 3)
>>>> [[1, 2, 3],
[4, 5, 6]]
This is fine because the array starts out length 6, and I'm reshaping to 2 x 3, and 2 x 3 = 6.
However, if I try:
a.reshape(10, 10)
>>>> ValueError: cannot reshape array of size 6 into shape (10,10)
I get the error, because I need 10 x 10 = 100 elements to do this reshape, and I only have 6.
Without the complete dataset it's impossible to know for sure, but I think this is the same problem you are having, although you are converting your whole dataframe to a numpy array.
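One more detail worth flagging: the original call np.reshape(10, 10) never references A at all; numpy treats the first 10 as the array to reshape. Assuming the intent was a 10 x 10 layout of the series values (which requires exactly 100 of them), the array itself has to be passed:

import numpy as np

A = np.arange(100.0)         # stand-in for np.array(series); needs exactly 100 values
B = np.reshape(A, (10, 10))  # or equivalently A.reshape(10, 10)
print(B.shape)               # (10, 10)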