Using set_index in time series to eliminate holiday data rows from DataFrame - pandas

I am trying to eliminate holiday data from a time series pandas DataFrame. The instructions I am following processes a DatetimeSeries and uses the function set_index() to apply this DatetimeSeries to the DataFrame which results in a time series without the holidays. This set_index() function is not working for me. Check out the code...
{data_day.tail()}
Open High Low Close Volume
Date
2018-05-20 NaN NaN NaN NaN 0.0
2018-05-21 2732.50 2739.25 2725.25 2730.50 210297692.0
2018-05-22 2726.00 2741.75 2721.50 2738.25 179224835.0
2018-05-23 2731.75 2732.75 2708.50 2710.50 292305588.0
2018-05-24 2726.00 2730.50 2705.75 2725.00 312575571.0
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
usb = CustomBusinessDay(calendar=USFederalHolidayCalendar())
usb
<CustomBusinessDay>
data_day_No_Holiday = pd.date_range(start='9/7/2005', end='5/21/2018', freq=usb)
data_day_No_Holiday
DatetimeIndex(['2005-09-07', '2005-09-08', '2005-09-09', '2005-09-12',
'2005-09-13', '2005-09-14', '2005-09-15', '2005-09-16',
'2005-09-19', '2005-09-20',
...
'2018-05-08', '2018-05-09', '2018-05-10', '2018-05-11',
'2018-05-14', '2018-05-15', '2018-05-16', '2018-05-17',
'2018-05-18', '2018-05-21'],
dtype='datetime64[ns]', length=3187, freq='C')
data_day.set_index(data_day_No_Holidays, inplace=True)
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-118-cf7521d08f6f> in <module>()
----> 1 data_day.set_index(data_day_No_Holidays, inplace=True)
2 # inplace=True tells python to modify the original df and to NOT create a new one.
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
3923 index._cleanup()
3924
-> 3925 frame.index = index
3926
3927 if not inplace:
~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
4383 try:
4384 object.__getattribute__(self, name)
-> 4385 return object.__setattr__(self, name, value)
4386 except AttributeError:
4387 pass
pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()
~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
643
644 def _set_axis(self, axis, labels):
--> 645 self._data.set_axis(axis, labels)
646 self._clear_item_cache()
647
~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
3321 raise ValueError(
3322 'Length mismatch: Expected axis has {old} elements, new '
-> 3323 'values have {new} elements'.format(old=old_len, new=new_len))
3324
3325 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 4643 elements, new values have 3187 elements
This process seemed to work beautifully for another programmer.
Can anyone suggestion a datatype conversion or a function that will apply the DatetimeIndex to the DataFrame that will result in dropping all datarows (holidays) that are NOT represented in the data_day_No_Holiday DatetimeIndex?
Thanks, Let me know if I made any formatting errors or if I am leaving out any relevant information...

Use reindex:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
usb = CustomBusinessDay(calendar=USFederalHolidayCalendar())
data_day_No_Holiday = pd.date_range(start='1/1/2018', end='12/31/2018', freq=usb)
data_day = pd.DataFrame({'Values':np.random.randint(0,100,365)},index = pd.date_range('2018-01-01', periods=365, freq='D'))
data_day.reindex(data_day_No_Holiday).dropna()'
Output(head):
Values
2018-01-02 38
2018-01-03 1
2018-01-04 16
2018-01-05 43
2018-01-08 95

Related

Sum n values in numpy array based on pandas index

I am trying to calculate the cumulative sum of the first n values in a numpy array, where n is a value in each row of a pandas dataframe. I have set up a little example problem with a single column and it works fine, but it does not work when I have more than one column.
Example problem that fails:
a=np.ones((10,))
df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
df['nj']=df['nj'].astype(int)
df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
df
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_23612/1905114001.py in <module>
2 df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
3 df['nj']=df['nj'].astype(int)
----> 4 df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
5 df
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
~\AppData\Local\Temp/ipykernel_23612/1905114001.py in <lambda>(x)
2 df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
3 df['nj']=df['nj'].astype(int)
----> 4 df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
5 df
TypeError: slice indices must be integers or None or have an __index__ method
Example problem that works:
a=np.ones((10,))
df=pd.DataFrame([4.,6.,5.],columns=['nj'])
df['nj']=df['nj'].astype(int)
df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
df
nj nsum
0 4 4.0
1 6 6.0
2 5 5.0
In both cases:
print(a.shape)
print(a.dtype)
print(type(df))
print(df['nj'].dtype)
(10,)
float64
<class 'pandas.core.frame.DataFrame'>
int32
A work around that is not very satisfying, especially because I would eventually like to use multiple columns in the lambda function, is:
tmp=pd.DataFrame(df['nj'])
df['nsum'] = tmp.apply(lambda x: np.sum(delr[:x['nj']]),axis=1)
Any clarification on what I have missed here or better work arounds?
IIUC, you can do it in numpy with numpy.take and numpy.cumsum:
np.take(np.cumsum(a, axis=0), df['nj'], axis=0)
A small adjustment to pass just the column of interest (df['nj']) to lambda solved my initial issue:
df['nsum'] = df['nj'].apply(lambda x: np.sum(a[:x]))
Using mozway's suggestion of np.take and np.cumsum along with a less ambiguous(?) example, the following will also work (but note the x-1 since the initial problem states "the cumulative sum of the first n values" rather than the cumulative sum to index n):
a=np.array([3,2,4,5,1,2,3])
df=pd.DataFrame([[4.,2],[6.,1],[5.,3.]],columns=['nj','ni'])
df['nj']=df['nj'].astype(int)
df[['nsumj']]=df['nj'].apply(lambda x: np.take(np.cumsum(a),x-1))
#equivalent?
# df[['nsumj']]=df['nj'].apply(lambda x: np.cumsum(a)[x-1])
print(a)
print(df)
Output:
[3 2 4 5 1 2 3]
nj ni nsumj
0 4 2.0 14
1 6 1.0 17
2 5 3.0 15
From the example here it seems the key to using multiple columns in the funtion (the next issue I was running into and hinted at) is to unpack the columns, so I will put this here incase it helps anyone:
df['nprod']=df[['ni','nj']].apply(lambda x: np.multiply(*x),axis=1)

Pandas. How to reset index in a df that is resampled

Possible newbie question.
I have a df of daily stock prices;
print(df.head())
it prints the following:
High Low Open Close Volume Adj Close 100ma 250ma
Date
2015-01-02 314.750000 306.959991 312.579987 308.519989 2783200 308.519989 308.519989 308.519989
2015-01-05 308.380005 300.850006 307.010010 302.190002 2774200 302.190002 305.354996 305.354996
2015-01-06 303.000000 292.380005 302.239990 295.290009 3519000 295.290009 302.000000 302.000000
2015-01-07 301.279999 295.329987 297.500000 298.420013 2640300 298.420013 301.105003 301.105003
2015-01-08 303.140015 296.109985 300.320007 300.459991 3088400 300.459991 300.976001 300.976001
next I wanted to resample it and change it into a weekly chart:
df_ohcl = df.resample('W',loffset=pd.offsets.timedelta(days=-6)).apply({
'Open': 'first', 'High': 'max', 'Low': 'min','Close': 'last', 'Volume': 'sum'})
it gives me the right weekly values:
Open High Low Close Volume
Date
2014-12-29 312.579987 314.750000 306.959991 308.519989 2783200
2015-01-05 307.010010 308.380005 292.380005 296.929993 14614300
2015-01-12 297.559998 301.500000 285.250000 290.739990 20993900
2015-01-19 292.589996 316.929993 286.390015 312.390015 22999200
2015-01-26 311.820007 359.500000 299.329987 354.529999 41666500
I want to now move this information to matplotlib,
as well as convert the dates to the mdates version. Since I'm just going to graph the columns in Matplotlib, I actually don't want the date to be an index anymore, so I tried:
df_ohlc.reset_index(inplace=True)
get an error:
ValueError Traceback (most recent call last)
<ipython-input-149-6c0c324e68a8> in <module>
5 '''
6
----> 7 df_ohlc.reset_index(inplace=True)
8
9 df_ohcl.head()
~\anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
4602 # to ndarray and maybe infer different dtype
4603 level_values = _maybe_casted_values(lev, lab)
-> 4604 new_obj.insert(0, name, level_values)
4605
4606 new_obj.index = new_index
~\anaconda3\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
3494 self._ensure_valid_index(value)
3495 value = self._sanitize_column(column, value, broadcast=False)
-> 3496 self._data.insert(loc, column, value, allow_duplicates=allow_duplicates)
3497
3498 def assign(self, **kwargs) -> "DataFrame":
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in insert(self, loc, item, value, allow_duplicates)
1171 if not allow_duplicates and item in self.items:
1172 # Should this be a different kind of error??
-> 1173 raise ValueError(f"cannot insert {item}, already exists")
1174
1175 if not isinstance(loc, int):
ValueError: cannot insert ('level_0', ''), already exists
How can I fix it, so Date becomes just another column?
Thanks in advance for any help!
It can be convenient to keep the date column as an index, for plotting in MatPlotLib. Here is an example:
First, import packages and re-create the weekly data frame:
from io import StringIO
import numpy as np
import pandas as pd
data = '''Date Open High Low Close Volume
2014-12-29 312.579987 314.750000 306.959991 308.519989 2783200
2015-01-05 307.010010 308.380005 292.380005 296.929993 14614300
2015-01-12 297.559998 301.500000 285.250000 290.739990 20993900
2015-01-19 292.589996 316.929993 286.390015 312.390015 22999200
2015-01-26 311.820007 359.500000 299.329987 354.529999 41666500
'''
weekly = (pd.read_csv(StringIO(data), sep=' +', engine='python')
.assign(Date = lambda x: pd.to_datetime(x['Date'],
format='%Y-%m-%d',
errors='coerce'))
.set_index('Date'))
Open High Low Close Volume
Date
2014-12-29 312.579987 314.750000 306.959991 308.519989 2783200
2015-01-05 307.010010 308.380005 292.380005 296.929993 14614300
2015-01-12 297.559998 301.500000 285.250000 290.739990 20993900
2015-01-19 292.589996 316.929993 286.390015 312.390015 22999200
2015-01-26 311.820007 359.500000 299.329987 354.529999 41666500
Next, create the plot. The index (dates, in weekly steps) become the x-axis.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 9))
for field in ['Open', 'High', 'Low', 'Close']:
ax.plot(weekly[field], label=field)
ax.legend()
plt.show();
Create column that will retain date info
df['Date'] = df.index
Set generated range that is length of DataFrame as index
df.index = range(len(df))

How to make a plot from method read_html of Pandas on Python 2.7?

I'm trying to make a plot (whichever) and cannot see method .plot() and also i'm getting this traceback: (The data is a print of df)
[ 2019 I II III IV
Total
3373 Barrio1 1175 1117 1081 Â
8079 Barrio2 2651 2570 2858 Â
3839 Barrio232 1364 1237 1238 Â
1762 Barrio2342342 544 547 671 Â
3946 Barrio224235 1257 1291 1398 Â
Traceback (most recent call last):
File "D:/Users/str_leu/Documents/PycharmProjects/flask/graphs.py", line 13, in <module>
plt.scatter(df['barrios'], df['leuros'])
TypeError: list indices must be integers, not str
Process finished with exit code 1
and the code is:
import pandas
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
table = BeautifulSoup(open('./PycharmProjects/flask/tables.html', 'r').read(), features="lxml").find('table')
df = pandas.read_html(str(table), decimal=',', thousands='.', index_col=0)
print df
plt.scatter(df['barrios'], df['euros'])
plt.show()
UPDATED
df = pandas.read_html(str(table), decimal=',', thousands='.', index_col=2, header=1)
At the end i found how to deal with it but the problem is the last column (strange character) anyone know how to skip it?
UPDATED2
[ District2352 1.175 1.117 1.081 Unnamed: 5
3.373
8079 District23422 2651 2570 2858 NaN
3839 District7678 1364 1237 1238 NaN
1762 Distric3 544 547 671 NaN
3946 dISTRICT1 1257 1291 1398 NaN
Need to drop last column (entire) but dont know the process to pass from read_html method of pandas to DataFrame and then draw a plot...
UPDATED 3
2019 I II III IV
Total
3373 dISTRICT1 1175 1117 1081 NaN
8079 District2 2651 2570 2858 NaN
This is an example with the headers
pandas.read_html returns a list of DataFrames. Currently you're trying to access the list using an str, which is causing the error. Depending on your requirements, you can either plot columns from each using a for loop, or combine the dataframes in someway using pd.concat
import seaborn as sns
# If each dataframe holds the same columns you want to plot
dfs = pandas.read_html(str(table), decimal=',', thousands='.', index_col=0)
for df in dfs:
# you would need to individually define the plot you want
df["2019"].value_counts().plot(kind='bar')
df.plot(x='I', y='II') # etc
# you could also try seaborn's pairplot. This will omit categorical data
sns.pairplot(df)
SOLUTION
dfs = pandas.read_html(str(table), decimal=',', thousands='.', header=1, index_col=1, encoding='utf-8').pop(0)
print dfs
x=[]
y=[]
y1=[]
y2=[]
for i, row in dfs.iterrows():
x.append(row[0])
y.append(int(row[1]))
y1.append(int(row[2]))
y2.append(int(row[3]))
plt.plot(x,y)
plt.plot(x,y1)
plt.plot(x,y2)
plt.show()

Facebook-Prophet: Overflow error when fitting

I wanted to practice with prophet so I decided to download the "Yearly mean total sunspot number [1700 - now]" data from this place
http://www.sidc.be/silso/datafiles#total.
This is my code so far
import numpy as np
import matplotlib.pyplot as plt
from fbprophet import Prophet
from fbprophet.plot import plot_plotly
import plotly.offline as py
import datetime
py.init_notebook_mode()
plt.style.use('classic')
df = pd.read_csv('SN_y_tot_V2.0.csv',delimiter=';', names = ['ds', 'y','C3', 'C4', 'C5'])
df = df.drop(columns=['C3', 'C4', 'C5'])
df.plot(x="ds", style='-',figsize=(10,5))
plt.xlabel('year',fontsize=15);plt.ylabel('mean number of sunspots',fontsize=15)
plt.xticks(np.arange(1701.5, 2018.5,40))
plt.ylim(-2,300);plt.xlim(1700,2020)
plt.legend()df['ds'] = pd.to_datetime(df.ds, format='%Y')
df['ds'] = pd.to_datetime(df.ds, format='%Y')
m = Prophet(yearly_seasonality=True)
Everything looks good so far and df['ds'] is in date time format.
However when I execute
m.fit(df)
I get the following error
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-57-a8e399fdfab2> in <module>()
----> 1 m.fit(df)
/anaconda2/envs/mde/lib/python3.7/site-packages/fbprophet/forecaster.py in fit(self, df, **kwargs)
1055 self.history_dates = pd.to_datetime(df['ds']).sort_values()
1056
-> 1057 history = self.setup_dataframe(history, initialize_scales=True)
1058 self.history = history
1059 self.set_auto_seasonalities()
/anaconda2/envs/mde/lib/python3.7/site-packages/fbprophet/forecaster.py in setup_dataframe(self, df, initialize_scales)
286 df['cap_scaled'] = (df['cap'] - df['floor']) / self.y_scale
287
--> 288 df['t'] = (df['ds'] - self.start) / self.t_scale
289 if 'y' in df:
290 df['y_scaled'] = (df['y'] - df['floor']) / self.y_scale
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
990 # test_dt64_series_add_intlike, which the index dispatching handles
991 # specifically.
--> 992 result = dispatch_to_index_op(op, left, right, pd.DatetimeIndex)
993 return construct_result(
994 left, result, index=left.index, name=res_name, dtype=result.dtype
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/ops/__init__.py in dispatch_to_index_op(op, left, right, index_class)
628 left_idx = left_idx._shallow_copy(freq=None)
629 try:
--> 630 result = op(left_idx, right)
631 except NullFrequencyError:
632 # DatetimeIndex and TimedeltaIndex with freq == None raise ValueError
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/indexes/datetimelike.py in __sub__(self, other)
521 def __sub__(self, other):
522 # dispatch to ExtensionArray implementation
--> 523 result = self._data.__sub__(maybe_unwrap_index(other))
524 return wrap_arithmetic_op(self, other, result)
525
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py in __sub__(self, other)
1278 result = self._add_offset(-other)
1279 elif isinstance(other, (datetime, np.datetime64)):
-> 1280 result = self._sub_datetimelike_scalar(other)
1281 elif lib.is_integer(other):
1282 # This check must come after the check for np.timedelta64
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in _sub_datetimelike_scalar(self, other)
856
857 i8 = self.asi8
--> 858 result = checked_add_with_arr(i8, -other.value, arr_mask=self._isnan)
859 result = self._maybe_mask_results(result)
860 return result.view("timedelta64[ns]")
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/algorithms.py in checked_add_with_arr(arr, b, arr_mask, b_mask)
1006
1007 if to_raise:
-> 1008 raise OverflowError("Overflow in int64 addition")
1009 return arr + b
1010
OverflowError: Overflow in int64 addition```
I understand that there's an issue with 'ds', but I am not sure whether there is something wring with the column's format or an open issue.
Does anyone have any idea how to fix this? I have checked some issues in github, but they haven't been of much help in this case.
Thanks
This is not an answer to fix the issue, but how to avoid the error.
I got the same error, and manage to get rid of the error when I reduce the number of data that is coming in OR when I reduce the horizon span of the forecast.
For example, I limit my training data to only start since 1825 meanwhile I have data from the year of 1700s. I also tried to limit my forecast days from 10 years forecast to only 1 year. Both managed to get rid of the error.
My guess this problem has something to do with how the ARIMA is implemented inside the Prophet itself which in some cases the number is just to huge to be managed by int64 and become overflow.

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
print(df.groupby(['B']).agg(np.mean))
except Exception as err:
print(err)
# No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
if not isinstance(val, float):
print('Error: val = {!r}'.format(val))