Sum n values in numpy array based on pandas index - pandas

I am trying to calculate the cumulative sum of the first n values in a numpy array, where n is a value in each row of a pandas dataframe. I have set up a little example problem with a single column and it works fine, but it does not work when I have more than one column.
Example problem that fails:
a=np.ones((10,))
df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
df['nj']=df['nj'].astype(int)
df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
df
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_23612/1905114001.py in <module>
2 df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
3 df['nj']=df['nj'].astype(int)
----> 4 df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
5 df
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
~\AppData\Local\Temp/ipykernel_23612/1905114001.py in <lambda>(x)
2 df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
3 df['nj']=df['nj'].astype(int)
----> 4 df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
5 df
TypeError: slice indices must be integers or None or have an __index__ method
Example problem that works:
a=np.ones((10,))
df=pd.DataFrame([4.,6.,5.],columns=['nj'])
df['nj']=df['nj'].astype(int)
df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
df
nj nsum
0 4 4.0
1 6 6.0
2 5 5.0
In both cases:
print(a.shape)
print(a.dtype)
print(type(df))
print(df['nj'].dtype)
(10,)
float64
<class 'pandas.core.frame.DataFrame'>
int32
A workaround that is not very satisfying, especially because I would eventually like to use multiple columns in the lambda function, is:
tmp=pd.DataFrame(df['nj'])
df['nsum']=tmp.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
Any clarification on what I have missed here, or better workarounds?

IIUC, you can do it in numpy with numpy.take and numpy.cumsum:
np.take(np.cumsum(a, axis=0), df['nj'], axis=0)

A small adjustment to pass just the column of interest (df['nj']) to lambda solved my initial issue:
df['nsum'] = df['nj'].apply(lambda x: np.sum(a[:x]))
Using mozway's suggestion of np.take and np.cumsum, along with a less ambiguous(?) example, the following will also work. Note the x-1: the initial problem asks for "the cumulative sum of the first n values", not the cumulative sum up to index n:
a=np.array([3,2,4,5,1,2,3])
df=pd.DataFrame([[4.,2],[6.,1],[5.,3.]],columns=['nj','ni'])
df['nj']=df['nj'].astype(int)
df[['nsumj']]=df['nj'].apply(lambda x: np.take(np.cumsum(a),x-1))
#equivalent?
# df[['nsumj']]=df['nj'].apply(lambda x: np.cumsum(a)[x-1])
print(a)
print(df)
Output:
[3 2 4 5 1 2 3]
nj ni nsumj
0 4 2.0 14
1 6 1.0 17
2 5 3.0 15
From the example here it seems the key to using multiple columns in the function (the next issue I was running into and hinted at) is to unpack the columns, so I will put this here in case it helps anyone:
df['nprod']=df[['ni','nj']].apply(lambda x: np.multiply(*x),axis=1)
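As a footnote, both derived columns above can be computed without apply at all; a vectorized sketch on the same a and df as in the example (apply calls Python once per row, while this indexes the cumulative sum once for the whole column):

```python
import numpy as np
import pandas as pd

a = np.array([3, 2, 4, 5, 1, 2, 3])
df = pd.DataFrame([[4., 2], [6., 1], [5., 3.]], columns=['nj', 'ni'])
df['nj'] = df['nj'].astype(int)

# Sum of the first nj values: index the cumulative sum once for all rows.
csum = np.cumsum(a)
df['nsumj'] = csum[df['nj'] - 1]

# Elementwise product: plain column arithmetic, no unpacking needed.
df['nprod'] = df['ni'] * df['nj']
```

On large frames this avoids the per-row Python overhead of apply entirely.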


Iterating Rows in DataFrame and Applying difflib.ratio()

Context of Problem
I am working on a project where I would like to compare two columns from a dataframe to determine what percent of the strings are similar to each other. Specifically, I'm comparing whether bullets scraped from retailer websites match the bullets that I expect to see on those sites for a given product.
I know that I can simply use boolean logic to determine if the value from column ['X'] == column ['Y']. But I'd like to take it to another level and determine what percentage of X matches Y. I did some research and found that difflib.ratio() can accomplish what I want.
Example of difflib.ratio()
a = 'preview'
b = 'previeu'
SequenceMatcher(a=a, b=b).ratio()
My Use Case
Where I'm having trouble is applying this logic to iterate through a DataFrame. This is what my DataFrame looks like.
The DataFrame has 5 "Bullets" and 5 "SEO Bullets". So I tried using a for loop to apply a lambda function to my DataFrame called test.
for x in range(1,6):
    test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
But I received the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-409-39a6ba3c8879> in <module>
1 for x in range(1,6):
----> 2 test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7539 kwds=kwds,
7540 )
-> 7541 return op.get_result()
7542
7543 def applymap(self, func) -> "DataFrame":
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_standard(self)
253
254 def apply_standard(self):
--> 255 results, res_index = self.apply_series_generator()
256
257 # wrap results
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
282 for i, v in enumerate(series_gen):
283 # ignore SettingWithCopy here in case the user mutates
--> 284 results[i] = self.f(v)
285 if isinstance(results[i], ABCSeries):
286 # If we have a view on v, we need to make a copy because
<ipython-input-409-39a6ba3c8879> in <lambda>(row)
1 for x in range(1,6):
----> 2 test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if (
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
989
990 # Similar to Index.get_value, but we do not fall back to positional
--> 991 loc = self.index.get_loc(label)
992 return self.index._get_values_for_loc(self, loc, label)
993
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
352 except ValueError as err:
353 raise KeyError(key) from err
--> 354 raise KeyError(key)
355 return super().get_loc(key, method=method, tolerance=tolerance)
356
KeyError: 'SeoBullet_1'
Desired Output
Ideally, the final output would be a dataframe that has 5 additional columns with the ratios for each Bullet comparison.
I'm still new-ish to Python, so I could just be naïve and missing something very obvious. If there is another route I could take to accomplish the same thing (or something very similar), I am open to those suggestions.
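One likely culprit, for anyone landing here: DataFrame.apply defaults to axis=0 (column-wise), so the row variable is actually a column Series with no 'SeoBullet_1' label, which matches the KeyError above; passing axis=1 should fix it. A sketch with invented column values shaped like the question's frame:

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical data mimicking the question's layout (names assumed).
test = pd.DataFrame({
    'SeoBullet_1': ['preview', 'durable steel frame'],
    'Bullet 1': ['previeu', 'durable steel frame'],
})

for x in range(1, 2):
    test[f'Bullet {x} Ratio'] = test.apply(
        lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'],
                                    b=row[f'Bullet {x}']).ratio(),
        axis=1,  # iterate over rows, not columns
    )
```

With axis=1 each row is passed to the lambda as a Series labeled by column names, so row['SeoBullet_1'] resolves.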

How to make a plot from method read_html of Pandas on Python 2.7?

I'm trying to make a plot (any kind) but cannot use the .plot() method, and I'm getting this traceback. (The data below is a print of df.)
[ 2019 I II III IV
Total
3373 Barrio1 1175 1117 1081 Â
8079 Barrio2 2651 2570 2858 Â
3839 Barrio232 1364 1237 1238 Â
1762 Barrio2342342 544 547 671 Â
3946 Barrio224235 1257 1291 1398 Â
Traceback (most recent call last):
File "D:/Users/str_leu/Documents/PycharmProjects/flask/graphs.py", line 13, in <module>
plt.scatter(df['barrios'], df['leuros'])
TypeError: list indices must be integers, not str
Process finished with exit code 1
and the code is:
import pandas
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
table = BeautifulSoup(open('./PycharmProjects/flask/tables.html', 'r').read(), features="lxml").find('table')
df = pandas.read_html(str(table), decimal=',', thousands='.', index_col=0)
print df
plt.scatter(df['barrios'], df['euros'])
plt.show()
UPDATED
df = pandas.read_html(str(table), decimal=',', thousands='.', index_col=2, header=1)
In the end I found how to deal with it, but the problem is the last column (strange character). Does anyone know how to skip it?
UPDATED2
[ District2352 1.175 1.117 1.081 Unnamed: 5
3.373
8079 District23422 2651 2570 2858 NaN
3839 District7678 1364 1237 1238 NaN
1762 Distric3 544 547 671 NaN
3946 dISTRICT1 1257 1291 1398 NaN
I need to drop the last column entirely, but I don't know how to get from the list returned by pandas.read_html to a DataFrame and then draw a plot...
UPDATED 3
2019 I II III IV
Total
3373 dISTRICT1 1175 1117 1081 NaN
8079 District2 2651 2570 2858 NaN
This is an example with the headers
pandas.read_html returns a list of DataFrames. Currently you're trying to index that list with a str, which is causing the error. Depending on your requirements, you can either plot columns from each DataFrame using a for loop, or combine the DataFrames in some way using pd.concat.
import seaborn as sns
# If each dataframe holds the same columns you want to plot
dfs = pandas.read_html(str(table), decimal=',', thousands='.', index_col=0)
for df in dfs:
    # you would need to individually define the plot you want
    df["2019"].value_counts().plot(kind='bar')
    df.plot(x='I', y='II')  # etc
    # you could also try seaborn's pairplot. This will omit categorical data
    sns.pairplot(df)
SOLUTION
dfs = pandas.read_html(str(table), decimal=',', thousands='.', header=1, index_col=1, encoding='utf-8').pop(0)
print dfs
x=[]
y=[]
y1=[]
y2=[]
for i, row in dfs.iterrows():
    x.append(row[0])
    y.append(int(row[1]))
    y1.append(int(row[2]))
    y2.append(int(row[3]))
plt.plot(x,y)
plt.plot(x,y1)
plt.plot(x,y2)
plt.show()
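For comparison, once the table is in a DataFrame, the manual x/y lists above are usually unnecessary; DataFrame.plot can draw every numeric column in one call. A sketch with invented district data shaped like the question's table:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless

import pandas as pd

# Invented data standing in for the scraped table.
df = pd.DataFrame({
    'barrio': ['Barrio1', 'Barrio2', 'Barrio3'],
    'I': [1175, 2651, 1364],
    'II': [1117, 2570, 1237],
    'III': [1081, 2858, 1238],
})

# Drop any all-NaN trailing column, use the district as the index,
# then draw one bar group per barrio, one bar per quarter.
df = df.dropna(axis=1, how='all').set_index('barrio')
ax = df.plot(kind='bar')
```

The all-NaN "Unnamed" column from the question would disappear in the dropna(axis=1, how='all') step.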

Facebook-Prophet: Overflow error when fitting

I wanted to practice with prophet so I decided to download the "Yearly mean total sunspot number [1700 - now]" data from this place
http://www.sidc.be/silso/datafiles#total.
This is my code so far
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
from fbprophet.plot import plot_plotly
import plotly.offline as py
import datetime
py.init_notebook_mode()
plt.style.use('classic')
df = pd.read_csv('SN_y_tot_V2.0.csv',delimiter=';', names = ['ds', 'y','C3', 'C4', 'C5'])
df = df.drop(columns=['C3', 'C4', 'C5'])
df.plot(x="ds", style='-',figsize=(10,5))
plt.xlabel('year',fontsize=15);plt.ylabel('mean number of sunspots',fontsize=15)
plt.xticks(np.arange(1701.5, 2018.5,40))
plt.ylim(-2,300);plt.xlim(1700,2020)
plt.legend()
df['ds'] = pd.to_datetime(df.ds, format='%Y')
m = Prophet(yearly_seasonality=True)
Everything looks good so far and df['ds'] is in date time format.
However when I execute
m.fit(df)
I get the following error
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-57-a8e399fdfab2> in <module>()
----> 1 m.fit(df)
/anaconda2/envs/mde/lib/python3.7/site-packages/fbprophet/forecaster.py in fit(self, df, **kwargs)
1055 self.history_dates = pd.to_datetime(df['ds']).sort_values()
1056
-> 1057 history = self.setup_dataframe(history, initialize_scales=True)
1058 self.history = history
1059 self.set_auto_seasonalities()
/anaconda2/envs/mde/lib/python3.7/site-packages/fbprophet/forecaster.py in setup_dataframe(self, df, initialize_scales)
286 df['cap_scaled'] = (df['cap'] - df['floor']) / self.y_scale
287
--> 288 df['t'] = (df['ds'] - self.start) / self.t_scale
289 if 'y' in df:
290 df['y_scaled'] = (df['y'] - df['floor']) / self.y_scale
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
990 # test_dt64_series_add_intlike, which the index dispatching handles
991 # specifically.
--> 992 result = dispatch_to_index_op(op, left, right, pd.DatetimeIndex)
993 return construct_result(
994 left, result, index=left.index, name=res_name, dtype=result.dtype
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/ops/__init__.py in dispatch_to_index_op(op, left, right, index_class)
628 left_idx = left_idx._shallow_copy(freq=None)
629 try:
--> 630 result = op(left_idx, right)
631 except NullFrequencyError:
632 # DatetimeIndex and TimedeltaIndex with freq == None raise ValueError
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/indexes/datetimelike.py in __sub__(self, other)
521 def __sub__(self, other):
522 # dispatch to ExtensionArray implementation
--> 523 result = self._data.__sub__(maybe_unwrap_index(other))
524 return wrap_arithmetic_op(self, other, result)
525
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py in __sub__(self, other)
1278 result = self._add_offset(-other)
1279 elif isinstance(other, (datetime, np.datetime64)):
-> 1280 result = self._sub_datetimelike_scalar(other)
1281 elif lib.is_integer(other):
1282 # This check must come after the check for np.timedelta64
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in _sub_datetimelike_scalar(self, other)
856
857 i8 = self.asi8
--> 858 result = checked_add_with_arr(i8, -other.value, arr_mask=self._isnan)
859 result = self._maybe_mask_results(result)
860 return result.view("timedelta64[ns]")
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/algorithms.py in checked_add_with_arr(arr, b, arr_mask, b_mask)
1006
1007 if to_raise:
-> 1008 raise OverflowError("Overflow in int64 addition")
1009 return arr + b
1010
OverflowError: Overflow in int64 addition
I understand that there's an issue with 'ds', but I am not sure whether there is something wrong with the column's format or whether this is an open issue.
Does anyone have any idea how to fix this? I have checked some issues in github, but they haven't been of much help in this case.
Thanks
This is not an answer that fixes the issue, but a way to avoid the error.
I got the same error, and managed to get rid of it when I reduced the amount of incoming data OR when I reduced the horizon of the forecast.
For example, I limited my training data to start only at 1825, even though I have data from the 1700s. I also tried limiting my forecast from 10 years to only 1 year. Both got rid of the error.
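In code, that truncation amounts to filtering the history frame before calling m.fit(df); a minimal sketch with synthetic yearly data (fbprophet itself is not needed for the filtering step, and the cutoff year is the one from the workaround above):

```python
import pandas as pd

# Synthetic yearly history shaped like the sunspot data (1700-2019).
df = pd.DataFrame({'ds': pd.date_range('1700-01-01', periods=320, freq='YS'),
                   'y': range(320)})

# Keep only rows from 1825 onward, then pass the result to m.fit(df).
df = df[df['ds'] >= pd.Timestamp('1825-01-01')].reset_index(drop=True)
```

The same idea applies to shrinking the horizon: pass a smaller periods value to make_future_dataframe.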
My guess is that this problem has something to do with how the model is implemented inside Prophet itself, which in some cases produces numbers too large to be handled by int64, causing the overflow.

Using set_index in time series to eliminate holiday data rows from DataFrame

I am trying to eliminate holiday data from a time series pandas DataFrame. The instructions I am following build a DatetimeIndex and use the set_index() function to apply it to the DataFrame, which results in a time series without the holidays. This set_index() call is not working for me. Check out the code...
data_day.tail()
Open High Low Close Volume
Date
2018-05-20 NaN NaN NaN NaN 0.0
2018-05-21 2732.50 2739.25 2725.25 2730.50 210297692.0
2018-05-22 2726.00 2741.75 2721.50 2738.25 179224835.0
2018-05-23 2731.75 2732.75 2708.50 2710.50 292305588.0
2018-05-24 2726.00 2730.50 2705.75 2725.00 312575571.0
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
usb = CustomBusinessDay(calendar=USFederalHolidayCalendar())
usb
<CustomBusinessDay>
data_day_No_Holiday = pd.date_range(start='9/7/2005', end='5/21/2018', freq=usb)
data_day_No_Holiday
DatetimeIndex(['2005-09-07', '2005-09-08', '2005-09-09', '2005-09-12',
'2005-09-13', '2005-09-14', '2005-09-15', '2005-09-16',
'2005-09-19', '2005-09-20',
...
'2018-05-08', '2018-05-09', '2018-05-10', '2018-05-11',
'2018-05-14', '2018-05-15', '2018-05-16', '2018-05-17',
'2018-05-18', '2018-05-21'],
dtype='datetime64[ns]', length=3187, freq='C')
data_day.set_index(data_day_No_Holidays, inplace=True)
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-118-cf7521d08f6f> in <module>()
----> 1 data_day.set_index(data_day_No_Holidays, inplace=True)
2 # inplace=True tells python to modify the original df and to NOT create a new one.
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
3923 index._cleanup()
3924
-> 3925 frame.index = index
3926
3927 if not inplace:
~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
4383 try:
4384 object.__getattribute__(self, name)
-> 4385 return object.__setattr__(self, name, value)
4386 except AttributeError:
4387 pass
pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()
~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
643
644 def _set_axis(self, axis, labels):
--> 645 self._data.set_axis(axis, labels)
646 self._clear_item_cache()
647
~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
3321 raise ValueError(
3322 'Length mismatch: Expected axis has {old} elements, new '
-> 3323 'values have {new} elements'.format(old=old_len, new=new_len))
3324
3325 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 4643 elements, new values have 3187 elements
This process seemed to work beautifully for another programmer.
Can anyone suggest a datatype conversion or a function that will apply the DatetimeIndex to the DataFrame and drop all rows (holidays) that are NOT represented in the data_day_No_Holiday DatetimeIndex?
Thanks! Let me know if I made any formatting errors or if I left out any relevant information...
Use reindex:
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
usb = CustomBusinessDay(calendar=USFederalHolidayCalendar())
data_day_No_Holiday = pd.date_range(start='1/1/2018', end='12/31/2018', freq=usb)
data_day = pd.DataFrame({'Values': np.random.randint(0, 100, 365)}, index=pd.date_range('2018-01-01', periods=365, freq='D'))
data_day.reindex(data_day_No_Holiday).dropna()
Output(head):
Values
2018-01-02 38
2018-01-03 1
2018-01-04 16
2018-01-05 43
2018-01-08 95
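An alternative, if reindex feels indirect: filter the existing index instead of replacing it, so the two lengths never have to match (this sidesteps the ValueError from the question entirely). A sketch under the same setup:

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

usb = CustomBusinessDay(calendar=USFederalHolidayCalendar())
no_holiday = pd.date_range(start='1/1/2018', end='12/31/2018', freq=usb)
data_day = pd.DataFrame(
    {'Values': np.random.randint(0, 100, 365)},
    index=pd.date_range('2018-01-01', periods=365, freq='D'))

# Keep only the rows whose dates appear in the business-day calendar.
filtered = data_day[data_day.index.isin(no_holiday)]
```

Because this selects rows rather than assigning a new index, it works even when the DataFrame has extra or missing dates.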

Error in filtering groupby results in pandas

I am trying to filter groupby results in pandas using the example provided at:
http://pandas.pydata.org/pandas-docs/dev/groupby.html#filtration
but getting the following error (pandas 0.12):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-d0014484ff78> in <module>()
1 grouped = my_df.groupby('userID')
----> 2 grouped.filter(lambda x: len(x) >= 5)
/Users/zz/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in filter(self, func, dropna, *args, **kwargs)
2092 res = path(group)
2093
-> 2094 if res:
2095 indexers.append(self.obj.index.get_indexer(group.index))
2096
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What does it mean and how can it be resolved?
EDIT:
code to replicate the problem in pandas 0.12 stable
dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123') })
dff.groupby('A').filter(lambda x: len(x) > 2)
This was a quasi-bug in 0.12 and will be fixed in 0.13; the res is now protected by a type check:
if isinstance(res, (bool, np.bool_)):
    if res:
        add_indices()
I'm not quite sure how you got this error, however; the docs are actually compiled and run against actual pandas. You should ensure you're reading the docs for the correct version (in this case you were linking to dev rather than stable, although the API is largely unchanged).
The standard workaround is to do this using transform, which in this case would be something like:
In [10]: g = dff.groupby('A')

In [11]: dff[g.B.transform(lambda x: len(x) > 2)]
Out[11]:
A B C
0 2 1 1
1 2 2 2
2 2 3 3
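On recent pandas versions the filter call in the question works as-is; the transform workaround can also use the built-in 'size' reduction instead of a Python lambda, which is typically faster. A sketch on the same toy frame:

```python
import pandas as pd

dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123')})

# Boolean mask: True for every row belonging to a group with more than 2 rows.
mask = dff.groupby('A')['B'].transform('size') > 2
result = dff[mask]
```

Here every row is in the single group A='2' of size 3, so all rows survive, matching the Out[11] above.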