How to implement aggregation functions for pandas groupby objects?

Here's the setup for this question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
        ['a', 2],
        ['a', 3],
        ['a', 4],
        ['b', 5],
        ['b', 6],
        ['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
                                  for i in range(len(varnames))]))
gdf = df.groupby(df.ix[:, 0])
After evaluating the above, df looks like this:
>>> df
   P  Q
0  a  1
1  a  2
2  a  3
3  a  4
4  b  5
5  b  6
6  b  7
gdf is a DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
    Q
P
a  10
b  18
...but repeating the same thing after replacing sum with a pass-through wrapper for it, bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
result[item] = colg.aggregate(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
output = func(group, *args, **kwargs)
File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Here's a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) was a dataframe with a single column, Q. Now, note the result below contains two columns, P and Q:
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
          P         Q
P
a  0.344457  0.344457
b  0.990507  0.990507
I have not been able to find anything in the documentation that would explain:
1. why gdf.aggregate(mysum) should fail (IOW, does this failure agree with documented behavior, or is it a bug in pandas?);
2. why gdf.aggregate(lambda *a, **k: rn.random()) should produce a two-column output while gdf.aggregate(sum) produces a one-column output;
3. what signatures (input and output) an aggregation function foo should have so that gdf.aggregate(foo) returns a table having only column Q (like the result of gdf.aggregate(sum)).

Your problems all come down to the columns that are included in the GroupBy. I think you want to group by P and compute statistics on Q. To do that, use
gdf = df.groupby('P')
instead of your method. Then any aggregations will not include the P column.
The sum in your function is Python's built-in sum. GroupBy.sum() is written in Cython and only acts on numeric dtypes; that's why you get the error about adding ints to strs.
Your other two questions are related to that. You're feeding two columns, P and Q, into gdf.agg, so you get two columns out of gdf.aggregate(lambda *a, **k: rn.random()). gdf.sum() ignores the string column.
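For illustration, here is a minimal, self-contained sketch of that fix (a sketch only, assuming a reasonably recent pandas): once P is used purely as the grouping key, the pass-through wrapper aggregates Q without ever seeing the string column.
import pandas as pd

df = pd.DataFrame({'P': list('aaaabbb'), 'Q': [1, 2, 3, 4, 5, 6, 7]})
mysum = lambda *a, **k: sum(*a, **k)  # the pass-through wrapper from the question

gdf = df.groupby('P')          # group by column name, not by df.ix[:, 0]
print(gdf.aggregate(mysum))
# Expected output:
#     Q
# P
# a  10
# b  18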

Related

Dropping same rows in two pandas dataframe in python

I want to get the uncommon rows of two pandas dataframes, df1 and wildone_df. When I check their type, both are "pandas.core.frame.DataFrame", but when I use the code below to omit their intersection:
o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
I face the following error:
TypeError Traceback (most recent call last)
<ipython-input-36-4e158c0eeb97> in <module>
----> 1 o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
5 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/algorithms.py in factorize_array(values, na_sentinel, size_hint, na_value, mask)
561
562 table = hash_klass(size_hint or len(values))
--> 563 uniques, codes = table.factorize(
564 values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
565 )
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
How can I solve this issue?!
Either use inplace=True or re-assign your dataframe when using pandas.DataFrame.drop_duplicates or any other built-in function that has an inplace parameter. You can't use them both at the same time.
Returns (DataFrame or None)
DataFrame with duplicates removed or None if inplace=True.
Try this:
o = pd.concat([wildone_df, df1]).drop_duplicates() #keep="first" by default
Or try this:
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()].copy()
See this post for more info
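To make the two suggestions above concrete, here is a hedged, self-contained sketch; the two frames are hypothetical stand-ins for wildone_df and df1, which are not shown in the question.
import pandas as pd

# Hypothetical stand-ins for the question's wildone_df and df1.
wildone_df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
df1 = pd.DataFrame({'x': [2, 3, 4], 'y': ['b', 'c', 'd']})

combined = pd.concat([wildone_df, df1])
# De-duplicate column labels first (the second suggestion) ...
combined = combined.loc[:, ~combined.columns.duplicated()].copy()
# ... then drop duplicated rows; keep=False removes every copy of a shared
# row, which is usually what "uncommon rows" means.
print(combined.drop_duplicates(keep=False))
#    x  y
# 0  1  a
# 2  4  d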

Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
              'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because two instants share this wall-clock time: one with DST and one without:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)

# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(seconds=3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
                   'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
I accepted @Corralien's answer, and I also wanted to show what workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
    data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
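For reference, here is a minimal sketch of what those helpers could look like; the free-function form and exact names are assumptions, since the original code calls them as methods on a class.
import numpy as np
import pandas as pd

# Container types treated as "array-like", per the description above.
ARRAY_TYPES = (set, list, tuple, np.ndarray, pd.Series, pd.Index)

def is_array(x):
    """Return True if x is one of the array-like container types."""
    return isinstance(x, ARRAY_TYPES)

def is_scalar(x):
    """Return True if x is not array-like."""
    return not is_array(x)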
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.

Why does Series.min(skipna=True) throw an error caused by an NA value?

I work with timestamps (having mixed DST values). Tried in Pandas 1.0.0:
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
Asking for min() or max() fails:
s.min(), s.max() # same result with s.min(skipna=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 11216, in stat_func
f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
File "C:\Anaconda\lib\site-packages\pandas\core\series.py", line 3892, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 125, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 837, in reduction
result = getattr(values, meth)(axis)
File "C:\Anaconda\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
TypeError: '<=' not supported between instances of 'Timestamp' and 'float'
Workaround:
s.loc[s.notna()].min(), s.loc[s.notna()].max()
(Timestamp('2019-04-13 12:10:20+0200', tz='pytz.FixedOffset(120)'), Timestamp('2020-02-01 11:35:44+0100', tz='pytz.FixedOffset(60)'))
What am I missing here? Is it a bug?
I think the problem here is that pandas stores a Series with different timezones as object dtype, so max and min fail here.
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
print (s)
0 2020-02-01 11:35:44+01:00
1 NaN
2 2019-04-13 12:10:20+02:00
dtype: object
So if you convert to datetimes (losing the mixed timezones) it works well:
print (pd.to_datetime(s, utc=True))
0 2020-02-01 10:35:44+00:00
1 NaT
2 2019-04-13 10:10:20+00:00
dtype: datetime64[ns, UTC]
print (pd.to_datetime(s, utc=True).max())
2020-02-01 10:35:44+00:00
Another possible solution, if you need to keep the different timezones, is:
print (s.dropna().max())
2020-02-01 11:35:44+01:00
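As a quick sanity check of the explanation above (a sketch, assuming pandas 1.x), the mixed-offset Series really does come back as object dtype, while the UTC-converted version skips the missing value as expected:
import numpy as np
import pandas as pd

s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,
     pd.Timestamp('2019-04-13 12:10:20+02')])

print(s.dtype)   # object
u = pd.to_datetime(s, utc=True)
print(u.dtype)   # datetime64[ns, UTC]
print(u.min())   # 2019-04-13 10:10:20+00:00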

Adding Pandas Series values to Pandas DataFrame values [duplicate]

I have a Python Pandas DataFrame:
df = pd.DataFrame(np.random.rand(5,3),columns=list('ABC'))
print df
A B C
0 0.041761178 0.60439116 0.349372206
1 0.820455992 0.245314299 0.635568504
2 0.517482167 0.7257227 0.982969949
3 0.208934899 0.594973111 0.671030326
4 0.651299752 0.617672419 0.948121305
Question:
I would like to add the first column to the whole dataframe. I would like to get this:
A B C
0 0.083522356 0.646152338 0.391133384
1 1.640911984 1.065770291 1.456024496
2 1.034964334 1.243204867 1.500452116
3 0.417869798 0.80390801 0.879965225
4 1.302599505 1.268972171 1.599421057
For the first row:
A: 0.04176 + 0.04176 = 0.08352
B: 0.04176 + 0.60439 = 0.64615
etc
Requirements:
I cannot refer to the first column using its column name.
e.g., df.A is not acceptable; df.iloc[:,0] is acceptable.
Attempt:
I tried this using:
print df.add(df.iloc[:,0], fill_value=0)
but it is not working. It returns the error message:
Traceback (most recent call last):
File "C:test.py", line 20, in <module>
print df.add(df.iloc[:,0], fill_value=0)
File "C:\python27\lib\site-packages\pandas\core\ops.py", line 771, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2939, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2975, in _combine_match_columns
fill_value)
NotImplementedError: fill_value 0 not supported
Is it possible to take the sum of all columns of a DataFrame with the first column?
That's what you need to do:
df.add(df.A, axis=0)
Example:
>>> df = pd.DataFrame(np.random.rand(5,3),columns=['A','B','C'])
>>> col_0 = df.columns.tolist()[0]
>>> print df
A B C
0 0.502962 0.093555 0.854267
1 0.165805 0.263960 0.353374
2 0.386777 0.143079 0.063389
3 0.639575 0.269359 0.681811
4 0.874487 0.992425 0.660696
>>> df = df.add(df[col_0], axis=0)
>>> print df
A B C
0 1.005925 0.596517 1.357229
1 0.331611 0.429766 0.519179
2 0.773553 0.529855 0.450165
3 1.279151 0.908934 1.321386
4 1.748975 1.866912 1.535183
>>>
I would try something like this:
firstcol = df.columns[0]
df2 = df.add(df[firstcol], axis=0)
I used a combination of the above two posts to answer this question.
Since I cannot refer to a specific column by its name, I cannot use df.add(df.A, axis=0). But this is along the correct lines. Since df += df[firstcol] produced a dataframe of NaNs, I could not use this approach, but the way that this solution obtains a list of columns from the dataframe was the trick I needed.
Here is how I did it:
col_0 = df.columns.tolist()[0]
print(df.add(df[col_0], axis=0))
You can use numpy and broadcasting for this:
df = pd.DataFrame(df.values + df['A'].values[:, None],
                  columns=df.columns)
I expect this to be more efficient than series-based methods.
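Putting the answers together with the question's constraint (a small sketch, not taken from the original answers): positional indexing via iloc can be combined with add(..., axis=0), so no column name is ever spelled out.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=list('ABC'))

# Add the first column to every column, referring to it only by position.
result = df.add(df.iloc[:, 0], axis=0)
print(result)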

Error in filtering groupby results in pandas

I am trying to filter groupby results in pandas using the example provided at:
http://pandas.pydata.org/pandas-docs/dev/groupby.html#filtration
but getting the following error (pandas 0.12):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-d0014484ff78> in <module>()
1 grouped = my_df.groupby('userID')
----> 2 grouped.filter(lambda x: len(x) >= 5)
/Users/zz/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in filter(self, func, dropna, *args, **kwargs)
2092 res = path(group)
2093
-> 2094 if res:
2095 indexers.append(self.obj.index.get_indexer(group.index))
2096
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What does it mean and how can it be resolved?
EDIT:
code to replicate the problem in pandas 0.12 stable
dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123') })
dff.groupby('A').filter(lambda x: len(x) > 2)
This was a quasi-bug in 0.12 and will be fixed in 0.13; the res is now protected by a type check:
if isinstance(res, (bool, np.bool_)):
    if res:
        add_indices()
I'm not quite sure how you got this error, however; the docs are actually compiled and run against the actual pandas code. You should ensure you're reading the docs for the correct version (in this case you were linking to dev rather than stable, although the API is largely unchanged).
The standard workaround is to do this using transform, which in this case would be something like:
In [10]: g = dff.groupby('A')

In [11]: dff[g.B.transform(lambda x: len(x) > 2)]
Out[11]:
A B C
0 2 1 1
1 2 2 2
2 2 3 3
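For completeness, a small sketch of the original filter call itself, which should work directly once the type-check fix is in place (assuming pandas 0.13 or later):
import pandas as pd

dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123')})

# Keep only the groups with more than two rows; group '2' has three rows,
# so every row survives the filter.
print(dff.groupby('A').filter(lambda x: len(x) > 2))
#    A  B  C
# 0  2  1  1
# 1  2  2  2
# 2  2  3  3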