Error in filtering groupby results in pandas

I am trying to filter groupby results in pandas using the example provided at:
http://pandas.pydata.org/pandas-docs/dev/groupby.html#filtration
but getting the following error (pandas 0.12):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-d0014484ff78> in <module>()
1 grouped = my_df.groupby('userID')
----> 2 grouped.filter(lambda x: len(x) >= 5)
/Users/zz/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in filter(self, func, dropna, *args, **kwargs)
2092 res = path(group)
2093
-> 2094 if res:
2095 indexers.append(self.obj.index.get_indexer(group.index))
2096
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What does it mean and how can it be resolved?
EDIT:
code to replicate the problem in pandas 0.12 stable
dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123') })
dff.groupby('A').filter(lambda x: len(x) > 2)

This was a quasi-bug in 0.12 that will be fixed in 0.13; res is now protected by a type check:
if isinstance(res, (bool, np.bool_)):
    if res:
        add_indices()
I'm not quite sure how you got this error, however, since the docs are compiled and run against actual pandas. You should make sure you're reading the docs for the correct version (in this case you were linking to dev rather than stable, although the API is largely unchanged).
The standard workaround is to do this using transform; with g = dff.groupby('A'), it would be something like:
In [11]: dff[g.B.transform(lambda x: len(x) > 2)]
Out[11]:
A B C
0 2 1 1
1 2 2 2
2 2 3 3
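For what it's worth, on pandas 0.13 or later the documented filter call should work directly. A minimal sketch using the replication case above, assuming a version with the fix:
import pandas as pd

dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123')})
# keeps all rows of every group that has more than two members
print(dff.groupby('A').filter(lambda x: len(x) > 2))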

Related

Dropping the same rows in two pandas dataframes in Python

I want to get the uncommon rows of two pandas dataframes. The two dataframes are df1 and wildone_df. When I check their type, both are "pandas.core.frame.DataFrame", but when I use the code below to omit their intersection:
o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
I face the following error:
TypeError Traceback (most recent call last)
<ipython-input-36-4e158c0eeb97> in <module>
----> 1 o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
5 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/algorithms.py in factorize_array(values, na_sentinel, size_hint, na_value, mask)
561
562 table = hash_klass(size_hint or len(values))
--> 563 uniques, codes = table.factorize(
564 values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
565 )
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
How can I solve this issue?!
Omitting the intersection of two dataframes
Either use inplace=True or re-assign your dataframe when using pandas.DataFrame.drop_duplicates or any other built-in function that has an inplace parameter. You can't use them both at the same time.
Returns (DataFrame or None)
DataFrame with duplicates removed or None if inplace=True.
Try this:
o = pd.concat([wildone_df, df1]).drop_duplicates() #keep="first" by default
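If you prefer the in-place form described above instead of re-assigning, a small sketch of that alternative:
o = pd.concat([wildone_df, df1])
o.drop_duplicates(inplace=True)  # modifies o in place and returns None, so don't assign the result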
Or try this, which removes duplicated columns (replace merged_df with your own dataframe):
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()].copy()
See this post for more info

Sum n values in numpy array based on pandas index

I am trying to calculate the cumulative sum of the first n values in a numpy array, where n is a value in each row of a pandas dataframe. I have set up a little example problem with a single column and it works fine, but it does not work when I have more than one column.
Example problem that fails:
a=np.ones((10,))
df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
df['nj']=df['nj'].astype(int)
df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
df
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_23612/1905114001.py in <module>
2 df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
3 df['nj']=df['nj'].astype(int)
----> 4 df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
5 df
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
~\AppData\Local\Temp/ipykernel_23612/1905114001.py in <lambda>(x)
2 df=pd.DataFrame([[4.,2],[6.,1],[5.,2.]],columns=['nj','ni'])
3 df['nj']=df['nj'].astype(int)
----> 4 df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
5 df
TypeError: slice indices must be integers or None or have an __index__ method
Example problem that works:
a=np.ones((10,))
df=pd.DataFrame([4.,6.,5.],columns=['nj'])
df['nj']=df['nj'].astype(int)
df['nsum']=df.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
df
nj nsum
0 4 4.0
1 6 6.0
2 5 5.0
In both cases:
print(a.shape)
print(a.dtype)
print(type(df))
print(df['nj'].dtype)
(10,)
float64
<class 'pandas.core.frame.DataFrame'>
int32
A workaround that is not very satisfying, especially because I would eventually like to use multiple columns in the lambda function, is:
tmp=pd.DataFrame(df['nj'])
df['nsum'] = tmp.apply(lambda x: np.sum(a[:x['nj']]),axis=1)
Any clarification on what I have missed here, or better workarounds?
IIUC, you can do it in numpy with numpy.take and numpy.cumsum:
np.take(np.cumsum(a, axis=0), df['nj'], axis=0)
A small adjustment, passing just the column of interest (df['nj']) to the lambda, solved my initial issue. (With axis=1 and mixed column dtypes, apply hands each row to the lambda as a float64 Series, so x['nj'] arrives as a float and cannot be used as a slice index; working on the single integer column avoids the upcast.)
df['nsum'] = df['nj'].apply(lambda x: np.sum(a[:x]))
Using mozway's suggestion of np.take and np.cumsum along with a less ambiguous(?) example, the following will also work (but note the x-1 since the initial problem states "the cumulative sum of the first n values" rather than the cumulative sum to index n):
a=np.array([3,2,4,5,1,2,3])
df=pd.DataFrame([[4.,2],[6.,1],[5.,3.]],columns=['nj','ni'])
df['nj']=df['nj'].astype(int)
df[['nsumj']]=df['nj'].apply(lambda x: np.take(np.cumsum(a),x-1))
#equivalent?
# df[['nsumj']]=df['nj'].apply(lambda x: np.cumsum(a)[x-1])
print(a)
print(df)
Output:
[3 2 4 5 1 2 3]
nj ni nsumj
0 4 2.0 14
1 6 1.0 17
2 5 3.0 15
From the example here it seems the key to using multiple columns in the function (the next issue I was running into and hinted at) is to unpack the columns, so I will put this here in case it helps anyone:
df['nprod']=df[['ni','nj']].apply(lambda x: np.multiply(*x),axis=1)

Adding Pandas Series values to Pandas DataFrame values [duplicate]

I have a Python Pandas DataFrame:
df = pd.DataFrame(np.random.rand(5,3),columns=list('ABC'))
print df
A B C
0 0.041761178 0.60439116 0.349372206
1 0.820455992 0.245314299 0.635568504
2 0.517482167 0.7257227 0.982969949
3 0.208934899 0.594973111 0.671030326
4 0.651299752 0.617672419 0.948121305
Question:
I would like to add the first column to the whole dataframe. I would like to get this:
A B C
0 0.083522356 0.646152338 0.391133384
1 1.640911984 1.065770291 1.456024496
2 1.034964334 1.243204867 1.500452116
3 0.417869798 0.80390801 0.879965225
4 1.302599505 1.268972171 1.599421057
For the first row:
A: 0.04176 + 0.04176 = 0.08352
B: 0.04176 + 0.60439 = 0.64615
etc
Requirements:
I cannot refer to the first column using its column name.
e.g.: df.A is not acceptable; df.iloc[:,0] is acceptable.
Attempt:
I tried this using:
print df.add(df.iloc[:,0], fill_value=0)
but it is not working. It returns the error message:
Traceback (most recent call last):
File "C:test.py", line 20, in <module>
print df.add(df.iloc[:,0], fill_value=0)
File "C:\python27\lib\site-packages\pandas\core\ops.py", line 771, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2939, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2975, in _combine_match_columns
fill_value)
NotImplementedError: fill_value 0 not supported
Is it possible to add the first column to all columns of a DataFrame?
That's what you need to do:
df.add(df.A, axis=0)
Example:
>>> df = pd.DataFrame(np.random.rand(5,3),columns=['A','B','C'])
>>> col_0 = df.columns.tolist()[0]
>>> print df
A B C
0 0.502962 0.093555 0.854267
1 0.165805 0.263960 0.353374
2 0.386777 0.143079 0.063389
3 0.639575 0.269359 0.681811
4 0.874487 0.992425 0.660696
>>> df = df.add(df[col_0], axis=0)
>>> print df
A B C
0 1.005925 0.596517 1.357229
1 0.331611 0.429766 0.519179
2 0.773553 0.529855 0.450165
3 1.279151 0.908934 1.321386
4 1.748975 1.866912 1.535183
>>>
I would try something like this:
firstcol = df.columns[0]
df2 = df.add(df[firstcol], axis=0)
I used a combination of the above two posts to answer this question.
Since I cannot refer to a specific column by its name, I cannot use df.add(df.A, axis=0). But this is along the correct lines. Since df += df[firstcol] produced a dataframe of NaNs, I could not use this approach, but the way that this solution obtains a list of columns from the dataframe was the trick I needed.
Here is how I did it:
col_0 = df.columns.tolist()[0]
print(df.add(df[col_0], axis=0))
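For reference, a minimal sketch (made-up data, not the original dataframes) of why df += df[firstcol] produces NaNs: without axis=0, pandas aligns the Series' row labels against the DataFrame's column labels, so nothing matches.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=list('ABC'))
firstcol = df.columns[0]
bad = df + df[firstcol]              # Series index 0..4 vs columns A, B, C -> all NaN
good = df.add(df[firstcol], axis=0)  # align the Series with the rows -> intended result
print(bad)
print(good)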
You can use numpy and broadcasting for this:
df = pd.DataFrame(df.values + df['A'].values[:, None], columns=df.columns)
I expect this to be more efficient than series-based methods.

How to call unique() on dask DataFrame

How do I call unique on a dask DataFrame ?
I get the following error if I try to call it the same way as for a regular pandas dataframe:
In [27]: len(np.unique(ddf[['col1','col2']].values))
AttributeError Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))
/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
1924 return self._constructor_sliced(merge(self.dask, dsk), name,
1925 meta, self.divisions)
-> 1926 raise AttributeError("'DataFrame' object has no attribute %r" % key)
1927
1928 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute 'values'
For both Pandas and Dask.dataframe you should use the drop_duplicates method
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})
In [3]: df.drop_duplicates()
Out[3]:
x y
0 1 10
2 2 20
In [4]: import dask.dataframe as dd
In [5]: ddf = dd.from_pandas(df, npartitions=2)
In [6]: ddf.drop_duplicates().compute()
Out[6]:
x y
0 1 10
2 2 20
This works with dask==2022.11.1, calling unique on a single column (here a column named symbol):
ddf.symbol.unique().compute()
I'm not too familiar with Dask, but they appear to have a subset of Pandas functionality, and that subset doesn't seem to include the DataFrame.values attribute.
http://dask.pydata.org/en/latest/dataframe-api.html
You could try this:
sum(ddf[['col1','col2']].apply(pd.Series.nunique, axis=0))
I don't know how it fares performance-wise, but it should give you the value you want (the number of distinct values in each of col1 and col2 of the ddf DataFrame, summed).
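If you instead want the number of distinct values across col1 and col2 combined (which is what the original np.unique call computed), a hedged sketch that stays in dask might look like this, assuming the two columns have compatible dtypes:
import dask.dataframe as dd

combined = dd.concat([ddf['col1'], ddf['col2']])  # stack both columns into one Series
print(combined.nunique().compute())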

Mapping a Series with a NumPy array -- dimensionality issue?

When I'm using 2d arrays as maps, everything works fine. When I start using 1d arrays, this error occurs: IndexError: unsupported iterator index. This is the error I'm talking about:
In [426]: y = Series( [0,1,0,1] )
In [427]: arr1 = np.array( [10,20] )
In [428]: arr2 = np.array( [[10,20],[30,40]] )
In [429]: arr2[ y, y ]
Out[429]: array([10, 40, 10, 40])
In [430]: arr1[ y ]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-430-25b98edce1f3> in <module>()
----> 1 arr1[ y ]
IndexError: unsupported iterator index
I'm using the latest Anaconda distribution with NumPy 1.8.1. Maybe this is related to a NumPy bug discussed here?
Could anybody tell me what is causing this error?
You need to either convert the Series to an array, or vice versa. Indexers must be 1-d for a 1-d object.
In [11]: arr1[y.values]
Out[11]: array([10, 20, 10, 20])
In [12]: Series(arr1)[y]
Out[12]:
0 10
1 20
0 10
1 20
dtype: int64