Pandas unique does not work on a groupby object when applied to several columns - pandas

Let's say I have a dataframe with 3 columns, one containing the groups, and I would like to collect the unique values of the 2 other columns for each group.
Normally I would use the pandas groupby function and apply the unique method. Well, this does not work if unique is applied to more than 1 column...
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param1': [1, 5, 8, np.nan, 2, 3, np.nan],
    'param2': [5, 6, 9, 10, 11, 12, 1]
})
Apply unique on 1 column:
df.groupby('group')['param1'].unique()
group
1         [1.0, 5.0]
2              [8.0]
3    [nan, 2.0, 3.0]
4              [nan]
Name: param1, dtype: object
Apply unique on 2 columns:
df.groupby('group')[['param1', 'param2']].unique()
I get an AttributeError:
AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'
Instead I would expect this dataframe:
                param1        param2
group
1           [1.0, 5.0]        [5, 6]
2                [8.0]           [9]
3      [nan, 2.0, 3.0]  [10, 11, 12]
4                [nan]           [1]

The reason for the error is that unique works only on a Series, so only SeriesGroupBy.unique is implemented; there is no DataFrameGroupBy.unique.
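You can check this quickly (a minimal sketch using the df defined above; whether the second check is False depends on your pandas version, and newer releases may add the method):
gb = df.groupby('group')
print(hasattr(gb['param1'], 'unique'))              # True: SeriesGroupBy implements unique
print(hasattr(gb[['param1', 'param2']], 'unique'))  # False on the version used here: no DataFrameGroupBy.unique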
What works is applying Series.unique per column via agg and converting the result to a list:
df = df.groupby('group')[['param1', 'param2']].agg(lambda x: list(x.unique()))
print(df)
                param1        param2
group
1           [1.0, 5.0]        [5, 6]
2                [8.0]           [9]
3      [nan, 2.0, 3.0]  [10, 11, 12]
4                [nan]           [1]

Alternatively, starting again from the original df, pass 'unique' to agg for each column:
df = df.groupby('group').agg({'param1': 'unique',
                              'param2': 'unique'})
print(df)
                param1        param2
group
1           [1.0, 5.0]        [5, 6]
2                [8.0]           [9]
3      [nan, 2.0, 3.0]  [10, 11, 12]
4                [nan]           [1]
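For many columns you can build the same agg dict with a comprehension instead of writing it out by hand (a small sketch, again on the original df, assuming every column except the grouper should be aggregated):
cols = df.columns.difference(['group'])
df_u = df.groupby('group').agg({c: 'unique' for c in cols})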

If you have many columns and want the same behavior (i.e. unique values per group), you can stack before the groupby so you don't need to name each column manually:
df.set_index('group').stack(dropna=False).groupby(level=[0, 1]).unique().unstack()
                 param1              param2
group
1            [1.0, 5.0]          [5.0, 6.0]
2                 [8.0]               [9.0]
3       [nan, 2.0, 3.0]  [10.0, 11.0, 12.0]
4                 [nan]               [1.0]

Related

Pandas rolling mean only for non-NaNs

If I have a DataFrame:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
I want to group by columns "A1" and "A2" and then apply a rolling mean on "B" with window 3. If fewer values are available, that is fine; the mean should still be computed. But I do not want any value where there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.groupby(['A1', 'A2'])['B'].rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
The reason is that mean with window=3 and min_periods=1 outputs a scalar computed from the remaining non-NaN values, not NaN. A possible solution is to set NaN manually after rolling, using Series.mask:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print(df)
     B  A    C    D
0  0.0  1  0.0  0.0
1  1.0  1  0.5  0.5
2  2.0  2  1.0  2.0
3  NaN  2  NaN  NaN
4  4.0  2  3.0  3.0
EDIT: For multiple grouping columns, remove the group levels by name with Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1', 'A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1', 'A2']).mask(df['B'].isna())
print(df)
     B  A1  A2    D
0  0.0   1   1  0.0
1  1.0   1   2  1.0
2  2.0   2   3  2.0
3  NaN   2   3  NaN
4  4.0   2   3  3.0
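The same masking can also be written with Series.where, which keeps values where the condition holds and sets NaN elsewhere (an equivalent sketch under the same setup):
roll = df.groupby(['A1', 'A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1', 'A2'])
df['D'] = roll.where(df['B'].notna())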

Convert tabular pandas DataFrame into nested pandas DataFrame

Suppose I have a simple pd.DataFrame like so:
d = {'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]}
df = pd.DataFrame(data=d)
df
   col1  col2  col3
0     1     3     5
1    20    40    50
Is there a way to convert this to a nested pandas DataFrame (df_new), so that when I call df_new.values[0] I get as output:
array([0    1
       1    3
       2    5
       Length: 3, dtype: int], dtype=object)
I still don't think I understand the exact requirement, but here is something:
One way of getting the desired output is this:
>>> pd.Series(df.T[0].values)
0    1
1    3
2    5
dtype: int64
If you want to have these as 2d arrays:
>>> np.array(pd.DataFrame(df.T[0].values).reset_index())
array([[0, 1],
       [1, 3],
       [2, 5]])
>>> np.array(pd.DataFrame(df.T[1].values).reset_index())
array([[ 0, 20],
       [ 1, 40],
       [ 2, 50]])
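If the goal really is a "nested" DataFrame whose cells hold whole rows as Series, one hedged way to build it is a one-column frame of Series objects (the column name 'row' is just an illustration):
df_new = pd.DataFrame({'row': [pd.Series(df.loc[i].values) for i in df.index]})
print(df_new.values[0])  # object array wrapping the first row as a Series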

pandas: most elegant way to pivot table on pattern in name of columns

Given the following DataFrame:
df = pd.DataFrame({
    'x': [0, 1],
    'y': [0, 1],
    'a_idx': [0, 1],
    'a_val': [2, 3],
    'b_idx': [4, 5],
    'b_val': [6, 7],
})
What is the cleanest way to pivot the DataFrame based on the prefix of the idx and val columns if you have an indeterminate number of unique prefixes (a, b, ..., n), so as to obtain the following DataFrame?
pd.DataFrame({
    'x': [0, 1, 0, 1],
    'y': [0, 1, 0, 1],
    'key': ['a', 'a', 'b', 'b'],
    'idx': [0, 1, 4, 5],
    'val': [2, 3, 6, 7]
})
I am not very knowledgeable in pandas, so my easiest solution was to go earlier in the data generation process and generate a subset of the result DataFrame for each prefix in SQL, and then concat the result sets into a final DataFrame. I'm curious however if there is a simple way to do this using the API of pandas.DataFrame. Is there such a thing?
Let's try wide_to_long with extras:
(pd.wide_to_long(df, stubnames=['a', 'b'],
                 i=['x', 'y'],
                 j='key',
                 sep='_',
                 suffix='\\w+')
   .unstack('key')
   .stack(level=0)
   .reset_index())
Or manually with melt:
out = df.melt(['x', 'y'])
out = (out.join(out['variable'].str.split('_', expand=True))
          .rename(columns={0: 'key'})
          .pivot_table(index=['x', 'y', 'key'], columns=[1], values='value')
          .reset_index())
Output (from the wide_to_long approach; the unnamed stub level shows up as level_2):
key  x  y level_2  idx  val
0    0  0       a    0    2
1    0  0       b    4    6
2    1  1       a    1    3
3    1  1       b    5    7
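To match the desired column names exactly, a small cleanup step (a sketch, assuming the wide_to_long result above was assigned to out) renames the unnamed stub level and drops the columns-axis name:
out = out.rename(columns={'level_2': 'key'})
out.columns.name = None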

Numpy get column of two dimensional matrix as array

I have a matrix that looks like this:
>> X
>>
[[5.1 1.4]
 [4.9 1.4]
 [4.7 1.3]
 [4.6 1.5]
 [5.  1.4]]
I want to get its first column as an array of [5.1, 4.9, 4.7, 4.6, 5.]
However, when I try to get it with X[:,0] I get
>> [[5.1]
    [4.9]
    [4.7]
    [4.6]
    [5. ]]
which is something different. How do I get it as a flat array?
You can use a list comprehension for this kind of thing:
import numpy as np
X = np.array([[5.1, 1.4], [4.9, 1.4], [4.7, 1.3], [4.6, 1.5], [5.0, 1.4]])
X_0 = [i for i in X[:, 0]]
print(X_0)
Output:
[5.1, 4.9, 4.7, 4.6, 5.0]
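If X is a regular ndarray as above, ndarray.tolist on the column slice gives the same list without the comprehension:
print(X[:, 0].tolist())  # [5.1, 4.9, 4.7, 4.6, 5.0]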
Almost there! Just reshape your result:
X[:, 0].reshape(1, -1)
Outputs:
[[5.1 4.9 4.7 4.6 5. ]]
Full code:
import numpy as np
X = np.array([[5.1, 1.4], [4.9, 1.4], [4.7, 1.3], [4.6, 1.5], [5.0, 1.4]])
print(X)
print(X[:, 0].reshape(1, -1))
With a regular numpy array:
In [3]: x = np.arange(15).reshape(5, 3)
In [4]: x
Out[4]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
In [5]: x[:,0]
Out[5]: array([ 0,  3,  6,  9, 12])
With np.matrix (its use is discouraged, if not actually deprecated):
In [6]: X = np.matrix(x)
In [7]: X
Out[7]:
matrix([[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14]])
In [8]: print(X)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
In [9]: X[:,0]
Out[9]:
matrix([[ 0],
        [ 3],
        [ 6],
        [ 9],
        [12]])
In [10]: X[:,0].T
Out[10]: matrix([[ 0,  3,  6,  9, 12]])
To get a 1d array, convert to an ndarray and ravel, or in one step use .A1:
In [11]: X[:,0].A1
Out[11]: array([ 0,  3,  6,  9, 12])
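The explicit two-step equivalent of .A1 (a sketch with the same X as above): convert the matrix column to an ndarray, then flatten it:
In [12]: np.asarray(X[:,0]).ravel()
Out[12]: array([ 0,  3,  6,  9, 12])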

Pandas: Replace every value in every column matching a pattern with a value from another column in that row

I have a dataframe with 1000 columns. I want to replace every -9 value in every column with that row's df['a'] value.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
What I want is
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
I have tried
df.replace(-9, df['a'], inplace=True)
And
df.replace(-9, np.nan, inplace=True)
df.fillna(df.a, inplace=True)
But they don't change the df.
My solution right now is to use a for loop:
df.replace(-9, np.nan, inplace=True)
col_list = list(df)
for i in col_list:
    df[i].fillna(df['a'], inplace=True)
This solution works, but it also replaces the pre-existing np.nan values. Any ideas as to how I can replace just the -9 values without first converting them to np.nan? Thanks.
I think you need mask, which replaces values row-wise wherever the condition is True:
df = df.mask(df == -9, df['a'], axis=0)
print(df)
   a    b   c
0  1  6.0   1
1  2  2.0  19
2  3  8.0   3
3  4  NaN   4
4  5  5.0   5
Or:
df = pd.DataFrame(np.where(df == -9, df['a'].values[:, None], df), columns=df.columns)
print(df)
     a    b     c
0  1.0  6.0   1.0
1  2.0  2.0  19.0
2  3.0  8.0   3.0
3  4.0  NaN   4.0
4  5.0  5.0   5.0
You can also do something like this:
import numpy as np
import pandas as pd

# target result, for reference
df_tar = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
df.loc[df['b'] == -9, 'b'] = df.loc[df['b'] == -9, 'a']
df.loc[df['c'] == -9, 'c'] = df.loc[df['c'] == -9, 'a']
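For the full 1000-column case, the same idea generalizes with a loop over every column except 'a' (a hedged sketch of the approach above, not a vectorized solution):
for col in df.columns.drop('a'):
    sel = df[col] == -9
    df.loc[sel, col] = df.loc[sel, 'a']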