Pandas Groupby and Apply - pandas

I am performing a groupby and apply over a dataframe and getting some strange results. I am using pandas 1.3.1.
Here is the code:
import pandas as pd

ddf = pd.DataFrame({
    "id": [1, 1, 1, 1, 2]
})

def do_something(df):
    return "x"

ddf["title"] = ddf.groupby("id").apply(do_something)
ddf
I would expect every row in the title column to be assigned the value "x", but instead I get this:
   id title
0   1  NaN
1   1    x
2   1    x
3   1  NaN
4   2  NaN
Is this expected?

The result is not strange, it is the expected behavior: apply returns one value per group, here for the groups 1 and 2, and those group labels become the index of the aggregation:
>>> list(ddf.groupby("id"))
[(1,        # the group name (the future index of the grouped df)
     id     # the subset dataframe of group 1
  0   1
  1   1
  2   1
  3   1),
 (2,        # the group name (the future index of the grouped df)
     id     # the subset dataframe of group 2
  4   2)]
Why do you get any result at all? Because the group labels happen to match some of your dataframe's index values:
>>> ddf.groupby("id").apply(do_something)
id
1 x
2 x
dtype: object
Now change the id like this:
ddf['id'] += 10
# id
# 0 11
# 1 11
# 2 11
# 3 11
# 4 12
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 0 11 NaN
# 1 11 NaN
# 2 11 NaN
# 3 11 NaN
# 4 12 NaN
Or change the index:
ddf.index += 10
# id
# 10 1
# 11 1
# 12 1
# 13 1
# 14 2
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 10 1 NaN
# 11 1 NaN
# 12 1 NaN
# 13 1 NaN
# 14 2 NaN
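If the goal is to have "x" on every row, one option (a minimal sketch, not part of the question's code) is to map the per-group result back onto the id column, so that alignment happens on the id values rather than on the row index:
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

def do_something(df):
    return "x"

per_group = ddf.groupby("id").apply(do_something)  # Series indexed by the group labels 1 and 2
ddf["title"] = ddf["id"].map(per_group)            # look up each row's id in that Series
print(ddf)
#    id title
# 0   1     x
# 1   1     x
# 2   1     x
# 3   1     x
# 4   2     x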

Yes, it is expected.
First of all, the apply(do_something) part works like a charm; it is the groupby right before it that causes the problem.
A groupby returns a groupby object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function to use it (mean, max or sum). If you run one of them as an example like this:
df = ddf.groupby("id")
df.mean()
it leads to this result:
Empty DataFrame
Columns: []
Index: [1, 2]
After that, do_something is applied to the index labels 1 and 2 only, and the result is then aligned back into your original df. This is why only rows 1 and 2 get an x.
For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway.
And have a deeper look at the groupby object.
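For example, here is a quick way to inspect it (a small sketch using the question's data):
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})
gb = ddf.groupby("id")

print(type(gb))   # a DataFrameGroupBy object, not a regular DataFrame
print(gb.groups)  # maps each group label (1 and 2) to the row labels belonging to that group
print(gb.size())  # number of rows per group, indexed by the group labels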

If you need the new column from an aggregate-style function, use GroupBy.transform; it is necessary to specify the column after the groupby that is used for processing, here id:
ddf["title"] = ddf.groupby("id")['id'].transform(do_something)
Or assign new column in function:
def do_something(x):
    x['title'] = 'x'
    return x

ddf = ddf.groupby("id").apply(do_something)
The explanation of why it is not working is in the other answers.
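For reference, a minimal runnable version of the transform approach with the question's data (a sketch) produces the expected column:
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

def do_something(s):
    return "x"

# transform broadcasts the single value returned per group back to every row of that group
ddf["title"] = ddf.groupby("id")["id"].transform(do_something)
print(ddf)
#    id title
# 0   1     x
# 1   1     x
# 2   1     x
# 3   1     x
# 4   2     x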

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values - more than 5 standard deviations away.
I want to replace, per column, each value that is more than 5 std away with the maximum of the other values.
For example,
df =    A    B
        1    2
        1    6
        2    8
        1  115
      191    1
Will become:
df =    A  B
        1  2
        1  6
        2  8
        1  8
        2  1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with max per column and re-sort the frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q=100; then you can do
q = 100
df.loc[df['A'] > q, 'A'] = max(df.loc[df['A'] < q, 'A'])
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B.
Calculate a column-wise z-score (if you deem something an outlier when it lies outside a given number of standard deviations of the column), and then compute a boolean mask of values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
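For instance, "something" could be each column's maximum among the values that were not flagged, which matches the replacement the question asks for. A minimal sketch (note: with only five rows no z-score can exceed 5, so a lower cut-off is used here purely so the toy data flags 115 and 191):
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 191], "B": [2, 6, 8, 115, 1]})

def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 1.5            # illustrative threshold for this tiny sample

masked = df.mask(outlier_mask)          # flagged values become NaN
df_clean = masked.fillna(masked.max())  # fill each column's NaN with that column's remaining max
print(df_clean)
#      A    B
# 0  1.0  2.0
# 1  1.0  6.0
# 2  2.0  8.0
# 3  1.0  8.0
# 4  2.0  1.0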

Pandas Series Chaining: Filter on boolean value

How can I filter a pandas series based on boolean values?
Currently I have:
s.apply(lambda x: myfunc(x, myparam)).where(lambda x: x).dropna()
What I want is to keep only the entries where myfunc returns True. myfunc is a complex function using 3rd-party code and operates only on individual elements.
How can I make this more understandable?
You can understand it with the sample code below:
import pandas as pd

data = pd.Series([1, 12, 15, 3, 5, 3, 6, 9, 10, 5])
print(data)

# filter data based on a condition: keep only rows which are multiples of 3
filter_cond = data.apply(lambda x: x % 3 == 0)
print(filter_cond)

filter_data = data[filter_cond]
print(filter_data)
This code filters the series data to keep only the multiples of 3. To do that, we just define the filter condition and apply it to the series data. You can verify it with the output below.
The sample series data:
0 1
1 12
2 15
3 3
4 5
5 3
6 6
7 9
8 10
9 5
dtype: int64
The conditional filter output:
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
The final required filter data:
1 12
2 15
3 3
5 3
6 6
7 9
dtype: int64
Hopefully this helps you understand how we can apply conditional filters to series data.
Use boolean indexing:
mask = s.apply(lambda x: myfunc(x, myparam))
print (s[mask])
If the index values are also changed in the mask, filter by a 1d array:
#pandas 0.24+
print (s[mask.to_numpy()])
#pandas below
print (s[mask.values])
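To see why the raw array matters, here is a small sketch with a mask whose index no longer matches the Series:
import pandas as pd

s = pd.Series([1, 2, 3])
mask = pd.Series([True, False, True], index=[10, 11, 12])  # index differs from s.index

# s[mask] raises an IndexingError, because the two indexes cannot be aligned
print(s[mask.to_numpy()])  # positional boolean indexing ignores the mask's index
# 0    1
# 2    3
# dtype: int64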
EDIT:
s = pd.Series([1,2,3])
def myfunc(x, n):
return x > n
myparam = 1
a = s[s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64
Solution with callable is possible, but a bit overcomplicated in my opinion:
a = s.loc[lambda s: s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64

Return Value Based on Conditional Lookup on Different Pandas DataFrame

Objective: to look up values from one dataframe (conditionally) and place the results in a different dataframe with a new column name.
df_1 = pd.DataFrame({'user_id': [1, 2, 1, 4, 5],
                     'name': ['abc', 'def', 'ghi', 'abc', 'abc'],
                     'rank': [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})
df_1 # original data
df_2 # new dataframe
In this general example, I am trying to create a new column named "priority_rank" and only fill "priority_rank" based on the conditional lookup against df_1, namely the following:
user_id must match between df_1 and df_2
I am only interested in df_1['name'] == 'abc'; all else should be blank
df_2 should end up looking like this:
user_id  priority_rank
      1              6
      2
      3
      4              9
      5             10
One way to do this:
In []:
df_2['priority_rank'] = np.where((df_1.name=='abc') & (df_1.user_id==df_2.user_id), df_1['rank'], '')
df_2
Out[]:
user_id priority_rank
0 1 6
1 2
2 3
3 4 9
4 5 10
Note: In your example df_1.name=='abc' is a sufficient condition because all values for user_id are identical when df_1.name=='abc'. I'm assuming this is not always going to be the case.
Using merge
df_2.merge(df_1.loc[df_1.name=='abc', :], how='left').drop('name', axis=1)
Out[932]:
user_id rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
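If the new column should be called priority_rank as requested, a rename can be chained onto the same merge (a sketch, reusing the question's data):
import pandas as pd

df_1 = pd.DataFrame({'user_id': [1, 2, 1, 4, 5],
                     'name': ['abc', 'def', 'ghi', 'abc', 'abc'],
                     'rank': [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})

out = (df_2.merge(df_1.loc[df_1.name == 'abc', :], how='left')
           .drop('name', axis=1)
           .rename(columns={'rank': 'priority_rank'}))
print(out)
#    user_id  priority_rank
# 0        1            6.0
# 1        2            NaN
# 2        3            NaN
# 3        4            9.0
# 4        5           10.0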
You're looking for map:
df_2.assign(priority_rank=df_2['user_id'].map(
    df_1.query("name == 'abc'").set_index('user_id')['rank']))
user_id priority_rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0

sorting within the keys of group by

I have a groupby table as follows. I want to sort by index within the keys ['CPUCore', 'Offline_RetetionAge'] (I need to keep the structure of ['CPUCore', 'Offline_RetetionAge']). How should I do this?
I think the problem is that the dtype of your second level is object, which means it is a string, so if you use sort_index it sorts alphanumerically:
df = pd.DataFrame({'CPUCore': [2, 2, 2, 3, 3],
                   'Offline_RetetionAge': ['100', '1', '12', '120', '15'],
                   'index': [11, 16, 5, 4, 3]}).set_index(['CPUCore', 'Offline_RetetionAge'])
print (df)
index
CPUCore Offline_RetetionAge
2 100 11
1 16
12 5
3 120 4
15 3
print (df.index.get_level_values('Offline_RetetionAge').dtype)
object
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
100 11
12 5
3 120 4
15 3
# change multiindex - cast level Offline_RetetionAge to int
new_index = list(zip(df.index.get_level_values('CPUCore'),
                     df.index.get_level_values('Offline_RetetionAge').astype(int)))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
12 5
100 11
3 15 3
120 4
EDIT by comment:
print (df.reset_index()
.sort_values(['CPUCore','index'])
.set_index(['CPUCore','Offline_RetetionAge']))
index
CPUCore Offline_RetetionAge
2 12 5
100 11
1 16
3 15 3
120 4
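An equivalent way to cast only that level, without rebuilding the tuples, is MultiIndex.set_levels (a sketch, not part of the original answer, using the same data):
import pandas as pd

df = pd.DataFrame({'CPUCore': [2, 2, 2, 3, 3],
                   'Offline_RetetionAge': ['100', '1', '12', '120', '15'],
                   'index': [11, 16, 5, 4, 3]}).set_index(['CPUCore', 'Offline_RetetionAge'])

# replace the string level values with their int equivalents; row order is unchanged
df.index = df.index.set_levels(df.index.levels[1].astype(int),
                               level='Offline_RetetionAge')
print(df.sort_index())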
I think what you mean is this:
import pandas as pd
from pandas import Series, DataFrame
# create what I believe you tried to ask
df = DataFrame(
    [[11, 'reproducible'], [16, 'example'], [5, 'a'], [4, 'create'], [9, '!']])
df.columns = ['index', 'bla']
df.index = pd.MultiIndex.from_arrays([[2]*4 + [3], [10, 100, 1000, 11, 512]],
                                     names=['CPUCore', 'Offline_RetentionAge'])
# sort by values and afterwards by index, where sort_remaining=False preserves
# the order of the index
df = df.sort_values('index').sort_index(level=0, sort_remaining=False)
print(df)
The sort_values call sorts the rows by the 'index' column, and the sort_index call restores the grouping by the multiindex without changing the order of 'index' for rows with the same CPUCore.
I don't know what a "group by table" is supposed to be. If you have a pd.GroupBy object, you won't be able to use sort_values() like that.
You might have to rethink what you group by or use functools.partial and DataFrame.apply
Output:
index bla
CPUCore Offline_RetentionAge
2 11 4 create
1000 5 a
10 11 reproducible
100 16 example
3 512 9 !

Pandas dropping columns by index drops all columns with same name

Consider the following dataframe, which has columns with the same name (apparently this does happen; I currently have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index, only the column with the respective index would be gone, but apparently this is not the case.
>>> df.drop(df.columns[-1],1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I chose misleading values for the first column, fixed now
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
a
0 0
1 1
2 2
3 3
4 4
So this indexes all rows and then uses the column mask generated from duplicated, inverting the mask using ~.
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.1) you should do any of the following:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to @DavideFiocco for alerting me.
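If the intent is really "drop the column at this position, whatever its name", purely positional selection with iloc also works (a small sketch):
import pandas as pd

df = pd.DataFrame({"a": range(10, 15), "b": range(5, 10)})
df.columns = ["a", "a"]     # two columns that share the same name

df = df.iloc[:, :-1]        # keep everything except the last column, selected by position
print(df.columns)           # Index(['a'], dtype='object')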