groupby transform with if condition in pandas

groupby transform with if condition in pandas - pandas

I have a data frame as given below
df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'c', 'c'] , 'val' : [10, np.nan, 9 , 10, 11, 13]})
df
key val
0 a 10.0
1 a NaN
2 a 9.0
3 b 10.0
4 c 11.0
5 c 13.0
I want to perform groupby and transform that new column is each value divided by group mean , which I can do as below
df['new'] = df.groupby('key')['val'].transform(lambda g : g/g.mean())
df.new
0 1.052632
1 NaN
2 0.947368
3 1.000000
4 0.916667
5 1.083333
Name: new, dtype: float64
Now I have condition that if val is np.nan then new column value will be np.inf which should result as below
0 1.052632
1 np.inf
2 0.947368
3 1.000000
4 0.916667
5 1.083333
Name: new, dtype: float64
In other words how can I have this check if a val is np.nan with groupby and transform.
Thanks in advance

Add Series.replace:
df['new'] = (df.groupby('key')['val'].transform(lambda g : g/g.mean())
.replace(np.nan, np.inf))
print (df)
key val new
0 a 10.0 1.052632
1 a NaN inf
2 a 9.0 0.947368
3 b 10.0 1.000000
4 c 11.0 0.916667
5 c 13.0 1.083333
Or numpy.where:
df['new'] = np.where(df.val.isna(),
np.inf, df.groupby('key')['val'].transform(lambda g : g/g.mean()))
print (df)
key val new
0 a 10.0 1.052632
1 a NaN inf
2 a 9.0 0.947368
3 b 10.0 1.000000
4 c 11.0 0.916667
5 c 13.0 1.083333

Related

In pandas, replace table column with Series while joining indexes

I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series will have different indexes and I need to add these varying indexes to the table as necessary, like doing a join/merge operation.
For example, this code generates a table and 5 series where each series only has a subset of the indexes.
import random
cols=['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
series.append(
pd.Series(
np.random.randint(0, 3, 2)*10,
index=pd.Index(random.sample(range(3), 2))
)
)
series
Output:
[1 10
2 0
dtype: int32,
2 0
0 20
dtype: int32,
2 20
1 0
dtype: int32,
2 0
0 10
dtype: int32,
1 20
2 10
dtype: int32]
But when I try to replace columns of the table with the series, a simple assignment doesn't work
for i in range(5):
col = cols[i]
table[col] = series[i]
table
Output:
a b c d e f g
1 10 NaN 0 NaN 20 NaN NaN
2 0 0 20 0 10 NaN NaN
because the assignment won't add any more indexes after the first series is assigned
Other things I've tried:
combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels) and I can't just drop the duplicates because other columns may already have values in those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.
Of course, I can always do something like this:
table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')
which gives the correct result:
a b c d e f g
0 NaN 20.0 NaN 10.0 NaN NaN NaN
1 10.0 NaN 0.0 NaN 20.0 NaN NaN
2 0.0 0.0 20.0 0.0 10.0 NaN NaN
But it's doing it in quite a roundabout way, and still isn't robust to column name collisions, so you'd have to add a handful more code to fiddle with column names and protect against that. This produces quite verbose and ugly code for what is a conceptually a very simple operation, that I believe there must be a better way of doing it.

pd.concat should work along the column axis:
out = pd.concat(series, axis=1)
print(out)
# Output
0 1 2 3 4
0 10.0 0.0 0.0 NaN 10.0
1 NaN 10.0 NaN 0.0 20.0
2 0.0 NaN 0.0 0.0 NaN

You could try constructing the dataframe using a dict comprehension like this:
series:
[0 10
1 0
dtype: int64,
0 0
1 0
dtype: int64,
2 20
0 0
dtype: int64,
0 20
2 0
dtype: int64,
0 0
1 0
dtype: int64]
code:
table = pd.DataFrame({
col: series[i]
for i, col in enumerate(cols)
if i < len(series)
})
table
output:
a b c d e
0 10.0 0.0 0.0 20.0 0.0
1 0.0 0.0 NaN NaN 0.0
2 NaN NaN 20.0 0.0 NaN
If you really need the nan columns at the end you could do:
table = pd.DataFrame({
col: series[i] if i < len(series) else np.nan
for i, col in enumerate(cols)
})
Output:
a b c d e f g
0 10.0 0.0 0.0 20.0 0.0 NaN NaN
1 0.0 0.0 NaN NaN 0.0 NaN NaN
2 NaN NaN 20.0 0.0 NaN NaN NaN

Setting multiple column at once give error "Not in index error!"

import pandas as pd
df = pd.DataFrame(
[
[5, 2],
[3, 5],
[5, 5],
[8, 9],
[90, 55]
],
columns = ['max_speed', 'shield']
)
df.loc[(df.max_speed > df.shield), ['stat', 'delta']] \
= 'overspeed', df['max_speed'] - df['shield']
I am setting multiple column using .loc as above, for some cases I get Not in index error!. Am I doing something wrong above?

Create list of tuples by same size like number of Trues with filtered Series after subtract with repeat scalar overspeed:
m = (df.max_speed > df.shield)
s = df['max_speed'] - df['shield']
df.loc[m, ['stat', 'delta']] = list(zip(['overspeed'] * m.sum(), s[m]))
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
Another idea with helper DataFrame:
df.loc[m, ['stat', 'delta']] = pd.DataFrame({'stat':'overspeed', 'delta':s})[m]
Details:
print(list(zip(['overspeed'] * m.sum(), s[m])))
[('overspeed', 3), ('overspeed', 35)]
print (pd.DataFrame({'stat':'overspeed', 'delta':s})[m])
stat delta
0 overspeed 3
4 overspeed 35
Simpliest is assign separately:
df.loc[m, 'stat'] = 'overspeed'
df.loc[m, 'delta'] = df['max_speed'] - df['shield']
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0

Replace all values from one pandas dataframe to another without extra columns

These are my two dataframes:
df1 = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],'num_legs': [2, 4, 8, 0],'num_wings': [2, 0, 0, 0],'num_specimen_seen': [10, 2, 1, 8]})
df2 = pd.DataFrame({'animal': ['falcon', 'dog'],'num_legs': [4, 2],'num_wings': [0, 2],'num_specimen_seen': [2, 10]})
When I use left join , this is the result:
merge = df1.merge(df2, on='animal', how='left')
Output:
animal num_legs_x num_wings_x num_specimen_seen_x num_legs_y num_wings_y num_specimen_seen_y
falcon 2 2 10 4 0 2
dog 4 0 2 2 2 10
spider 8 0 1 NaN NaN NaN
fish 0 0 8 NaN NaN NaN
I am looking for an output like this , where row 1 and 2 values are replaced by values coming from df2 :
animal num_legs num_wings num_specimen_seen
falcon 4 0 2
dog 2 2 10
spider 8 0 1
fish 0 0 8
I attempted using np.where but couldnt write something correctly
df = np.where(df1.animal == df2.animal, ?, ?)
Maybe left join isnt correct way to achieve what I want. I am new to pandas , any help would be appreciated.

Let us do update
df1 = df1.set_index('animal')
df1.update(df2.set_index('animal'))
df1 = df1.reset_index()
df1
animal num_legs num_wings num_specimen_seen
0 falcon 4.0 0.0 2.0
1 dog 2.0 2.0 10.0
2 spider 8.0 0.0 1.0
3 fish 0.0 0.0 8.0

How to do pd.fillna() with condition

Am trying to do a fillna with if condition
Fimport pandas as pd
df = pd.DataFrame(data={'a':[1,None,3,None],'b':[4,None,None,None]})
print df
df[b].fillna(value=0, inplace=True) only if df[a] is None
print df
a b
0 1 4
1 NaN NaN
2 3 NaN
3 NaN NaN
##What i want to acheive
a b
0 1 4
1 NaN 0
2 3 NaN
3 NaN 0
Please help

You can chain both conditions for test mising values with & for bitwise AND and then replace values to 0:
df.loc[df.a.isna() & df.b.isna(), 'b'] = 0
#alternative
df.loc[df[['a', 'b']].isna().all(axis=1), 'b'] = 0
print (df)
a b
0 1.0 4.0
1 NaN 0.0
2 3.0 NaN
3 NaN 0.0
Or you can use fillna with one condition:
df.loc[df.a.isna(), 'b'] = df.b.fillna(0)

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc():
>>> df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
>>> df
a b
1 10 100
2 20 200
>>> df.loc[3, 'a'] = 30
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append() but am looking for a way that does NOT require constructing a new row into a Series before having it appended to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Where last 2 rows are newly added.

Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6))) # easier extensible than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN

Example data
>>> data = pd.DataFrame({
'a': [10, 6, -3, -2, 4, 12, 3, 3],
'b': [6, -3, 6, 12, 8, 11, -5, -5],
'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1 Note that range can be altered to whatever it is that you desire.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
a b id
0 30.0 6.0 1.0
1 30.0 -3.0 1.0
2 30.0 6.0 1.0
3 30.0 12.0 1.0
4 30.0 8.0 6.0
5 30.0 11.0 2.0
6 30.0 -5.0 2.0
7 30.0 -5.0 4.0
8 30.0 NaN NaN
9 30.0 NaN NaN
Case 2 Here we are adding a new column to a data frame that had 8 rows to begin with. As we extend our new column c to be of length 10 the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
a b id c
0 10.0 6.0 1.0 30.0
1 6.0 -3.0 1.0 30.0
2 -3.0 6.0 1.0 30.0
3 -2.0 12.0 1.0 30.0
4 4.0 8.0 6.0 30.0
5 12.0 11.0 2.0 30.0
6 3.0 -5.0 2.0 30.0
7 3.0 -5.0 4.0 30.0
8 NaN NaN NaN 30.0
9 NaN NaN NaN 30.0

Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
This also works now (useful when performance on aggregating dataframes matters):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 41.0 51.0
5 61.0 71.0

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

groupby transform with if condition in pandas - pandas

Related

In pandas, replace table column with Series while joining indexes

Setting multiple column at once give error "Not in index error!"

Replace all values from one pandas dataframe to another without extra columns

How to do pd.fillna() with condition

Pandas dataframe creating multiple rows at once via .loc

Categories

Resources