How to modify dataframe based on column values - pandas

I want to add relationships to the 'relations' column based on rel_list. Specifically, for each tuple, e.g. ('a', 'b'), I want to replace the empty 'relations' value of the first row with 'b', but without duplicates, meaning that for the second row I don't replace '' with 'a', since that pair is considered a duplicate. The following code doesn't work correctly:
import pandas as pd

data = {
    "names": ['a', 'b', 'c', 'd'],
    "ages": [50, 40, 45, 20],
    "relations": ['', '', '', '']
}
rel_list = [('a', 'b'), ('a', 'c'), ('c', 'd')]

df = pd.DataFrame(data)
for rel_tuple in rel_list:
    head = rel_tuple[0]
    tail = rel_tuple[1]
    df.loc[df.names == head, 'relations'] = tail
print(df)
The current result of df is:
  names  ages relations
0     a    50         c
1     b    40
2     c    45         d
3     d    20
However, the correct one is:
  names  ages relations
0     a    50         b
0     a    50         c
1     b    40
2     c    45         d
3     d    20
New rows need to be added, like the second row above. How can I do that?

You can craft a dataframe and merge:
(df.drop('relations', axis=1)
   .merge(pd.DataFrame(rel_list, columns=['names', 'relations']),
          on='names',
          how='outer'
          )
 # .fillna('')  # uncomment to replace NaN with an empty string
)
Output:
  names  ages relations
0     a    50         b
1     a    50         c
2     b    40       NaN
3     c    45         d
4     d    20       NaN
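If the empty strings from the question are preferred over NaN, the commented fillna can simply be uncommented; a minimal sketch of the full chain, reusing the question's df and rel_list:
out = (df.drop('relations', axis=1)
         .merge(pd.DataFrame(rel_list, columns=['names', 'relations']),
                on='names', how='outer')
         .fillna('')   # turn the NaN placeholders back into '' as in the question
       )
print(out)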

Instead of updating df you can create a new one and add relations row by row:
import pandas as pd

data = {
    "names": ['a', 'b', 'c', 'd'],
    "ages": [50, 40, 45, 20],
    "relations": ['', '', '', '']
}
rel_list = [('a', 'b'), ('a', 'c'), ('c', 'd')]

df = pd.DataFrame(data)
new_df = pd.DataFrame(data)
new_df.loc[:, 'relations'] = ''
for head, tail in rel_list:
    new_row = df[df.names == head].copy()  # copy the matching row to avoid SettingWithCopyWarning
    new_row.loc[:, 'relations'] = tail
    new_df = new_df.append(new_row)        # note: DataFrame.append is deprecated in pandas >= 1.4
print(new_df)
Output:
  names  ages relations
0     a    50
1     b    40
2     c    45
3     d    20
0     a    50         b
0     a    50         c
2     c    45         d
Then, if needed, you can finally delete all rows without a value in 'relations':
new_df = new_df[new_df['relations']!='']
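Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a minimal sketch of the same row-by-row idea using pd.concat instead, reusing the question's df, data and rel_list:
new_df = pd.DataFrame(data)
new_df['relations'] = ''
extra_rows = []
for head, tail in rel_list:
    new_row = df[df.names == head].copy()
    new_row['relations'] = tail
    extra_rows.append(new_row)
new_df = pd.concat([new_df] + extra_rows)   # concatenate once instead of appending in the loop
print(new_df)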

Related

Pandas, groupby include number of rows grouped in each row

Is there any way to use
df = pd.read_excel(r'a.xlsx')
df2 = df.groupby(by=["col"], as_index=False).mean()
and also include a new column with the number of rows grouped into each row?
In the absence of sample data, I'm assuming you have multiple numeric columns. You can use apply() to calculate all the means and append len() to the resulting series:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col": np.random.choice(list("ABCD"), 200),
        "val": np.random.uniform(1, 5, 200),
        "val2": np.random.uniform(5, 10, 200),
    }
)

df2 = df.groupby(by=["col"], as_index=False).apply(
    lambda d: d.select_dtypes("number").mean().append(pd.Series({"len": len(d)}))
)
df2
  col      val     val2  len
0   A  3.13064  7.63837   42
1   B   3.1057  7.50656   44
2   C   3.0111  7.82628   54
3   D  3.20709  7.32217   60
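Series.append used above was removed in pandas 2.0; on newer versions the same idea can be written with pd.concat (a sketch under that assumption, same df as above):
df2 = df.groupby(by=["col"], as_index=False).apply(
    lambda d: pd.concat([d.select_dtypes("number").mean(),
                         pd.Series({"len": len(d)})])
)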
Code for the follow-up comment (weighted average per group):
def w_avg(df, values, weights, exp):
    d = df[values]
    w = df[weights] ** exp
    return (d * w).sum() / w.sum()

dfg1 = pd.DataFrame(
    {
        "Jogador": np.random.choice(list("ABCD"), 200),
        "Evento": np.random.choice(list("XYZ"), 200),
        "Rating Calculado BW": np.random.uniform(1, 5, 200),
        "Lances": np.random.uniform(5, 10, 200),
    }
)

dfg = dfg1.groupby(by=["Jogador", "Evento"]).apply(
    lambda dfg1: dfg1.select_dtypes("number")
    .agg(lambda d: w_avg(dfg1, "Rating Calculado BW", "Lances", 1))
    .append(pd.Series({"len": len(dfg1)}))
)
dfg

Pandas add a summary column that counts values that are not empty strings

I have a table that looks like this:
   A       B     C
1  foo
2  foobar  blah
3
I want to count up the non empty columns from A, B and C to get a summary column like this:
   A       B     C  sum
1  foo                1
2  foobar  blah       2
3                     0
Here is how I'm trying to do it:
import pandas as pd

df = {'A': ["foo", "foobar", ""],
      'B': ["", "blah", ""],
      'C': ["", "", ""]}
df = pd.DataFrame(df)
print(df)

df['sum'] = df[['A', 'B', 'C']].notnull().sum(axis=1)
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
These last two lines are different ways to get what I want but they aren't working. Any suggestions?
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
Worked. Thanks for the assistance.
This one-liner worked for me :)
df["sum"] = df.replace("", np.nan).T.count().reset_index().iloc[:, 1]  # requires numpy imported as np

How to replace pd.NamedAgg with code compliant with pandas 0.24.2?

Hello, I am obliged to downgrade my pandas version to 0.24.2.
As a result, pd.NamedAgg is not recognized anymore.
import pandas as pd
import numpy as np

agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
Can you please help me change my code to make it compliant with version 0.24.2?
Thank you very much.
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
Sample:
df = pd.DataFrame({
    'A': list('a') * 6,
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7] * 6,
    'Foo': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
agg_cols = ['A', 'B', 'C']

agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Because there is only one column, Foo, to process, select Foo after groupby and pass tuples of new column names and aggregate functions:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Another idea is to pass a dictionary of lists of aggregate functions:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
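For reference, if an upgrade becomes possible again, pandas 0.25+ also accepts a shorter tuple form of named aggregation (a sketch, not applicable to 0.24.2):
agg_df = df.groupby(agg_cols).agg(
    max_foo=('Foo', 'max'),
    min_foo=('Foo', 'min')
).reset_index()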

Dual sort based on condition within a groupby

[EDIT] Changed df size to 1k and provided piecemeal code for expected result.
Have the following df:
import random
import numpy as np
import pandas as pd

random.seed(1234)
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)

df = pd.DataFrame({'typ': random.choices(typ, k=sz),
                   'sub_typ': random.choices(sub_typ, k=sz),
                   'col_if': random.choices(ifs, k=sz),
                   'col_sort': col_sort,
                   'value': col_val})
I would like to sort within a groupby of the [typ] and [sub_typ] fields, such that the [col_sort] field is sorted in ascending order where [col_if] == 'A' and in descending order where [col_if] == 'D', and then pick up the first 3 values of the sorted dataframe, in one line of code.
Expected result is like df_result below:
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']

df_A_sorted_3 = df_A.groupby(['typ', 'sub_typ'], as_index=False) \
                    .apply(lambda x: x.sort_values('col_sort', ascending=True)) \
                    .groupby(['typ', 'sub_typ', 'col_sort']).head(3)
df_D_sorted_3 = df_D.groupby(['typ', 'sub_typ'], as_index=False) \
                    .apply(lambda x: x.sort_values('col_sort', ascending=False)) \
                    .groupby(['typ', 'sub_typ', 'col_sort']).head(3)

df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
Tried:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values('col_sort', ascending=True)
              if x.col_if == 'A'
              else x.sort_values('col_sort', ascending=False)
).groupby(['typ', 'sub_typ', 'col_sort']).head(3)
...but it gives the error:
ValueError: The truth value of a Series is ambiguous.
Sorting per group is the same as sorting by multiple columns, but a stable sort (kind='mergesort') is necessary if you need identical output.
So, to improve performance, I suggest NOT sorting per group inside groupby:
np.random.seed(1234)
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)

df = pd.DataFrame({'typ': np.random.choice(typ, sz),
                   'sub_typ': np.random.choice(sub_typ, sz),
                   'col_if': np.random.choice(ifs, sz),
                   'col_sort': col_sort,
                   'value': col_val})
# print (df)
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']
df_A_sorted_3 = (df_A.sort_values(['typ', 'sub_typ', 'col_sort'])
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_D_sorted_3 = (df_D.sort_values(['typ', 'sub_typ', 'col_sort'], ascending=[True, True, False])
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
    typ sub_typ col_if  col_sort  value
0     a      s1      A         0    709
1     a      s1      A         0    710
2     a      s1      A         0    801
3     a      s1      A         1    542
4     a      s1      A         1    557
..  ...     ...    ...       ...    ...
646   c      s4      D         1    555
647   c      s4      D         1    233
648   c      s4      D         0    501
649   c      s4      D         0    436
650   c      s4      D         0    695

[651 rows x 5 columns]
Compare outputs:
df_A_sorted_3 = df_A.groupby(['typ', 'sub_typ'], as_index=False) \
                    .apply(lambda x: x.sort_values('col_sort', ascending=True, kind='mergesort')) \
                    .groupby(['typ', 'sub_typ', 'col_sort']).head(3)
df_D_sorted_3 = df_D.groupby(['typ', 'sub_typ'], as_index=False) \
                    .apply(lambda x: x.sort_values('col_sort', ascending=False, kind='mergesort')) \
                    .groupby(['typ', 'sub_typ', 'col_sort']).head(3)
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
    typ sub_typ col_if  col_sort  value
0     a      s1      A         0    709
1     a      s1      A         0    710
2     a      s1      A         0    801
3     a      s1      A         1    542
4     a      s1      A         1    557
..  ...     ...    ...       ...    ...
646   c      s4      D         1    555
647   c      s4      D         1    233
648   c      s4      D         0    501
649   c      s4      D         0    436
650   c      s4      D         0    695

[651 rows x 5 columns]
EDIT: Possible, but slow:
def f(x):
    a = x[x.col_if == 'A'].sort_values('col_sort', ascending=True, kind='mergesort')
    d = x[x.col_if == 'D'].sort_values('col_sort', ascending=False, kind='mergesort')
    return pd.concat([a, d], sort=False)

df_result = (df.groupby(['typ', 'sub_typ', 'col_if'], as_index=False, group_keys=False)
               .apply(f)
               .groupby(['typ', 'sub_typ', 'col_sort', 'col_if'])
               .head(3))
print (df_result)
     typ sub_typ col_if  col_sort  value
242    a      s1      A         0    709
535    a      s1      A         0    710
589    a      s1      A         0    801
111    a      s1      A         1    542
209    a      s1      A         1    557
..   ...     ...    ...       ...    ...
39     c      s4      D         1    555
211    c      s4      D         1    233
13     c      s4      D         0    501
614    c      s4      D         0    436
658    c      s4      D         0    695

[651 rows x 5 columns]
You wrote that col_if should act as a "switch" for the sort order.
But note that each group (at least for your seeding of random) contains both A and D in the col_if column, so your requirement is ambiguous.
One possible solution is to perform a "majority vote" in each group, i.e. the sort order in a particular group is ascending if there are at least as many A values as D values. Note that I arbitrarily chose the ascending order in the "equal" case; maybe you should take the other option.
A doubtful point in your requirements (and hence the code) is that you put .head(3) after the group processing. This way you get the first 3 rows from the first group only. Maybe you want the 3 initial rows from each group? In that case head(3) should be inside the lambda function (as I wrote below).
So change your code to:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values(
        'col_sort',
        ascending=(x.col_if.eq('A').sum() >= x.col_if.eq('D').sum())
    ).head(3))
As you can see, the sort order can be expressed as a boolean expression for ascending, instead of two similar expressions differing only in the ascending parameter.

Selecting values with Pandas multiindex using lists of tuples

I have a DataFrame with a MultiIndex with 3 levels:
id  foo  bar       col1
0   1    a    -0.225873
    2    a    -0.275865
    2    b    -1.324766
3   1    a    -0.607122
    2    a    -1.465992
    2    b    -1.582276
    3    b    -0.718533
7   1    a    -1.904252
    2    a     0.588496
    2    b    -1.057599
    3    a     0.388754
    3    b    -0.940285
Preserving the id index level, I want to sum along the foo and bar levels, but with different values for each id.
For example, for id = 0 I want to sum over foo = [1] and bar = [["a", "b"]], for id = 3 I want to sum over foo = [2] and bar = [["a", "b"]], and for id = 7 I want to sum over foo = [[1,2]] and bar = [["a"]]. Giving the result:
id       col1
0   -0.225873
3   -3.048268
7   -1.315756
I have been trying something along these lines:
df.loc(axis=0)[[(0, 1, ["a", "b"]), (3, 2, ["a", "b"]), (7, [1, 2], "a")]].sum()
Not sure if this is even possible. Any elegant solution (possibly removing the MultiIndex?) would be much appreciated!
The list of tuples is not the problem. The fact that each tuple does not correspond to a single index is the problem (Since a list isn't a valid key). If you want to index a Dataframe like this, you need to expand the lists inside each tuple to their own entries.
Define your options like the following list of dictionaries, then transform using a list comprehension and index using all individual entries.
d = [
    {
        'id': 0,
        'foo': [1],
        'bar': ['a', 'b']
    },
    {
        'id': 3,
        'foo': [2],
        'bar': ['a', 'b']
    },
    {
        'id': 7,
        'foo': [1, 2],
        'bar': ['a']
    },
]
all_idx = [
    (el['id'], i, j)
    for el in d
    for i in el['foo']
    for j in el['bar']
]
# [(0, 1, 'a'), (0, 1, 'b'), (3, 2, 'a'), (3, 2, 'b'), (7, 1, 'a'), (7, 2, 'a')]

df.loc[all_idx].groupby(level=0).sum()
         col1
id
0   -0.225873
3   -3.048268
7   -1.315756
A more succinct solution using slicers:
sections = [(0, 1, slice(None)), (3, 2, slice(None)), (7, slice(1,2), "a")]
pd.concat(df.loc[s] for s in sections).groupby("id").sum()
         col1
id
0   -0.225873
3   -3.048268
7   -1.315756
Two things to note:
This may be less memory-efficient than the accepted answer since pd.concat creates a new DataFrame.
The slice(None)'s are mandatory, otherwise the index columns of the df.loc[s]'s mismatch when calling pd.concat.
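As a side note, pd.IndexSlice can build the same tuples of slices without writing slice(None) by hand; a small sketch assuming the same df:
idx = pd.IndexSlice
sections = [idx[0, 1, :], idx[3, 2, :], idx[7, 1:2, 'a']]
pd.concat(df.loc[s] for s in sections).groupby("id").sum()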