Selecting values with Pandas multiindex using lists of tuples - pandas

I have a DataFrame with a MultiIndex with 3 levels:
id  foo  bar       col1
0   1    a    -0.225873
    2    a    -0.275865
    2    b    -1.324766
3   1    a    -0.607122
    2    a    -1.465992
    2    b    -1.582276
    3    b    -0.718533
7   1    a    -1.904252
    2    a     0.588496
    2    b    -1.057599
    3    a     0.388754
    3    b    -0.940285
Preserving the id index level, I want to sum along the foo and bar levels, but with different values for each id.
For example, for id = 0 I want to sum over foo = [1] and bar = ["a", "b"], for id = 3 I want to sum over foo = [2] and bar = ["a", "b"], and for id = 7 I want to sum over foo = [1, 2] and bar = ["a"]. This gives the result:
id       col1
0   -0.225873
3   -3.048268
7   -1.315756
I have been trying something along these lines:
df.loc(axis=0)[[(0, 1, ["a", "b"]), (3, 2, ["a", "b"]), (7, [1, 2], "a")]].sum()
Not sure if this is even possible. Any elegant solution (possibly removing the MultiIndex?) would be much appreciated!

The list of tuples is not the problem. The problem is that each tuple does not correspond to a single index entry, since a list is not a valid key inside a tuple. If you want to index a DataFrame like this, you need to expand the lists inside each tuple into their own entries.
Define your options as the following list of dictionaries, then expand them with a list comprehension and index using all the individual entries.
d = [
    {
        'id': 0,
        'foo': [1],
        'bar': ['a', 'b']
    },
    {
        'id': 3,
        'foo': [2],
        'bar': ['a', 'b']
    },
    {
        'id': 7,
        'foo': [1, 2],
        'bar': ['a']
    },
]
all_idx = [
    (el['id'], i, j)
    for el in d
    for i in el['foo']
    for j in el['bar']
]
# [(0, 1, 'a'), (0, 1, 'b'), (3, 2, 'a'), (3, 2, 'b'), (7, 1, 'a'), (7, 2, 'a')]
df.loc[all_idx].groupby(level=0).sum()
        col1
id
0  -0.225873
3  -3.048268
7  -1.315756
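
For reference, here is a self-contained sketch of the same expansion, assuming the frame from the question is rebuilt by hand. The names `selection`, `wanted` and `mask` are made up for illustration. It swaps the `.loc` lookup for a boolean mask built with `MultiIndex.isin`, because one of the expanded keys, (0, 1, 'b'), does not exist in the data and, depending on the pandas version, `.loc` with a missing key may raise a KeyError.

```python
from itertools import product

import pandas as pd

# Rebuild the example frame from the question.
index = pd.MultiIndex.from_tuples(
    [
        (0, 1, "a"), (0, 2, "a"), (0, 2, "b"),
        (3, 1, "a"), (3, 2, "a"), (3, 2, "b"), (3, 3, "b"),
        (7, 1, "a"), (7, 2, "a"), (7, 2, "b"), (7, 3, "a"), (7, 3, "b"),
    ],
    names=["id", "foo", "bar"],
)
col1 = [
    -0.225873, -0.275865, -1.324766,
    -0.607122, -1.465992, -1.582276, -0.718533,
    -1.904252, 0.588496, -1.057599, 0.388754, -0.940285,
]
df = pd.DataFrame({"col1": col1}, index=index)

# Hypothetical name: one dict per id with the foo/bar values to keep.
selection = [
    {"id": 0, "foo": [1], "bar": ["a", "b"]},
    {"id": 3, "foo": [2], "bar": ["a", "b"]},
    {"id": 7, "foo": [1, 2], "bar": ["a"]},
]

# Expand every dict into individual (id, foo, bar) keys.
wanted = [
    key
    for el in selection
    for key in product([el["id"]], el["foo"], el["bar"])
]

# A boolean mask simply ignores keys that are not present in the index.
mask = df.index.isin(wanted)
result = df[mask].groupby(level="id").sum()
print(result)
```

The mask keeps only rows whose full index tuple appears in `wanted`, so the grouping step afterwards is the same as in the answer above.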

A more succinct solution using slicers:
sections = [(0, 1, slice(None)), (3, 2, slice(None)), (7, slice(1,2), "a")]
pd.concat(df.loc[s] for s in sections).groupby("id").sum()
        col1
id
0  -0.225873
3  -3.048268
7  -1.315756
Two things to note:
This may be less memory-efficient than the accepted answer since pd.concat creates a new DataFrame.
The slice(None)'s are mandatory, otherwise the index columns of the df.loc[s]'s mismatch when calling pd.concat.
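
A runnable sketch of the slicer idea, under a few assumptions: the frame is rebuilt by hand, `pd.IndexSlice` is used as a shorthand for writing the `slice(...)` objects, and the index is sorted first because label slicing on a MultiIndex generally expects a sorted index. The column selector `:` is passed explicitly so that each tuple is read as a row key.

```python
import pandas as pd

# Rebuild the example frame from the question.
index = pd.MultiIndex.from_tuples(
    [
        (0, 1, "a"), (0, 2, "a"), (0, 2, "b"),
        (3, 1, "a"), (3, 2, "a"), (3, 2, "b"), (3, 3, "b"),
        (7, 1, "a"), (7, 2, "a"), (7, 2, "b"), (7, 3, "a"), (7, 3, "b"),
    ],
    names=["id", "foo", "bar"],
)
col1 = [
    -0.225873, -0.275865, -1.324766,
    -0.607122, -1.465992, -1.582276, -0.718533,
    -1.904252, 0.588496, -1.057599, 0.388754, -0.940285,
]
df = pd.DataFrame({"col1": col1}, index=index).sort_index()

idx = pd.IndexSlice
# idx[0, 1, :] is the same tuple as (0, 1, slice(None));
# the label slice 1:2 includes both endpoints.
sections = [idx[0, 1, :], idx[3, 2, :], idx[7, 1:2, "a"]]

# Each piece keeps all three index levels, so the pieces line up in concat.
parts = [df.loc[s, :] for s in sections]
result = pd.concat(parts).groupby(level="id").sum()
print(result)
```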

Related

Adding and updating a pandas column based on conditions of other columns

I have a DataFrame with over 1 million rows.
One column, 'activity', holds numbers from 1 to 12.
I added a new empty column called 'label'.
The column 'label' needs to be filled with 0 or 1, based on the values of the column 'activity'.
If activity is 1, 2, 3, 6, 7 or 8, label should be 0; otherwise it should be 1.
Here is what I am currently doing:
import pandas as pd

df = pd.read_csv('data.csv')
df['label'] = ''
for index, row in df.iterrows():
    if (row['activity'] == 1 or row['activity'] == 2 or row['activity'] == 3
            or row['activity'] == 6 or row['activity'] == 7 or row['activity'] == 8):
        df.loc[index, 'label'] = 0
    else:
        df.loc[index, 'label'] = 1
df.to_csv('data.csv', index=False)
This is very inefficient and takes too long to run. Are there any optimizations, possibly using NumPy arrays? And is there any way to make the code cleaner?
Use numpy.where with Series.isin:
df['label'] = np.where(df['activity'].isin([1, 2, 3, 6, 7, 8]), 0, 1)
Or True, False mapping to 0, 1 by inverting mask:
df['label'] = (~df['activity'].isin([1, 2, 3, 6, 7, 8])).astype(int)
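
A small runnable sketch of both variants on made-up data (the six activity values and the names `zero_activities`, `in_zero_group` and `label_alt` are only for illustration), to show that the two expressions give the same labels:

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the 'activity' column of data.csv.
df = pd.DataFrame({"activity": [1, 4, 6, 12, 3, 9]})

zero_activities = [1, 2, 3, 6, 7, 8]
in_zero_group = df["activity"].isin(zero_activities)  # boolean Series

# Variant 1: pick 0 where the mask is True, 1 elsewhere.
df["label"] = np.where(in_zero_group, 0, 1)

# Variant 2: invert the mask and cast True/False to 1/0.
df["label_alt"] = (~in_zero_group).astype(int)

print(df)
```

Both variants work on the whole column at once, which is why they avoid the row-by-row cost of iterrows().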

How to modify dataframe based on column values

I want to add relationships to the column 'relations' based on rel_list. Specifically, for each tuple, e.g. ('a', 'b'), I want to replace the 'relations' value '' with 'b' in the row for 'a', but without duplicates: in the row for 'b', '' should not be replaced with 'a', since the two are considered the same relationship. The following code does not work correctly:
import pandas as pd

data = {
    "names": ['a', 'b', 'c', 'd'],
    "ages": [50, 40, 45, 20],
    "relations": ['', '', '', '']
}
rel_list = [('a', 'b'), ('a', 'c'), ('c', 'd')]
df = pd.DataFrame(data)
for rel_tuple in rel_list:
    head = rel_tuple[0]
    tail = rel_tuple[1]
    df.loc[df.names == head, 'relations'] = tail
print(df)
The current result of df is:
  names  ages relations
0     a    50         c
1     b    40
2     c    45         d
3     d    20
However, the correct one is:
  names  ages relations
0     a    50         b
0     a    50         c
1     b    40
2     c    45         d
3     d    20
New rows need to be added, such as the second row in the output above. How can I do that?
You can craft a dataframe and merge:
(df.drop('relations', axis=1)
   .merge(pd.DataFrame(rel_list, columns=['names', 'relations']),
          on='names',
          how='outer'
          )
   # .fillna('')  # uncomment to replace NaN with empty string
)
Output:
  names  ages relations
0     a    50         b
1     a    50         c
2     b    40       NaN
3     c    45         d
4     d    20       NaN
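
A self-contained variation on the merge, as a sketch only. It assumes every first element in rel_list also appears in 'names', so it uses how='left' instead of 'outer' to keep the original row order, and it fills the missing relations with empty strings. The names `relations` and `out` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "names": ["a", "b", "c", "d"],
        "ages": [50, 40, 45, 20],
        "relations": ["", "", "", ""],
    }
)
rel_list = [("a", "b"), ("a", "c"), ("c", "d")]

# One row per (head, tail) pair.
relations = pd.DataFrame(rel_list, columns=["names", "relations"])

out = (
    df.drop(columns="relations")
    # Assumption: a left join is enough because every head exists in df.
    .merge(relations, on="names", how="left")
    # Names without any relation get an empty string instead of NaN.
    .fillna({"relations": ""})
)
print(out)
```

A name that appears twice as a head ('a' here) produces two rows, which is the duplication asked for in the question.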
Instead of updating df you can create a new one and add relations row by row:
import pandas as pd

data = {
    "names": ['a', 'b', 'c', 'd'],
    "ages": [50, 40, 45, 20],
    "relations": ['', '', '', '']
}
rel_list = [('a', 'b'), ('a', 'c'), ('c', 'd')]
df = pd.DataFrame(data)
new_df = pd.DataFrame(data)
new_df.loc[:, 'relations'] = ''
for head, tail in rel_list:
    # take a copy of the matching row with the relation filled in
    new_row = df[df.names == head].assign(relations=tail)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    new_df = pd.concat([new_df, new_row])
print(new_df)
Output:
  names  ages relations
0     a    50
1     b    40
2     c    45
3     d    20
0     a    50         b
0     a    50         c
2     c    45         d
Then, if needed, in the end you can delete all rows without value in 'relations':
new_df = new_df[new_df['relations']!='']

Pandas, groupby include number of rows grouped in each row

Is there any way to use
df = pd.read_excel(r'a.xlsx')
df2 = df.groupby(by=["col"], as_index=False).mean()
and include a new column with the number of rows grouped in each row?
In the absence of sample data, I'm assuming you have multiple numeric columns.
You can use apply() to calculate all the means and then append len() to the resulting Series.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col": np.random.choice(list("ABCD"), 200),
        "val": np.random.uniform(1, 5, 200),
        "val2": np.random.uniform(5, 10, 200),
    }
)
# Series.append was removed in pandas 2.0, so the row count is attached with pd.concat
df2 = df.groupby(by=["col"], as_index=False).apply(
    lambda d: pd.concat([d.select_dtypes("number").mean(), pd.Series({"len": len(d)})])
)
df2
  col      val     val2  len
0   A  3.13064  7.63837   42
1   B  3.1057   7.50656   44
2   C  3.0111   7.82628   54
3   D  3.20709  7.32217   60
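
As an alternative sketch (not the approach used above), named aggregation can produce the means and the group size in one agg() call without apply(). The column names follow the random sample data above; the seeded generator `rng` is an assumption added only to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for repeatability
df = pd.DataFrame(
    {
        "col": rng.choice(list("ABCD"), 200),
        "val": rng.uniform(1, 5, 200),
        "val2": rng.uniform(5, 10, 200),
    }
)

# Each keyword is an output column: (input column, aggregation).
df2 = df.groupby("col", as_index=False).agg(
    val=("val", "mean"),
    val2=("val2", "mean"),
    len=("val", "size"),  # number of rows in each group
)
print(df2)
```

The trade-off is that each numeric column has to be listed explicitly, whereas the apply() version picks up all numeric columns automatically.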
Code for the comment:
def w_avg(df, values, weights, exp):
    d = df[values]
    w = df[weights] ** exp
    return (d * w).sum() / w.sum()

dfg1 = pd.DataFrame(
    {
        "Jogador": np.random.choice(list("ABCD"), 200),
        "Evento": np.random.choice(list("XYZ"), 200),
        "Rating Calculado BW": np.random.uniform(1, 5, 200),
        "Lances": np.random.uniform(5, 10, 200),
    }
)
# Series.append was removed in pandas 2.0, so the row count is attached with pd.concat
dfg = dfg1.groupby(by=["Jogador", "Evento"]).apply(
    lambda g: pd.concat(
        [
            g.select_dtypes("number").agg(
                lambda d: w_avg(g, "Rating Calculado BW", "Lances", 1)
            ),
            pd.Series({"len": len(g)}),
        ]
    )
)
dfg

using agg to flatten a series of lists in pandas

I have a number of multi-index columns each with a list of tuples that I want to flatten (the list, not the tuples) but I'm struggling with it. Here's what I have:
df = pd.DataFrame([[[(1,'a')],[(6,'b')],np.nan,np.nan],[[(5,'d'),(10,'e')],np.nan,np.nan,[(8,'c')]]])
df.columns = pd.MultiIndex.from_tuples([('a', 0), ('a', 1), ('b', 0), ('b', 1)])
>>> df
                   a              b
                   0         1    0         1
0           [(1, a)]  [(6, b)]  NaN       NaN
1  [(5, d), (10, e)]       NaN  NaN  [(8, c)]
Desired result:
>>> df
                        a              b
0        [(1, a), (6, b)]     [NaN, NaN]
1  [(5, d), (10, e), NaN]  [NaN, (8, c)]
How do I do this? From this related question, I tried the following:
>>> df.stack(level=1).groupby(level=[0]).agg(lambda x: np.array(list(x)).flatten())
a b
0 a b
1 a b
>>> df.stack(level=1).groupby(level=[0]).agg(lambda x: np.concatenate(list(x)))
...
Exception: Must produce aggregated value
Here's a way to do it:
# taken from https://stackoverflow.com/questions/12472338/flattening-a-list-recursively
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])

# reshape the data to get the desired structure
df2 = (df
       .unstack()
       .reset_index()
       .drop(columns='level_1')  # the positional axis argument was removed in pandas 2.0
       .groupby(['level_0', 'level_2'])[0]
       .apply(list).apply(flatten).unstack().T)
df2.index.name = None
df2.columns.name = None
print(df2)
                       a             b
0       [(1, a), (6, b)]      [na, na]
1  [(5, d), (10, e), na]  [na, (8, c)]
Found a one-liner, using the custom flatten function given by @YOLO:
>>> df.stack(level=1).groupby(level=0).agg(list).applymap(flatten)
                        a              b
0        [(1, a), (6, b)]     [nan, nan]
1  [(5, d), (10, e), nan]  [nan, (8, c)]
where
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])
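
One caveat with the recursive flatten: it recurses once per element, so a list with more than roughly a thousand items can hit Python's default recursion limit. Below is a sketch of an iterative variant; the name `flatten_iter` is hypothetical and not part of either answer. Like the original, it only unpacks lists, so tuples and NaN values are kept as they are.

```python
def flatten_iter(items):
    """Flatten arbitrarily nested lists without recursion, preserving order."""
    out = []
    stack = [iter(items)]  # iterators over the lists currently being walked
    while stack:
        for item in stack[-1]:
            if isinstance(item, list):
                # Descend into the nested list; the outer iterator resumes later.
                stack.append(iter(item))
                break
            out.append(item)
        else:
            # The innermost iterator is exhausted; go back up one level.
            stack.pop()
    return out


nested = [[(5, "d"), (10, "e")], [[(1, "a")]], float("nan")]
flat = flatten_iter(nested)
print(flat)
```

It can be passed to applymap in place of flatten in the one-liner above.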

Is there a way to force overlap of two circles?

I would like to draw a Venn Diagram really close to what the R Limma Package does.
In this case I have a set that does not overlap the two others.
The R package shows that with "0", but matplotlib-venn draws another circle.
edit:
My 3 sets are:
9
7 8 9 10
1 2 3 4 5 6
My code is:
from matplotlib_venn import venn3

set2 = set([9])
set1 = set([7, 8, 9, 10])
set3 = set([1, 2, 3, 4, 5, 6])
sets = [set1, set2, set3]
lengths = [len(one_set) for one_set in sets]
venn3([set1, set2, set3], ["Group (Total {})".format(length) for (length) in lengths])
Thank you.
R Limma: https://i.ibb.co/h9yhgm1/2019-05-07-Screen-Hunter-06.jpg
matplotlib_venn: https://i.ibb.co/zx6YJbz/2019-05-07-Screen-Hunter-07.jpg
Fred
There is no element that is common to set3 and either set1 or set2. Both diagrams are correct. If you want to show all the spaces, you can try with venn3_unweighted:
from matplotlib_venn import venn3_unweighted
set2 = set([9])
set1 = set([7, 8, 9, 10])
set3 = set([1, 2, 3, 4, 5, 6])
sets = [set1, set2, set3]
lengths = [len(one_set) for one_set in sets]
venn3_unweighted([set1, set2, set3], ["Group (Total {})".format(length) for (length) in lengths])
And the result: