Count Boolean values from pivot table with pandas

I have a dataframe df defined like so:
A B C D E F
0 a z l 1 qqq True
1 a z l 2 qqq True
2 a z l 3 qqq False
3 a z r 1 www True
4 a z r 2 www False
5 a z r 2 www False
6 s x 7 2 eee True
7 s x 7 3 eee False
8 s x 7 4 eee True
9 s x 5 1 eee True
10 d c l 1 rrr True
11 d c l 2 rrr False
12 d c r 1 fff False
13 d c r 2 fff True
14 d c r 3 fff True
My goal is to create a table based on the unique values of columns A, B and C so that I am able to count the number of elements in column D and the number of unique elements in column E.
The output looks like this:
D E
A B
a z 6 2
d c 5 2
s x 4 2
Where, for example, 6 is the number of elements in column D for the rows having A=a, and 2 is the number of unique elements in column E for those rows (qqq, www).
I was able to achieve this goal with the following lines of code:
import pandas as pd

# Define dataframe
df = pd.DataFrame({'A': ['a','a','a','a','a','a','s','s','s','s','d','d','d','d','d'],
                   'B': ['z','z','z','z','z','z','x','x','x','x','c','c','c','c','c'],
                   'C': ['l','l','l','r','r','r','7','7','7','5','l','l','r','r','r'],
                   'D': ['1','2','3','1','2','2','2','3','4','1','1','2','1','2','3'],
                   'E': ['qqq','qqq','qqq','www','www','www','eee','eee','eee','eee','rrr','rrr','fff','fff','fff'],
                   'F': [True,True,False,True,False,False,True,False,True,True,True,False,False,True,True]})
# My code so far
a = df.pivot_table(index=['A','B','C'], aggfunc={'E':'nunique', 'D':'count'}).sort_values(by='E')
a = a.pivot_table(index=['A','B'], aggfunc='sum').sort_values(by='E')
The Problem:
Now I would like also to count the number of True or False values present in the dataframe with the same criteria presented before so that the result looks like this:
D E True False
A B
a z 6 2 3 3
d c 5 2 3 2
s x 4 2 3 1
As you can see, for A=a there are 3 True values and 3 False values.
What is a smart and elegant way to achieve my final goal?

Using your code, you could extend like this:
# Extending your code
a = (df.pivot_table(index=['A','B','C'], aggfunc={'E': 'nunique', 'D': 'count', 'F': 'sum'})
       .sort_values(by='E')
       .rename(columns={'F': 'F_True'}))
a = (a.pivot_table(index=['A','B'], aggfunc='sum')
      .sort_values(by='E')
      .eval('F_False = D - F_True'))
Output:
D E F_True F_False
A B
a z 6 2 3.0 3.0
d c 5 2 3.0 2.0
s x 4 2 3.0 1.0
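The True/False counts come out as floats after the second aggregation; if you prefer integers, a small cast (a sketch, reusing the column names above) cleans it up:
a = a.astype({'F_True': int, 'F_False': int})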

You just need two steps:
pd.concat([df.groupby(['A','B','C']).agg({'E': 'nunique', 'D': 'size'}).groupby(level=[0,1]).sum(),
           df.groupby(['A','B']).F.value_counts().unstack()], axis=1)
Out[702]:
E D False True
A B
a z 2 6 3 3
d c 2 5 2 3
s x 2 4 1 3
Using value_counts:
df.groupby(['A','B']).F.value_counts().unstack()
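If you want those counts next to the D/E aggregation, a sketch that joins them onto the result a from the first answer (both are indexed by (A, B), so the join aligns on the index):
counts = df.groupby(['A','B']).F.value_counts().unstack()
result = a.join(counts)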

Related

dataframe sorting by sum of values

I have the following df:
df = pd.DataFrame({'from':['A','A','A','B','B','C','C','C'],'to':['J','C','F','C','M','Q','C','J'],'amount':[1,1,2,12,13,5,5,1]})
df
and I wish to sort it in such a way that the 'from' group with the highest total amount comes first. So in this example, 'from' B has 12 + 13 = 25, so B is first in the list. Then comes C with 11 and then A with 4.
One way to do it is like this:
df['temp'] = df.groupby(['from'])['amount'].transform('sum')
df.sort_values(by=['temp'], ascending=False)
but I'm just adding another column. Wonder if there's a better way?
I think your method is good and explicit.
A variant without the temporary column could be:
df.sort_values(by='from', ascending=False,
               key=lambda x: df['amount'].groupby(x).transform('sum'))
output:
from to amount
3 B C 12
4 B M 13
5 C Q 5
6 C C 5
7 C J 1
0 A J 1
1 A C 1
2 A F 2
In your case, you can do it with argsort:
out = df.iloc[(-df.groupby(['from'])['amount'].transform('sum')).argsort()]
Out[53]:
from to amount
3 B C 12
4 B M 13
5 C Q 5
6 C C 5
7 C J 1
0 A J 1
1 A C 1
2 A F 2
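If your pandas predates 1.1 (where sort_values gained the key parameter), a sketch of the original idea that never leaves a helper column on df:
out = (df.assign(temp=df.groupby('from')['amount'].transform('sum'))
         .sort_values('temp', ascending=False)
         .drop(columns='temp'))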

How to plot values of my columns that are above a certain threshold?

I've been stuck with this problem for a while. I have a dataset which looks more or less like this:
Students  Subject  Mark
       1  M F      7 4 3 7
       2  I        5 6
       3  M F I S  2 3 0
       4  M        2 2
       5  F M I    5 1
       6  I M F    6 2 3
       7  I M      7
Now, I want to create a barplot using pandas and seaborn showing how many students:
Have 3 or more letters in the column "Subject"
Have at least one 3 in the column "Mark"
Have both things
I tried with:
n_subject = dataset['Subject'].str.count('\w+')
dataset['NumberSubjects']= n_subject
n_over = dataset[dataset.n_subject >= 3.0]
But it does not work and I'm stuck. I'm sure it is a very basic problem but I don't know what to do.
3 or more subjects:
df["Subject"].str.count("\w+") >= 3
Has one or more marks of 3:
df["Mark"].str.count("3") >= 1
Both:
(df["Subject"].str.count("\w+") >= 3) & (df["Mark"].str.count("3") >= 1)
Boolean representation:
Students Subject Mark one two three
0 1 M F 7 4 3 7 False True False
1 2 I 5 6 False False False
2 3 M F I S 2 3 0 True True True
3 4 M 2 2 False False False
4 5 F M I 5 1 True False False
5 6 I M F 6 2 3 True True True
6 7 I M 7 False False False
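For reference, the three boolean columns above could be assembled like this (the names one/two/three are just illustrative):
df['one'] = df['Subject'].str.count(r'\w+') >= 3
df['two'] = df['Mark'].str.count('3') >= 1
df['three'] = df['one'] & df['two']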
I am not really sure what the barplot should represent (a summary of Mark?), but here is what you need for the filtering. Also, counting on the raw string would include the spaces too, but there are multiple ways of handling that. I am just giving you an idea of what to do and how.
>>> m1 = df.Subject.apply(lambda x: len(x.split()) >= 3)
>>> m2 = df.Mark.str.contains('3')
>>> m3 = m1 | m2
>>> df[m1]
Students Subject Mark
2 3 M F I S 2 3 0
4 5 F M I 5 1
5 6 I M F 6 2 3
>>> df[m2]
Students Subject Mark
0 1 M F 7 4 3 7
2 3 M F I S 2 3 0
5 6 I M F 6 2 3
>>> df[m3]
Students Subject Mark
0 1 M F 7 4 3 7
2 3 M F I S 2 3 0
4 5 F M I 5 1
5 6 I M F 6 2 3
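To get from these masks to the requested barplot, a minimal seaborn sketch (assuming the masks m1 and m2 from above; the labels are illustrative):
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Count how many students satisfy each condition
counts = pd.DataFrame({'condition': ['3+ subjects', 'has a 3', 'both'],
                       'students': [m1.sum(), m2.sum(), (m1 & m2).sum()]})
sns.barplot(data=counts, x='condition', y='students')
plt.show()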

pandas groupby apply optimizing a loop

For the following data:
index bond stock investor_bond investor_stock
0 1 2 A B
1 1 2 A E
2 1 2 A F
3 1 2 B B
4 1 2 B E
5 1 2 B F
6 1 3 A A
7 1 3 A E
8 1 3 A G
9 1 3 B A
10 1 3 B E
11 1 3 B G
12 2 4 C F
13 2 4 C A
14 2 4 C C
15 2 5 B E
16 2 5 B B
17 2 5 B H
Bond 1 has two investors: A and B. Stock 2 has three investors: B, E and F. For each investor pair (investor_bond, investor_stock), we want to filter the row out if the two investors have ever invested in the same bond or stock.
For example, the pair (B, F) at index 5 should be filtered out because both B and F invested in stock 2.
Sample output should be like:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list = A1.groupby(['bond','stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
stock_list = stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
A1 = pd.merge(A1, stock_list, on='bond', how='left')
A1['in_out'] = False
for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
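One way to avoid the nested loop is to precompute the investor set of every bond and stock and collect every investor pair that shares one; each row then becomes a single set-membership test. A minimal sketch of that idea on the sample data (all helper names are my own):
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'bond': [1]*12 + [2]*6,
                   'stock': [2]*6 + [3]*6 + [4]*3 + [5]*3,
                   'investor_bond': list('AAABBB'*2 + 'CCCBBB'),
                   'investor_stock': list('BEF'*2 + 'AEG'*2 + 'FAC' + 'EBH')})

# Long table of (investor, asset) memberships; prefix the ids so bond 1 and stock 1 stay distinct
bonds = df[['investor_bond', 'bond']].rename(columns={'investor_bond': 'investor', 'bond': 'asset'})
stocks = df[['investor_stock', 'stock']].rename(columns={'investor_stock': 'investor', 'stock': 'asset'})
bonds['asset'] = 'b' + bonds['asset'].astype(str)
stocks['asset'] = 's' + stocks['asset'].astype(str)
members = pd.concat([bonds, stocks]).drop_duplicates()

# Every unordered investor pair that co-invested in some asset
conflicts = set()
for _, investors in members.groupby('asset')['investor']:
    conflicts.update(combinations(sorted(investors.unique()), 2))

# Keep only pairs of distinct investors that never co-invested
mask = df.apply(lambda r: r.investor_bond != r.investor_stock and
                tuple(sorted((r.investor_bond, r.investor_stock))) not in conflicts, axis=1)
print(df[mask])  # leaves only index 11: bond 1, stock 3, (B, G)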

How to remove duplicate rows where two columns match?

For my graduation project, I would like to remove duplicate rows and keep, for each value in column a, only the row where columns b and c are equal. I tried a lot of things (groupby, merge combinations, duplicates) but nothing has worked so far. Can you please help me? Many thanks!
input:
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
result:
a b c
1 1 A A
4 2 B B
I believe you need:
print (df)
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
5 3 C C
6 4 C NaN
7 4 C E
7 5 NaN E
Replace NaNs by back filling and then forward filling across columns b and c:
df1 = df[['b','c']].bfill(axis=1).ffill(axis=1)
print (df1)
b c
0 A B
1 A A
2 A C
3 B A
4 B B
5 C C
6 C C
7 C E
7 E E
Check the condition in df1 and, because duplicate index values are possible, filter df with the resulting boolean mask:
df = df[df1['b'] == df1['c']]
print (df)
a b c
1 1 A A
4 2 B B
5 3 C C
6 4 C NaN
7 5 NaN E
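An equivalent sketch without the helper frame, letting a missing side match the other side via fillna:
out = df[df['b'].fillna(df['c']) == df['c'].fillna(df['b'])]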

Group by with a pandas dataframe using different aggregation for different columns

I have a pandas dataframe df with columns [a, b, c, d, e, f]. I want to perform a group by on df. I can best describe what it's supposed to do in SQL:
SELECT a, b, min(c), min(d), max(e), sum(f)
FROM df
GROUP BY a, b
How do I do this group by using pandas on my dataframe df?
consider df:
a b c d e f
1 1 2 5 9 3
1 1 3 3 4 5
2 2 4 7 4 4
2 2 5 3 8 8
I expect the result to be:
a b c d e f
1 1 2 3 9 8
2 2 4 3 8 12
Use agg:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    a=list('aaaabbbb'),
    b=list('ccddccdd'),
    c=np.arange(8),
    d=np.arange(8),
    e=np.arange(8),
    f=np.arange(8),
))
funcs = dict(c='min', d='min', e='max', f='sum')
df.groupby(['a', 'b']).agg(funcs).reset_index()
a b c e f d
0 a c 0 1 1 0
1 a d 2 3 5 2
2 b c 4 5 9 4
3 b d 6 7 13 6
With your data:
a b c e f d
0 1 1 2 9 8 3
1 2 2 4 8 12 3
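On pandas 0.25 or later you could also spell this with named aggregation, which keeps a and b as regular columns and makes the output column names explicit (a sketch using the same functions):
out = df.groupby(['a', 'b'], as_index=False).agg(
    c=('c', 'min'), d=('d', 'min'), e=('e', 'max'), f=('f', 'sum'))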