How to plot values of my columns being above a certain threshold? - pandas

I've been stuck with this problem for a while. I have a dataset which looks more or less like this:
Students  Subject  Mark
1         M F      7 4 3 7
2         I        5 6
3         M F I S  2 3 0
4         M        2 2
5         F M I    5 1
6         I M F    6 2 3
7         I M      7
Now, I want to create a barplot using pandas and seaborn showing how many students:
Have 3 or more letters in the column "Subject"
Have at least one 3 in the column "Mark"
Have both things
I tried with:
n_subject = dataset['Subject'].str.count(r'\w+')
dataset['NumberSubjects'] = n_subject
n_over = dataset[dataset.n_subject >= 3.0]  # fails: the column is named 'NumberSubjects', not 'n_subject'
But it does not work and I'm stuck. I'm sure it is a very basic problem but I don't know what to do.

3 or more subjects:
df["Subject"].str.count(r"\w+") >= 3
Has one or more marks of 3:
df["Mark"].str.count("3") >= 1
Both:
(df["Subject"].str.count(r"\w+") >= 3) & (df["Mark"].str.count("3") >= 1)
Boolean representation:
   Students  Subject  Mark       one    two  three
0         1  M F      7 4 3 7  False   True  False
1         2  I        5 6      False  False  False
2         3  M F I S  2 3 0     True   True   True
3         4  M        2 2      False  False  False
4         5  F M I    5 1       True  False  False
5         6  I M F    6 2 3     True   True   True
6         7  I M      7        False  False  False
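Since the question also asks for the barplot itself, here is a minimal sketch (assuming seaborn and matplotlib are installed, and re-typing the sample data above) that turns the three masks into counts and plots them:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Students": [1, 2, 3, 4, 5, 6, 7],
    "Subject": ["M F", "I", "M F I S", "M", "F M I", "I M F", "I M"],
    "Mark": ["7 4 3 7", "5 6", "2 3 0", "2 2", "5 1", "6 2 3", "7"],
})

# the three conditions from the question
m1 = df["Subject"].str.count(r"\w+") >= 3
m2 = df["Mark"].str.count("3") >= 1

counts = pd.Series({"3+ subjects": m1.sum(),
                    "has a 3": m2.sum(),
                    "both": (m1 & m2).sum()})
sns.barplot(x=counts.index, y=counts.values)
plt.ylabel("number of students")
plt.show()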

I am not really sure what the barplot should be representing (a summary of Mark?), but here is what you need for filtering purposes. Also, a plain character count would count the empty spaces too, but there are multiple ways of handling this; one is shown after the output below. I am just giving you an idea of what to do and how to do it.
>>> m1 = df.Subject.apply(lambda x: len(x.split()) >= 3)  # 3 or more subjects
>>> m2 = df.Mark.str.contains('3')                        # at least one mark of 3
>>> m3 = m1 | m2                                          # either condition (use m1 & m2 for "both")
>>> df[m1]
   Students  Subject  Mark
2         3  M F I S  2 3 0
4         5  F M I    5 1
5         6  I M F    6 2 3
>>> df[m2]
   Students  Subject  Mark
0         1  M F      7 4 3 7
2         3  M F I S  2 3 0
5         6  I M F    6 2 3
>>> df[m3]
   Students  Subject  Mark
0         1  M F      7 4 3 7
2         3  M F I S  2 3 0
4         5  F M I    5 1
5         6  I M F    6 2 3
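One such way (a sketch) to avoid counting the spaces is to let the string accessor do the split:

>>> n_subjects = df.Subject.str.split().str.len()  # whitespace-separated tokens per row
>>> m1 = n_subjects >= 3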

Related

Label the first element in each groupby

I have a data frame that looks like the following
df = pd.DataFrame({'group':[1,1,2,2,2],'time':[1,2,3,4,5],'C':[6,7,8,9,10]})
   group  time   C
0      1     1   6
1      1     2   7
2      2     3   8
3      2     4   9
4      2     5  10
and I'm looking to label the first element (in terms of time) in each group as True, i.e.:
   group  time   C  first_in_group
0      1     1   6            True
1      1     2   7           False
2      2     3   8            True
3      2     4   9           False
4      2     5  10           False
I tried several combinations of groupby and first but did not manage to achieve what I wanted.
Is there an elegant way to do it in Pandas?
Use duplicated:
df['first_in_group'] = ~df.group.duplicated()  # the first occurrence of each group value is not a duplicate
OUTPUT:
   group  time   C  first_in_group
0      1     1   6            True
1      1     2   7           False
2      2     3   8            True
3      2     4   9           False
4      2     5  10           False
NOTE: Do the sorting 1st (if required).
df = df.sort_values(['group', 'time'])
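A groupby-based sketch is equivalent: cumcount numbers the rows of each group from zero, so (after the same sort) the first row is the one where it equals 0:

# cumcount is 0 for the first row of every group
df['first_in_group'] = df.groupby('group').cumcount() == 0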

Pandas: fetch rows that continuously have a similar value

I have a dataframe like so..
id  time  status
--  ----  ------
a      1       T
a      2       F
b      1       T
b      2       T
a      3       T
a      4       T
b      3       F
b      4       T
b      5       T
I would like to fetch the ids that continuously have the status 'T' for a certain threshold number of times (say 2 in this case).
Thus the fetched rows would be...
id  time  status
--  ----  ------
b      1       T
b      2       T
a      3       T
a      4       T
b      4       T
b      5       T
I can think of an iterative solution. What I am looking for is something more pandas/SQL-like. I think an order by id and then time, followed by a group by id and then status, should work, but I'd like to be sure.
Compare the values with Series.eq to find the 'T' rows, and create identifiers for runs of consecutive equal values with Series.shift and Series.cumsum. Count the size of each run with Series.value_counts and map the counts back onto the rows with Series.map, then compare against the threshold with Series.ge. Finally, chain both masks with bitwise AND and filter by boolean indexing:
N = 2
m1 = df['status'].eq('T')                           # rows with status T
g = df['status'].ne(df['status'].shift()).cumsum()  # id of each consecutive run
m2 = g.map(g.value_counts()).ge(N)                  # run length >= N
df = df[m1 & m2]
print (df)
  id  time status
2  b     1      T
3  b     2      T
4  a     3      T
5  a     4      T
7  b     4      T
8  b     5      T
Details:
print (df.assign(m1=m1, g=g, counts=g.map(g.value_counts()), m2=m2))
  id  time status     m1  g  counts     m2
0  a     1      T   True  1       1  False
1  a     2      F  False  2       1  False
2  b     1      T   True  3       4   True
3  b     2      T   True  3       4   True
4  a     3      T   True  3       4   True
5  a     4      T   True  3       4   True
6  b     3      F  False  4       1  False
7  b     4      T   True  5       2   True
8  b     5      T   True  5       2   True
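Note that the run detection above treats the frame as one sequence (rows 2-5 form a single run of four even though they mix ids b and a; the sample output happens to be the same either way). If runs should instead be counted within each id, a per-id sketch could look like this:

N = 2
# run labels restart for every id
run = df.groupby('id')['status'].transform(lambda s: s.ne(s.shift()).cumsum()).rename('run')
m1 = df['status'].eq('T')
# size of each (id, run) block, compared against the threshold
m2 = df.groupby(['id', run])['status'].transform('size').ge(N)
df = df[m1 & m2]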

pandas groupby apply optimizing a loop

For the following data:
index  bond  stock  investor_bond  investor_stock
0      1     2      A              B
1      1     2      A              E
2      1     2      A              F
3      1     2      B              B
4      1     2      B              E
5      1     2      B              F
6      1     3      A              A
7      1     3      A              E
8      1     3      A              G
9      1     3      B              A
10     1     3      B              E
11     1     3      B              G
12     2     4      C              F
13     2     4      C              A
14     2     4      C              C
15     2     5      B              E
16     2     5      B              B
17     2     5      B              H
Bond 1 has two investors, A and B. Stock 2 has three investors, B, E and F. For each investor pair (investor_bond, investor_stock), we want to filter it out if the two investors have ever invested in the same bond/stock.
For example, the pair (B, F) at index=5 should be filtered out because both of them invested in stock 2.
Sample output should be like:
index  bond  stock  investor_bond  investor_stock
11     1     3      B              G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)

stock_list = A1.groupby(['bond', 'stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
stock_list = stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list = stock_list.rename(columns={0: 's'})

A1 = pd.merge(A1, stock_list, on='bond', how='left')
A1['in_out'] = False

for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
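A vectorized sketch of the stated rule (an assumption: a pair is dropped whenever its two investors ever held the same bond or the same stock, with holdings read from both investor columns) replaces the row loop with a self-merge:

import pandas as pd

# long table of (investor, asset) holdings, keeping bond ids and stock ids separate
holdings = pd.concat([
    df[['investor_bond', 'bond']]
        .rename(columns={'investor_bond': 'investor', 'bond': 'asset'}).assign(kind='bond'),
    df[['investor_stock', 'stock']]
        .rename(columns={'investor_stock': 'investor', 'stock': 'asset'}).assign(kind='stock'),
]).drop_duplicates()

# every pair of investors that ever co-held an asset
pairs = holdings.merge(holdings, on=['kind', 'asset'])
bad = set(zip(pairs['investor_x'], pairs['investor_y']))

# keep only rows whose (investor_bond, investor_stock) pair never co-invested
mask = [(b, s) not in bad for b, s in zip(df['investor_bond'], df['investor_stock'])]
out = df[mask]

On the sample data this keeps only index 11 (B, G), since B and G never share an asset; self-pairs such as (B, B) are dropped automatically because an investor always co-holds assets with itself.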

Count Boolean values from Pivot table with pandas

I have a dataframe df defined like so:
    A  B  C  D    E      F
0   a  z  l  1  qqq   True
1   a  z  l  2  qqq   True
2   a  z  l  3  qqq  False
3   a  z  r  1  www   True
4   a  z  r  2  www  False
5   a  z  r  2  www  False
6   s  x  7  2  eee   True
7   s  x  7  3  eee  False
8   s  x  7  4  eee   True
9   s  x  5  1  eee   True
10  d  c  l  1  rrr   True
11  d  c  l  2  rrr  False
12  d  c  r  1  fff  False
13  d  c  r  2  fff   True
14  d  c  r  3  fff   True
My goal is to create a table based on the unique values of columns A, B and C so that I am able to count the number of elements in column D and the number of unique elements in column E.
The output looks like this:
     D  E
A B
a z  6  2
d c  5  2
s x  4  2
Here, for example, 6 is how many elements are present where column A has the value a, and 2 indicates the number of unique elements in column E (qqq, www).
I was able to achieve this goal by using the following lines of code:
# Define dataframe
df = pd.DataFrame({'A': ['a','a','a','a','a','a','s','s','s','s','d','d','d','d','d'],
                   'B': ['z','z','z','z','z','z','x','x','x','x','c','c','c','c','c'],
                   'C': ['l','l','l','r','r','r','7','7','7','5','l','l','r','r','r'],
                   'D': ['1','2','3','1','2','2','2','3','4','1','1','2','1','2','3'],
                   'E': ['qqq','qqq','qqq','www','www','www','eee','eee','eee','eee','rrr','rrr','fff','fff','fff'],
                   'F': [True,True,False,True,False,False,True,False,True,True,True,False,False,True,True]})
# My code so far
a = df.pivot_table(index=['A','B','C'], aggfunc={'E':'nunique', 'D':'count'}).sort_values(by='E')
a = a.pivot_table(index=['A','B'], aggfunc='sum').sort_values(by='E')
The Problem:
Now I would like also to count the number of True or False values present in the dataframe with the same criteria presented before so that the result looks like this:
     D  E  True  False
A B
a z  6  2     3      3
d c  5  2     3      2
s x  4  2     3      1
As you can see, the number of True values where A=a is 3, and the number of False values is 3 as well.
What is a smart and elegant way to achieve my final goal?
Using your code, you could extend it like this:
# My code so far
a = df.pivot_table(index=['A','B','C'], aggfunc={'E':'nunique', 'D':'count','F':sum}).sort_values(by='E').rename(columns={'F':'F_True'})
a = a.pivot_table(index=['A','B'], aggfunc='sum').sort_values(by='E').eval('F_False = D - F_True')
Output:
     D  E  F_True  F_False
A B
a z  6  2     3.0      3.0
d c  5  2     3.0      2.0
s x  4  2     3.0      1.0
You just need two steps (the level= argument of sum and the positional concat axis are gone in modern pandas, so written out as a groupby on the index levels):
pd.concat([df.groupby(['A','B','C']).agg({'E': 'nunique', 'D': 'size'}).groupby(level=[0, 1]).sum(),
           df.groupby(['A','B']).F.value_counts().unstack()], axis=1)
Out[702]:
     E  D  False  True
A B
a z  2  6      3     3
d c  2  5      2     3
s x  2  4      1     3
Using value_counts
df.groupby(['A','B']).F.value_counts().unstack()
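For reference, a sketch of the same two steps in one place, using pd.crosstab (equivalent to value_counts().unstack() here) for the True/False counts:

# per-(A, B, C) counts of D and unique E, then summed back up to the (A, B) level
counts = (df.groupby(['A', 'B', 'C'])
            .agg(D=('D', 'size'), E=('E', 'nunique'))
            .groupby(level=['A', 'B']).sum())

# True/False counts per (A, B)
flags = pd.crosstab([df['A'], df['B']], df['F'])

out = counts.join(flags)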

Compare two columns of two dataframes - pandas

I have 2 data frames like:
df_out:
a  b  c  d
1  1  2  1
2  1  2  3
3  1  3  5
df_fin:
a  e  f  g
1  0  2  1
2  5  2  3
3  1  3  5
5  2  4  6
7  3  2  5
I want to get the result as:
a  b  c  d  a  e  f  g
1  1  2  1  1  0  2  1
2  1  2  3  2  5  2  3
3  1  3  5  3  1  3  5
In other words, I have two different data frames that share one column (a). I want to compare these two columns (df_fin.a and df_out.a), select the rows from df_fin that have the same value in column a, and create a new dataframe with the selected rows from df_fin plus the columns from df_out.
I think you need merge with left join:
df = pd.merge(df_out, df_fin, on='a', how='left')
print (df)
   a  b  c  d  e  f  g
0  1  1  2  1  0  2  1
1  2  1  2  3  5  2  3
2  3  1  3  5  1  3  5
EDIT:
df1 = df_fin[df_fin['a'].isin(df_out['a'])]    # rows of df_fin whose 'a' also appears in df_out
df2 = df_out.join(df1.set_index('a'), on='a')  # attach df_fin's columns by matching 'a'
print (df2)
   a  b  c  d  e  f  g
0  1  1  2  1  0  2  1
1  2  1  2  3  5  2  3
2  3  1  3  5  1  3  5
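For completeness, a self-contained sketch reproducing the example (the frames are re-typed from the question):

import pandas as pd

df_out = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1],
                       'c': [2, 2, 3], 'd': [1, 3, 5]})
df_fin = pd.DataFrame({'a': [1, 2, 3, 5, 7], 'e': [0, 5, 1, 2, 3],
                       'f': [2, 2, 3, 4, 2], 'g': [1, 3, 5, 6, 5]})

# left join keeps every row of df_out and appends the matching df_fin columns
print(pd.merge(df_out, df_fin, on='a', how='left'))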