I have a data frame that looks like the following
df = pd.DataFrame({'group':[1,1,2,2,2],'time':[1,2,3,4,5],'C':[6,7,8,9,10]})
group time C
0 1 1 6
1 1 2 7
2 2 3 8
3 2 4 9
4 2 5 10
and I'm looking to label the first element (in terms of time) in each group as True, i.e.:
group time C first_in_group
0 1 1 6 True
1 1 2 7 False
2 2 3 8 True
3 2 4 9 False
4 2 5 10 False
I tried several combinations of groupby and first but did not manage to achieve what I wanted.
Is there an elegant way to do it in Pandas?
Use duplicated:
df['first_in_group'] = ~df.group.duplicated()
OUTPUT:
group time C first_in_group
0 1 1 6 True
1 1 2 7 False
2 2 3 8 True
3 2 4 9 False
4 2 5 10 False
NOTE: sort first, if required:
df = df.sort_values(['group', 'time'])
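Putting the answer together as a self-contained snippet (recreating the asker's frame; the sort is shown for completeness, even though this sample is already sorted):

```python
import pandas as pd

df = pd.DataFrame({'group': [1, 1, 2, 2, 2],
                   'time': [1, 2, 3, 4, 5],
                   'C': [6, 7, 8, 9, 10]})

# Sort so the earliest time comes first within each group, then mark the
# first row of each group: duplicated() is False only for the first
# occurrence of each group value, so its negation flags the first row.
df = df.sort_values(['group', 'time'])
df['first_in_group'] = ~df['group'].duplicated()
print(df)
```

An equivalent formulation is `df.groupby('group').cumcount().eq(0)`, which marks the row whose within-group position is zero.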
I have a dataframe like so..
id time status
-- ---- ------
a 1 T
a 2 F
b 1 T
b 2 T
a 3 T
a 4 T
b 3 F
b 4 T
b 5 T
I would like to fetch the ids that continuously have the status 'T' for a certain threshold number of times (say 2 in this case).
Thus the fetched rows would be...
id time status
-- ---- ------
b 1 T
b 2 T
a 3 T
a 4 T
b 4 T
b 5 T
I can think of an iterative solution. What I am looking for is something more pandas/SQL-like. I think an order by id and then time, followed by a group by on id and then status, should work, but I'd like to be sure.
Compare values to T with Series.eq, label consecutive runs with Series.shift and Series.cumsum, get the length of each run by counting the labels with Series.value_counts and mapping the counts back with Series.map, compare the lengths against the threshold with Series.ge, and finally combine both masks with bitwise AND and filter by boolean indexing:
N = 2
m1 = df['status'].eq('T')
g = df['status'].ne(df['status'].shift()).cumsum()
m2 = g.map(g.value_counts()).ge(N)
df = df[m1 & m2]
print (df)
id time status
2 b 1 T
3 b 2 T
4 a 3 T
5 a 4 T
7 b 4 T
8 b 5 T
Details:
print (df.assign(m1=m1, g=g, counts=g.map(g.value_counts()), m2=m2))
id time status m1 g counts m2
0 a 1 T True 1 1 False
1 a 2 F False 2 1 False
2 b 1 T True 3 4 True
3 b 2 T True 3 4 True
4 a 3 T True 3 4 True
5 a 4 T True 3 4 True
6 b 3 F False 4 1 False
7 b 4 T True 5 2 True
8 b 5 T True 5 2 True
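The same logic can be wrapped into a small reusable helper (a sketch; the function name and signature are my own, not from the answer):

```python
import pandas as pd

def runs_of_at_least(s, value, n):
    """Mask rows belonging to a consecutive run of `value` of length >= n."""
    is_val = s.eq(value)
    run_id = s.ne(s.shift()).cumsum()            # label consecutive runs
    run_len = run_id.map(run_id.value_counts())  # length of each row's run
    return is_val & run_len.ge(n)

df = pd.DataFrame({'id': list('aabbaabbb'),
                   'time': [1, 2, 1, 2, 3, 4, 3, 4, 5],
                   'status': list('TFTTTTFTT')})
mask = runs_of_at_least(df['status'], 'T', 2)
print(df[mask])
```

Note that, as in the answer, runs are counted over the frame in its given order, not per id; sort or group first if runs should not cross id boundaries.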
For the following data:
index bond stock investor_bond investor_stock
0 1 2 A B
1 1 2 A E
2 1 2 A F
3 1 2 B B
4 1 2 B E
5 1 2 B F
6 1 3 A A
7 1 3 A E
8 1 3 A G
9 1 3 B A
10 1 3 B E
11 1 3 B G
12 2 4 C F
13 2 4 C A
14 2 4 C C
15 2 5 B E
16 2 5 B B
17 2 5 B H
Bond 1 has two investors, A and B. Stock 2 has three investors, B, E and F. For each investor pair (investor_bond, investor_stock), we want to filter it out if the two investors have ever invested in the same bond or stock.
For example, the pair (B, F) at index=5 should be filtered out because both of them invested in stock 2.
Sample output should be like:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list=A1.groupby(['bond','stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list=stock_list.rename(columns={0:'s'})
stock_list=stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list=stock_list.rename(columns={0:'s'})
A1=pd.merge(A1,stock_list,on='bond',how='left')
A1['in_out']=False
for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
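One vectorized direction (a sketch under my own reading of the problem; the variable names, the 'asset' prefixing and the anti-join column are all my own): treat every (investor, asset) membership as a row in a long "holdings" table, self-join it on the asset to enumerate every investor pair that shares a holding, then anti-join that set against the original frame.

```python
import pandas as pd

df = pd.DataFrame({
    'bond':           [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2],
    'stock':          [2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,5,5,5],
    'investor_bond':  list('AAABBBAAABBBCCCBBB'),
    'investor_stock': list('BEFBEFAEGAEGFACEBH'),
})

# Long-form holdings: one row per (investor, asset). Prefix the ids so
# bond 1 and stock 1 cannot collide.
bonds = df[['investor_bond', 'bond']].rename(columns={'investor_bond': 'investor'})
bonds['asset'] = 'b' + bonds['bond'].astype(str)
stocks = df[['investor_stock', 'stock']].rename(columns={'investor_stock': 'investor'})
stocks['asset'] = 's' + stocks['stock'].astype(str)
holdings = pd.concat([bonds[['investor', 'asset']],
                      stocks[['investor', 'asset']]]).drop_duplicates()

# Every investor pair sharing at least one asset: self-join on asset.
shared = (holdings.merge(holdings, on='asset')
          [['investor_x', 'investor_y']].drop_duplicates())
shared['overlap'] = True

# Anti-join: keep only pairs with no shared holding.
out = df.merge(shared, left_on=['investor_bond', 'investor_stock'],
               right_on=['investor_x', 'investor_y'], how='left')
out = out[out['overlap'].isna()].drop(columns=['investor_x', 'investor_y', 'overlap'])
print(out)
```

On the sample data this keeps only the (B, G) row, matching the expected output; both joins are hash-based merges, so this should scale far better than the nested Python loops.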
I have a dataframe df defined like so:
A B C D E F
0 a z l 1 qqq True
1 a z l 2 qqq True
2 a z l 3 qqq False
3 a z r 1 www True
4 a z r 2 www False
5 a z r 2 www False
6 s x 7 2 eee True
7 s x 7 3 eee False
8 s x 7 4 eee True
9 s x 5 1 eee True
10 d c l 1 rrr True
11 d c l 2 rrr False
12 d c r 1 fff False
13 d c r 2 fff True
14 d c r 3 fff True
My goal is to create a table based on the unique values of columns A, B and C so that I can count the number of elements in column D and the number of unique elements in column E.
The output looks like this:
D E
A B
a z 6 2
d c 5 2
s x 4 2
Where, for example, the 6 is the number of elements in column D for the rows where column A has value a, and the 2 is the number of unique elements in column E (qqq, www).
I was able to achieve this goal by using the following lines of code:
# Define dataframe
df = pd.DataFrame({'A':['a','a','a','a','a','a','s','s','s','s','d','d','d','d','d'],
'B': ['z','z','z','z','z','z','x','x','x','x','c','c','c','c','c'],
'C': ['l','l','l','r','r','r','7','7','7','5','l','l','r','r','r'],
'D': ['1','2','3','1','2','2','2','3','4','1','1','2','1','2','3'],
'E': ['qqq','qqq','qqq','www','www','www','eee','eee','eee','eee','rrr','rrr','fff','fff','fff'],
'F': [True,True,False,True,False,False,True,False,True,True,True,False,False,True,True]})
# My code so far
a = df.pivot_table(index=['A','B','C'], aggfunc={'E':'nunique', 'D':'count'}).sort_values(by='E')
a = a.pivot_table(index=['A','B'], aggfunc='sum').sort_values(by='E')
The Problem:
Now I would like also to count the number of True or False values present in the dataframe with the same criteria presented before so that the result looks like this:
D E True False
A B
a z 6 2 3 3
d c 5 2 3 2
s x 4 2 3 1
As you can see the number of True values where A=a are 3 and False values are 3 as well.
What is a smart and elegant way to achieve my final goal?
Using your code, you could extend it like this:
# My code so far
a = df.pivot_table(index=['A','B','C'], aggfunc={'E':'nunique', 'D':'count','F':sum}).sort_values(by='E').rename(columns={'F':'F_True'})
a = a.pivot_table(index=['A','B'], aggfunc='sum').sort_values(by='E').eval('F_False = D - F_True')
Output:
D E F_True F_False
A B
a z 6 2 3.0 3.0
d c 5 2 3.0 2.0
s x 4 2 3.0 1.0
You just need two steps:
pd.concat([df.groupby(['A','B','C']).agg({'E': 'nunique', 'D': 'size'}).sum(level=[0,1]),
           df.groupby(['A','B']).F.value_counts().unstack()], axis=1)
Out[702]:
E D False True
A B
a z 2 6 3 3
d c 2 5 2 3
s x 2 4 1 3
Using value_counts
df.groupby(['A','B']).F.value_counts().unstack()
I have 2 data frames like:
df_out:
a b c d
1 1 2 1
2 1 2 3
3 1 3 5
df_fin:
a e f g
1 0 2 1
2 5 2 3
3 1 3 5
5 2 4 6
7 3 2 5
I want to get the result as:
a b c d a e f g
1 1 2 1 1 0 2 1
2 1 2 3 2 5 2 3
3 1 3 5 3 1 3 5
In other words, I have two different data frames that share one column (a). I want to compare these two columns (df_fin.a and df_out.a), select the rows from df_fin that have the same value in column a, and create a new dataframe that has the selected rows from df_fin plus the added columns from df_out.
I think you need merge with left join:
df = pd.merge(df_out, df_fin, on='a', how='left')
print (df)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
EDIT:
df1 = df_fin[df_fin['a'].isin(df_out['a'])]
df2 = df_out.join(df1.set_index('a'), on='a')
print (df2)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
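As a runnable check, the left-join approach on the sample frames recreated here:

```python
import pandas as pd

df_out = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1],
                       'c': [2, 2, 3], 'd': [1, 3, 5]})
df_fin = pd.DataFrame({'a': [1, 2, 3, 5, 7], 'e': [0, 5, 1, 2, 3],
                       'f': [2, 2, 3, 4, 2], 'g': [1, 3, 5, 6, 5]})

# Left join keeps every row of df_out and pulls in the matching df_fin
# columns; df_fin rows with no partner (a = 5 and a = 7) drop out.
df = pd.merge(df_out, df_fin, on='a', how='left')
print(df)
```

Since every a in df_out also exists in df_fin here, an inner join would give the same result; the left join additionally guarantees no df_out row is lost if a match is missing.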