Pandas count occurrence within column on condition being satisfied

I am trying to count rows by group. See the input and expected output below.
input:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['col3'] = [1,1,1,1,1,1,1]
output:
col4
0 2
1 2
2 2
3 2
4 1
5 1
6 1
I tried playing around with groupby and aggregation, by doing:
s = df.groupby(['col1','col2'])['col3'].sum()
and the output I got was
a 4 2
5 2
b 6 1
7 1
8 1
How do I add this as a column on the main df?
Thanks very much!

Use transform with len or 'size':
df['count'] = df.groupby(['col1','col2'])['col3'].transform(len)
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
df['count'] = df.groupby(['col1','col2'])['col3'].transform('size')
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
Column col3 is not necessary, though; you can select col1 or col2 instead:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['count'] = df.groupby(['col1','col2'])['col1'].transform(len)
df['count1'] = df.groupby(['col1','col2'])['col2'].transform(len)
print (df)
col1 col2 count count1
0 a 4 2 2
1 a 4 2 2
2 a 5 2 2
3 a 5 2 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
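One point worth knowing when choosing between these: 'size' counts rows, while 'count' counts non-missing values, so the two can differ once the selected column contains NaN. A minimal sketch, assuming some hypothetical missing values in col3 (they are not in the original data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "a", "a", "b", "b", "b"],
    "col2": [4, 4, 5, 5, 6, 7, 8],
    # hypothetical: two values replaced by NaN for illustration
    "col3": [1, np.nan, 1, 1, 1, 1, np.nan],
})

grouped = df.groupby(["col1", "col2"])["col3"]
df["size"] = grouped.transform("size")    # rows per group, NaN included
df["count"] = grouped.transform("count")  # non-missing values per group
print(df)
```

If the goal is the number of rows per group, 'size' is probably the safer choice, since it does not depend on which column is selected.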

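Try this. It gives the count here only because every value in col3 is 1, so the group sum equals the group size:
df['count'] = df.groupby(['col1','col2'])['col3'].transform('sum')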
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
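Note that summing col3 is a count only while col3 holds nothing but ones. A small sketch with hypothetical col3 values (changed for illustration, not from the question) shows the two results drifting apart:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "a", "a", "b", "b", "b"],
    "col2": [4, 4, 5, 5, 6, 7, 8],
    # hypothetical: col3 is no longer all ones
    "col3": [1, 1, 2, 3, 1, 1, 10],
})

grouped = df.groupby(["col1", "col2"])["col3"]
df["total"] = grouped.transform("sum")    # sum of col3 within each group
df["n_rows"] = grouped.transform("size")  # number of rows in each group
print(df)
```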

Related

Pandas shift logic

I have a dataframe like:
col1 customer
1 a
3 a
1 b
2 b
3 b
5 b
I want the logic to be like this:
col1 customer col2
1 a 1
3 a 1
1 b 1
2 b 2
3 b 3
5 b 3
As you can see, if the customer has consistent values in col1, give that value; if not, give the last consistent number, which is 3 here.
I tried using df.shift() but got stuck.
Further Example:
col1
1
1
1
3
5
8
10
This customer should be given a value of 1, because that is the last consistent value for them.
Update
If you have more than one month, you can use this version:
import numpy as np
inc_count = lambda x: np.where(x.diff(1) == 1, x, x.shift(fill_value=x.iloc[0]))
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
col1 customer col2
0 1 a 1
1 3 a 1
2 1 b 1
3 2 b 2
4 3 b 3
5 5 b 3
Maybe you want to increment a counter whenever a row's value is exactly one more than the previous row's value:
# Same as df['col1'].diff().eq(1).cumsum().add(1)
df['col2'] = df['col1'].eq(df['col1'].shift()+1).cumsum().add(1)
print(df)
# Output
col1 customer col2
0 1 a 1
1 3 a 1
2 1 b 1
3 2 b 2
4 3 b 3
5 5 b 3
Or for each customer:
inc_count = lambda x: x.eq(x.shift()+1).cumsum().add(1)
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
col1 customer col2
0 1 a 1
1 3 a 1
2 1 b 1
3 2 b 2
4 3 b 3
5 5 b 3
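For reference, here is the per-customer version as a self-contained sketch, with the lambda written out as a named function. The sample frame is rebuilt from the question; the standalone Series at the end is the "further example" from the question, treated here as a single customer (an assumption).

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 3, 1, 2, 3, 5],
    "customer": ["a", "a", "b", "b", "b", "b"],
})

def inc_count(x):
    # True where a value is exactly one more than the previous value;
    # the first row compares against NaN and is therefore False.
    is_step = x.eq(x.shift() + 1)
    # Running count of such steps, starting from 1.
    return is_step.cumsum().add(1)

df["col2"] = df.groupby("customer")["col1"].transform(inc_count)
print(df)

# The "further example" from the question, as one customer.
single = pd.Series([1, 1, 1, 3, 5, 8, 10])
print(inc_count(single).tolist())
```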

How to compute column sum on the basis of other column value in pandas dataframe?

P T1 T2 T3
0 1 2 3
1 1 2 0
2 3 1 2
3 1 0 2
In the above pandas dataframe df,
I want to add columns on the basis of the value of column 'P'.
if df['P'] == 0: 0
if df['P'] == 1: T1 (=1)
if df['P'] == 2: T1+T2 (=3+1=4)
if df['P'] == 3: T1+T2+T3 (=1+0+2=3)
In other words, I want to add from T1 to TN if df['P'] == N.
How can I implement this with Python code?
EDIT:
To sum values according to the P column, select the relevant columns with DataFrame.filter, build a mask by broadcasting np.arange over the number of filtered columns and comparing it with the P values, pass this mask to DataFrame.where, and finally sum per row:
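import numpy as np
import pandas as pd

np.random.seed(20)
c = [f'{x}{i + 1}' for x in ['T','U','V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10,10)), columns=['P'] + c)
# column vector of P values, so it broadcasts against the column positions
arrP = df['P'].to_numpy()[:, None]
for prefix in ['T','U','V']:
    df1 = df.filter(regex=rf'^{prefix}')
    # keep only the first P columns of each row, zero out the rest, then sum
    df[f'{prefix}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)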
print (df)
P T1 T2 T3 U1 U2 U3 V1 V2 V3 T_SUM U_SUM V_SUM
0 3 2 3 3 0 2 1 0 3 2 8 3 5
1 3 2 0 2 0 1 2 2 3 3 4 3 8
2 0 1 2 2 2 0 1 1 3 1 0 0 0
3 3 2 2 2 1 3 2 1 3 2 6 6 6
4 3 1 1 3 1 2 2 0 2 3 5 5 5
5 2 3 2 3 1 1 1 0 3 0 5 2 3
6 2 3 2 3 3 3 2 1 1 2 5 6 2
7 3 2 0 2 1 1 2 2 2 3 4 4 7
8 2 2 1 0 2 2 0 3 3 0 3 4 6
9 2 2 3 2 2 3 2 2 1 1 5 5 3
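The same masking idea can be checked against the small table from the question, where the expected sums are 0, 1, 4 and 3. This is only a sketch: the regex `^T\d+$` and the column name T_SUM are choices made here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "P":  [0, 1, 2, 3],
    "T1": [1, 1, 3, 1],
    "T2": [2, 2, 1, 0],
    "T3": [3, 0, 2, 2],
})

t_cols = df.filter(regex=r"^T\d+$")     # T1, T2, T3
p = df["P"].to_numpy()[:, None]         # shape (4, 1)
# mask[i, j] is True when column position j is below P for row i
mask = np.arange(t_cols.shape[1]) < p   # shape (4, 3) via broadcasting
df["T_SUM"] = t_cols.where(mask, 0).sum(axis=1)
print(df)
```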

Pandas concat function with count assigned for each iteration

When replicating a dataframe using concat with index (see example here), is there a way to assign a count variable for each iteration in a column c?
Orig df:
a b
0 1 2
1 2 3
df replicated with pd.concat([df] * 5) and with an additional column c:
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
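Here is a self-contained sketch of the approach above, next to one possible alternative that tags each copy with DataFrame.assign before concatenating. The alternative is not part of the original answer; it is just another way to get the same column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 3]})
N = 5

# Approach from the answer: concatenate first, then label each block of rows.
new_df = pd.concat([df] * N)
new_df["c"] = np.repeat(np.arange(1, N + 1), len(df))

# Alternative sketch: label each copy before concatenating.
alt_df = pd.concat([df.assign(c=i) for i in range(1, N + 1)])

print(new_df)
```

Both keep the repeated original index (0, 1, 0, 1, ...); pass ignore_index=True to pd.concat if a fresh index is preferred.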

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values so to make df like this:
total
1 2
3 2
4 1
5 2
8 2
is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=1)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
Another alternative is to use stack followed by value_counts, convert the result to a frame, and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(df, return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
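As a quick cross-check, the NumPy and pandas routes can be run side by side on the frame from the question. A minimal sketch; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 5, 5], "B": [8, 4, 8], "C": [3, 3, 1]},
    index=[1, 2, 3],
)

# NumPy route: unique values over the whole array, with their counts.
ids, counts = np.unique(df.to_numpy(), return_counts=True)
via_numpy = pd.DataFrame({"total": counts}, index=ids)

# pandas route: flatten to one Series, count, then sort by value.
via_pandas = df.stack().value_counts().sort_index().to_frame("total")

print(via_numpy)
print(via_pandas)
```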

compare two column of two dataframe pandas

I have 2 data frames like :
df_out:
a b c d
1 1 2 1
2 1 2 3
3 1 3 5
df_fin:
a e f g
1 0 2 1
2 5 2 3
3 1 3 5
5 2 4 6
7 3 2 5
I want to get result as :
a b c d a e f g
1 1 2 1 1 0 2 1
2 1 2 3 2 5 2 3
3 1 3 5 3 1 3 5
In other words, I have two different data frames that share one column (a). I want to compare these two columns (df_fin.a and df_out.a), select the rows from df_fin that have the same value in column a, and create a new dataframe that has the selected rows from df_fin plus the columns from df_out.
I think you need merge with left join:
df = pd.merge(df_out, df_fin, on='a', how='left')
print (df)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
EDIT:
df1 = df_fin[df_fin['a'].isin(df_out['a'])]
df2 = df_out.join(df1.set_index('a'), on='a')
print (df2)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
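With the data from the question, a left join and an inner join give the same result, because every value of a in df_out also appears in df_fin. The difference shows up only when df_out has a key that df_fin lacks. A minimal sketch, assuming a hypothetical extra row with a = 9 in df_out:

```python
import pandas as pd

df_out = pd.DataFrame({
    "a": [1, 2, 3, 9],   # 9 is a hypothetical key with no match in df_fin
    "b": [1, 1, 1, 1],
    "c": [2, 2, 3, 4],
    "d": [1, 3, 5, 7],
})
df_fin = pd.DataFrame({
    "a": [1, 2, 3, 5, 7],
    "e": [0, 5, 1, 2, 3],
    "f": [2, 2, 3, 4, 2],
    "g": [1, 3, 5, 6, 5],
})

left = df_out.merge(df_fin, on="a", how="left")    # keeps a = 9, with NaN in e, f, g
inner = df_out.merge(df_fin, on="a", how="inner")  # keeps only keys found in both

print(left)
print(inner)
```

If only rows present in both frames are wanted, how='inner' is likely the closer fit.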