How to make a customized count of occurrences in a grouped dataframe - pandas

Please find my input and output below:
INPUT :
Id Values Status
0 Id001 red online
1 Id002 brown running
2 Id002 white off
3 Id003 blue online
4 Id003 green valid
5 Id003 yellow running
6 Id004 rose off
7 Id004 purple off
OUTPUT :
Id Values Status Id_occ Val_occ Sta_occ
0 Id001 red online 1 1 1
1 Id002 brown|white running|off 2 2 2
2 Id003 blue|green|yellow online|valid|running 3 3 3
3 Id004 rose|purple off 2 2 1
I was able to re-calculate the columns Values and Status, but I don't know how to create the three columns of occurrences.
import pandas as pd

df = pd.DataFrame({'Id': ['Id001', 'Id002', 'Id002', 'Id003', 'Id003', 'Id003', 'Id004', 'Id004'],
                   'Values': ['red', 'brown', 'white', 'blue', 'green', 'yellow', 'rose', 'purple'],
                   'Status': ['online', 'running', 'off', 'online', 'valid', 'running', 'off', 'off']})

out = (df.groupby(['Id'])
         .agg({'Values': 'unique', 'Status': 'unique'})
         .applymap(lambda x: '|'.join([str(val) for val in list(x)]))
         .reset_index()
       )
Do you have any suggestions on how to create the three columns of occurrences? Also, is there a better way to re-calculate the columns Values and Status?

You can use named aggregation and a custom function:
ujoin = lambda s: '|'.join(dict.fromkeys(s))

out = (df
       .assign(N_off=df['Status'].eq('off'))
       .groupby(['Id'], as_index=False)
       .agg(**{'Values': ('Values', ujoin),
               'Status': ('Status', ujoin),
               'Id_occ': ('Values', 'size'),
               'Val_occ': ('Values', 'nunique'),
               'Stat_occ': ('Status', 'nunique'),
               'N_off': ('N_off', 'sum'),
               })
       )
Output:
Id Values Status Id_occ Val_occ Stat_occ N_off
0 Id001 red online 1 1 1 0
1 Id002 brown|white running|off 2 2 2 1
2 Id003 blue|green|yellow online|valid|running 3 3 3 0
3 Id004 rose|purple off 2 2 1 2

Use:
df.groupby('Id')['Values'].nunique()
For both columns at once:
df.groupby('Id')[['Values', 'Status']].nunique()
Output:
Values Status
Id
Id001 1 1
Id002 2 2
Id003 3 3
Id004 2 1
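If you also want the joined strings and the counts together (as in the expected output), a minimal sketch combining both ideas, using the question's column names, could look like this:
joined = (df.groupby('Id')[['Values', 'Status']]
            .agg(lambda s: '|'.join(s.unique())))     # per-Id joined strings
counts = (df.groupby('Id')
            .agg(Id_occ=('Values', 'size'),           # rows per Id
                 Val_occ=('Values', 'nunique'),       # distinct Values per Id
                 Sta_occ=('Status', 'nunique')))      # distinct Status per Id
out = joined.join(counts).reset_index()
This joins the per-Id string columns with the per-Id counts on the Id index.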

Related

How to concatenate text from multiple rows in a dataframe based on a specific structure

I want to merge multiple rows of a dataframe that contain text with a specific structure.
For example, I have
df = pd.DataFrame([
    (1, 'john', 'merge'),
    (1, 'smith,', 'merge'),
    (1, 'robert', 'merge'),
    (1, 'g', 'merge'),
    (1, 'owens,', 'merge'),
    (2, 'sarah will', 'OK'),
    (2, 'ali kherad', 'OK'),
    (2, 'david', 'merge'),
    (2, 'lu,', 'merge'),
], columns=['ID', 'Name', 'Merge'])
which is
ID Name Merge
1 john merge
1 smith, merge
1 robert merge
1 g merge
1 owens, merge
2 sarah will OK
2 ali kherad OK
2 david merge
2 lu, merge
The goal is to have a dataframe that merges the text in rows like this:
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
I found a way to create the column 'Merge' to know if I need to merge or not. Then I tried this
df = pd.DataFrame(df[df['Merge']=='merge'].groupby(['ID','Merge'], axis=0)['Name'].apply(' '.join))
res = df.apply(lambda x: x.str.split(',').explode()).reset_index().drop(['Merge'], axis=1)
First I group the names where the column 'Merge' is equal to 'merge'. I know this is not the best way because it only keeps rows matching that condition, whereas my dataframe should also keep the rows where the column 'Merge' is equal to 'OK'.
Then I split by ','.
The result is
ID Name
0 1 john smith
1 1 robert g owens
2 1
3 2 david lu
4 2
The other problem is that the order is not correct in my real example when I have more than 4000 rows. How can I keep the order and merge the text when necessary?
Make a grouper for the grouping:
cond1 = df['Name'].str.contains(r',$') | df['Merge'].eq('OK')
g = cond1[::-1].cumsum()
g (check the reversed index):
8 1
7 1
6 2
5 3
4 4
3 4
2 4
1 5
0 5
dtype: int32
Remove the trailing ',' and group by ID and g:
out = (df['Name'].str.replace(r',$', '', regex=True)
         .groupby([df['ID'], g], sort=False).agg(' '.join)
         .droplevel(1).reset_index())
out
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
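For reference, here are the pieces above combined into one runnable snippet (same logic, sample df from the question; the regex patterns are written as raw strings):
import pandas as pd

df = pd.DataFrame([
    (1, 'john', 'merge'), (1, 'smith,', 'merge'), (1, 'robert', 'merge'),
    (1, 'g', 'merge'), (1, 'owens,', 'merge'), (2, 'sarah will', 'OK'),
    (2, 'ali kherad', 'OK'), (2, 'david', 'merge'), (2, 'lu,', 'merge'),
], columns=['ID', 'Name', 'Merge'])

# rows that close a chunk: a trailing comma or an 'OK' row
cond1 = df['Name'].str.contains(r',$') | df['Merge'].eq('OK')
# reversed cumulative sum gives one label per chunk while preserving order
g = cond1[::-1].cumsum()

out = (df['Name'].str.replace(r',$', '', regex=True)
         .groupby([df['ID'], g], sort=False).agg(' '.join)
         .droplevel(1).reset_index())
print(out)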

How to drop row index and flatten index in this way

I have the following dfe :-
id categ level cols value comment
1 A PG Apple 428 comment1
1 A CD Apple 175 comment1
1 C PG Apple 226 comment1
1 C AB Apple 884 comment1
1 C CD Apple 288 comment1
1 B PG Apple 712 comment1
1 B AB Apple 849 comment1
2 B CD Apple 376 comment1
2 C None Orange 591 comment1
2 B CD Orange 135 comment1
2 D None Orange 423 comment1
2 A AB Orange 1e13 comment1
2 D PG Orange 1e15 comment2
df2 = pd.DataFrame({'s2': {0: 1, 1: 2, 2: 3}, 'level': {0: 'PG', 1: 'AB', 2: 'CD'}})
df1 = pd.DataFrame({'sl': {0: 1, 1: 2, 2: 3, 3: 4}, 'set': {0: 'A', 1: 'C', 2: 'B', 3: 'D'}})
dfe = (dfe[['categ', 'level', 'cols', 'id', 'comment', 'value']]
       .merge(df1.rename({'set': 'categ'}, axis=1), how='left', on='categ')
       .merge(df2, how='left', on='level'))
na = dfe['level'].isna()
dfs = {'no_null': dfe[~na], 'null': dfe[na]}
with pd.ExcelWriter('XYZ.xlsx') as writer:
    for p, r in dfs.items():
        if p == 'no_null':
            c = ['cols', 's2', 'level']
        else:
            c = 'cols'
        df = r.pivot_table(index=['id', 'sl', 'comment', 'categ'], columns=c, values=['value'])
        df.columns = df.columns.droplevel([0, 2])
        df = df.reset_index().drop(('sl', ''), axis=1).set_index('categ')
        for (id, comment), sdf in df.groupby(['id', 'comment']):
            df = sdf.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
            df.to_excel(writer, sheet_name=name)  # note: 'name' is undefined in the snippet as posted
Running this I get results displayed in excel this way :-
I want to order in certain way, what I tried :-
df = r.pivot_table(index=['id','sl','comment','categ'], columns=c, values='value')
df.columns = df.columns.droplevel([1])
df = df.reset_index().drop(('sl',''), axis=1).set_index('categ')
This gives me a "Too many levels: Index has only 2 levels, not 3" error; I don't know what I'm missing or doing wrong here.
My expected output for arrangement of headings is :-
I would like to know if the headings can be written to Excel in CAPS as shown in the expected output.
EDIT 1
I tried the answer and Im getting this view :-
I want to be able to display ID & COMMENT only once (as it's already grouped by ID in the code logic), drop the sl column and the first column (0, 1, 2), and also delete the blank row above row 0.
Given dfe as:
categ level cols id comment value sl s2
0 A PG Apple 1 comment1 4.280000e+02 1 1.0
1 A CD Apple 1 comment1 1.750000e+02 1 3.0
2 C PG Apple 1 comment1 2.260000e+02 2 1.0
3 C AB Apple 1 comment1 8.840000e+02 2 2.0
4 C CD Apple 1 comment1 2.880000e+02 2 3.0
5 B PG Apple 1 comment1 7.120000e+02 3 1.0
6 B AB Apple 1 comment1 8.490000e+02 3 2.0
7 B CD Apple 2 comment1 3.760000e+02 3 3.0
8 C None Orange 2 comment1 5.910000e+02 2 NaN
9 B CD Orange 2 comment1 1.350000e+02 3 3.0
10 D None Orange 2 comment1 4.230000e+02 4 NaN
11 A AB Orange 2 comment1 1.000000e+13 1 2.0
12 D PG Orange 2 comment2 1.000000e+15 4 1.0
Then try:
c = ['cols', 's2', 'level']  # 'c' as defined in the no-null branch of the question's code
df = dfe.pivot_table(index=['id', 'comment', 'categ'], columns=c, values='value')
df.columns = df.columns.droplevel([1])
df = (df.rename_axis(columns=[None, None])
        .reset_index(col_level=1)
        .rename(columns=lambda x: x.upper()))
df.to_excel('testa1.xlsx')
Output:
Notes:
Removed the [] around 'value' in pivot_table so that 'value' is not included as a column index level.
Aligned 'id', 'comment' and 'categ' with column index level 1 using the col_level parameter.
See this post about the blank line, https://stackoverflow.com/a/52498899/6361531.
I think it would be easier to drop the column names and then replace them with custom ones:
df.columns = df.columns.droplevel()
df.columns = pd.MultiIndex.from_tuples([("", "ID"), ("", "CATEG"), ("apple", "PG"), ("apple", "AB"), ("apple", "CD"), ("orange", "PG"), ("orange", "AB"), ("orange", "CD")])

Comparing columns of different size with pandas

I have two dataframes, A and B. I need to create a third column such that when a number in A matches a number in B it writes 'correct', otherwise it marks it as null. Any suggestions?
In [100]: A
Out[100]:
a
0 1
1 2
2 3
3 4
4 5
In [101]: B
Out[101]:
a
2 3
3 4
4 5
In [102]: A['Result'] = A['a'].isin(B['a']).replace({False: None, True: 'correct'})
In [103]: A
Out[103]:
a Result
0 1 None
1 2 None
2 3 correct
3 4 correct
4 5 correct
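If you prefer an explicit two-branch expression, the same result can be had with numpy.where (a small sketch, assuming numpy is available):
import numpy as np

# 'correct' where A['a'] appears in B['a'], None otherwise
A['Result'] = np.where(A['a'].isin(B['a']), 'correct', None)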

python pandas - set column value based on index and/or ID of concatenated dataframes

I have a concatenated dataframe built from at least two dataframes:
i.e.
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1, df2], keys=(1, 2))
df3.loc[(1), 'Type'] = 'C'
When you concat, you can assign keys to the dataframes. This creates a MultiIndex with the keys separating the concatenated dataframes. Then, when you use .loc with a key, you can put parentheses around the key to select that group. In the code above, all the Types of df1 (which has a key of 1) are changed to C.
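As a hedged end-to-end sketch on the sample frames above (also setting df2's rows to B, then dropping the key level again):
df3 = pd.concat([df1, df2], keys=(1, 2))
df3.loc[(1), 'Type'] = 'C'   # rows that came from df1
df3.loc[(2), 'Type'] = 'B'   # rows that came from df2
df3 = df3.droplevel(0)       # back to the original ConcatDf-style index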
Use merge with indicator=True to find which rows belong to df1 or df2. Next, use np.where to assign 'C' or 'B'.
import numpy as np

t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4

Pandas: expanding_apply with groupby for unique counts of string type

I have the dataframe:
import pandas as pd
id = [0,0,0,0,1,1,1,1]
color = ['red','blue','red','black','blue','red','black','black']
test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])
and would like to create a column of the running count of the unique colors grouped by id so that the final dataframe looks like this:
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
I tried this simple way:
import numpy as np

def len_unique(x):
    return len(np.unique(x))

test['expanding_unique_count'] = test.groupby('id')['color'].apply(lambda x: pd.expanding_apply(x, len_unique))
And got ValueError: could not convert string to float: black
If I change the colors to integers:
color = [1,2,1,3,2,1,3,3]
test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])
Then running the same code above produces the desired result. Is there a way for this to work while maintaining the string type for the column color?
It looks like expanding_apply and rolling_apply mainly work on numeric values. Maybe try creating a numeric column that codes the color strings as numeric values (this can be done by making the color column categorical), and then apply expanding_apply.
# processing
# ===================================
# create numeric label
test['numeric_label'] = pd.Categorical(test['color']).codes
# output: array([2, 1, 2, 0, 1, 2, 0, 0], dtype=int8)
# your expanding function
test['expanding_unique_count'] = test.groupby('id')['numeric_label'].apply(lambda x: pd.expanding_apply(x, len_unique))
# drop the auxiliary column
test.drop('numeric_label', axis=1)
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
Edit:
def func(group):
    return pd.Series(1, index=group.groupby('color').head(1).index).reindex(group.index).fillna(0).cumsum()
test['expanding_unique_count'] = test.groupby('id', group_keys=False).apply(func)
print(test)
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
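On current pandas versions pd.expanding_apply no longer exists, so as a hedged alternative the same running unique count can be computed with a duplicated/cumsum trick while keeping color as strings:
test['expanding_unique_count'] = (
    test.groupby('id')['color']
        .transform(lambda s: (~s.duplicated()).cumsum())  # count of distinct colors seen so far per id
)
Within each id group, ~s.duplicated() flags the first occurrence of each color, and the cumulative sum of those flags is the expanding unique count.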