How to modify groups of a grouped pandas dataframe - pandas

I have this dataframe:
s = pd.DataFrame({'A': [*'1112222'], 'B': [*'abcdefg'], 'C': [*'ABCDEFG']})
that is like this:
A B C
0 1 a A
1 1 b B
2 1 c C
3 2 d D
4 2 e E
5 2 f F
6 2 g G
I want to do a groupby like this:
groups = s.groupby("A")
for example, the group 2 is:
g2 = groups.get_group("2")
that looks like this:
A B C
3 2 d D
4 2 e E
5 2 f F
6 2 g G
Anyway, I want to do some operation in each group.
Let me show how my final result should be:
A B C D
1 1 b B a=b;A=B
2 1 c C a=c;A=C
4 2 e E d=e;D=E
5 2 f F d=f;D=F
6 2 g G d=g;D=G
In other words, I drop the first row of each group, but combine it with the remaining rows of that group to create column D.
Any idea how to do this?
Summary of what I want to do in two lines:
I want to do a group by and in each group, I want to drop the first row. I also want to add a column to the whole dataframe that is based on the rows of the group
What I have tried:
In order to solve this, I am going to create a function:
def func(g):
    first_row_of_group = g.iloc[0]
    g = g.iloc[1:]
    g["C"] = g.apply(lambda row: ";".join([f'{a}={b}' for a, b in zip(row, first_row_of_group)]))
    return g
Then I am going to do this:
groups.apply(lambda g: func(g))

You can apply a custom function to each group where you add the elements from the first row to the remaining rows and remove it:
def remove_first(x):
    first = x.iloc[0]
    # copy the slice so that adding a column does not raise SettingWithCopyWarning
    x = x.iloc[1:].copy()
    x['D'] = first['B'] + '=' + x['B'] + ';' + first['C'] + '=' + x['C']
    # an equivalent operation
    # x['D'] = first.iloc[1] + '=' + x.iloc[:,1] + ';' + first.iloc[2] + '=' + x.iloc[:,2]
    return x
s = s.groupby('A').apply(remove_first).droplevel(0)
Output:
A B C D
1 1 b B a=b;A=B
2 1 c C a=c;A=C
4 2 e E d=e;D=E
5 2 f F d=f;D=F
6 2 g G d=g;D=G
Note: The dataframe shown in your question is constructed from
s = pd.DataFrame({'A': [*'1112222'], 'B': [*'abcdefg'], 'C': [*'ABCDEFG']})
but you give a different one as raw input.
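If you would rather avoid apply, the same result can probably be reached with vectorized group operations. This is only a sketch built on the sample frame from the question: transform('first') broadcasts each group's first row to every row of that group, and cumcount() marks the first row of each group so it can be filtered out. The names grouped, first and out are just illustrative.

```python
import pandas as pd

s = pd.DataFrame({'A': [*'1112222'], 'B': [*'abcdefg'], 'C': [*'ABCDEFG']})

grouped = s.groupby('A')

# Broadcast the first B and C of each group to all rows of that group
first = grouped[['B', 'C']].transform('first')

# Build column D from the group's first row and the current row
out = s.assign(D=first['B'] + '=' + s['B'] + ';' + first['C'] + '=' + s['C'])

# cumcount() is 0 for the first row of every group, so keep the rest
out = out[grouped.cumcount() > 0]
print(out)
```

Because nothing is re-indexed by group here, the original index is kept and no droplevel is needed.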

Related

Map Dictionary of Lists to pd Dataframe Column and Create Repeat Rows Based on n Number of List Contents

I am trying to use the following two components: 1) a dictionary of lists and 2) a dataframe column composed of the dictionary keys. I would like to map the n values of each list to their corresponding key in the existing pandas column, and create duplicate rows based on the number of items in the list. I would like to keep this as a dataframe and not convert it to a series.
ex. dictionary
d = {'a':['i','ii'],'b':['iii','iv'],'c':['v','vi','vii']}
ex. dataframe columns
Column1 Column2
0 g a
1 h b
2 i c
desired output:
Column1 Column2 Column3
0 g a i
1 g a ii
2 h b iii
3 h b iv
4 i c v
5 i c vi
6 i c vii
What if another dictionary had to be mapped similarly to these three columns from the output? Say, with the following dictionary:
d2 = {'i':['A'],'ii':['B'],'iii':['C','D'],'iv':['E'],'v':['F'],'vi':['G'],'vii':['H','I','J']}
What if the dictionary was in df format?
Any help would be much appreciated! Thank you!
Use map to create a new column, then explode the lists into rows:
df = df.assign(Column3=df['Column2'].map(d))
df = df.explode('Column3')
df
Column1 Column2 Column3
0 g a i
0 g a ii
1 h b iii
1 h b iv
2 i c v
2 i c vi
2 i c vii
Follow the same approach to map d2 onto Column3:
df=df.assign(Column4 = df['Column3'].map( d2))
df=df.explode('Column4')
df
Column1 Column2 Column3 Column4
0 g a i A
0 g a ii B
1 h b iii C
1 h b iii D
1 h b iv E
2 i c v F
2 i c vi G
2 i c vii H
2 i c vii I
2 i c vii J
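For reference, here is a self-contained sketch of the first step that also gives the 0..6 index from the desired output, by passing ignore_index=True to explode (available in recent pandas versions). The second half is one possible way to handle the "dictionary in df format" case with a merge. The lookup frame and its key column are hypothetical, since the question does not show what that dataframe looks like.

```python
import pandas as pd

d = {'a': ['i', 'ii'], 'b': ['iii', 'iv'], 'c': ['v', 'vi', 'vii']}
df = pd.DataFrame({'Column1': ['g', 'h', 'i'], 'Column2': ['a', 'b', 'c']})

# map the lists onto the key column, then turn each list item into its own row
out = (df.assign(Column3=df['Column2'].map(d))
         .explode('Column3', ignore_index=True))
print(out)

# Assumed layout if the mapping arrives as a long-format dataframe instead
lookup = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                       'Column3': ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii']})

# a left merge repeats each row of df once per matching key in lookup
merged = (df.merge(lookup, left_on='Column2', right_on='key', how='left')
            .drop(columns='key'))
print(merged)
```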

How to create a network graph based on a simple DataFrame

I am wondering how I can create an Edge list (from, to) based on this type of data. Both columns are inside a pandas data frame and the type is string.
Name Co-Workers
A A,B,C,D
B A,B,C,D
C A,B,C,E
D A,B,D,E
E C,D,E
I also want to remove self-connections such as A-A, B-B, C-C, and so on.
IIUC, you can explode your data and filter it:
df2 = df.copy()
df2['Co-Workers'] = df['Co-Workers'].str.split(',')
df2 = df2.explode('Co-Workers')
df2[df2['Name'].ne(df2['Co-Workers'])]
output:
Name Co-Workers
0 A B
0 A C
0 A D
1 B A
1 B C
1 B D
2 C A
2 C B
2 C E
3 D A
3 D B
3 D E
4 E C
4 E D
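A self-contained version of the same idea, in case it helps to run it end to end. It rebuilds the sample frame from the question and renames the columns to from and to, which are just names picked here to match the "(from, to)" wording of the question. The variable edges is likewise illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Co-Workers': ['A,B,C,D', 'A,B,C,D', 'A,B,C,E', 'A,B,D,E', 'C,D,E'],
})

# split the comma-separated string into a list, one row per list item
edges = df.assign(to=df['Co-Workers'].str.split(',')).explode('to')

# keep only the two edge columns, named from / to
edges = edges.rename(columns={'Name': 'from'})[['from', 'to']]

# drop self-connections such as A-A
edges = edges[edges['from'] != edges['to']].reset_index(drop=True)
print(edges)
```

Note that this keeps both directions (A to B and B to A) as separate rows, exactly like the output above.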
First, split the column from a string into a list of separate values.
Second, explode the column.
Third, create a directed graph.
Process the data with mozway's code above, keeping the filtered result in df2, and then:
import networkx as nx
from matplotlib.pyplot import figure
# build the directed graph directly from the edge list
G = nx.from_pandas_edgelist(df2, source='Name', target='Co-Workers', create_using=nx.DiGraph())
figure(figsize=(10, 8))
nx.draw_shell(G, with_labels=True)
Result graph:

Split Column into Unknown Number of Columns by Delimiter in Pandas Dataframe

I have this table with strings delimited by "+":
ID Products
1 A + B + C + D + E ...
2 A + F + G
3 X + D
I would like to return in this format
ID Products Product 1 Product 2 Product 3 Product 4 Product 5 product...
1 A + B + C + D + E ... A B C D E ...
2 A + F + G A F G
3 X + D X D
1 D + C + C + D + E D C C D E
How can I reproduce this in a pandas DataFrame?
Use Series.str.split with the regex r'\s+\+\s+' (one or more whitespace characters, an escaped +, one or more whitespace characters), then change the column names with DataFrame.add_prefix and finally add the result to the original with DataFrame.join:
df1 = df['Products'].str.split(r'\s+\+\s+', expand=True).add_prefix('Product').fillna('')
df = df.join(df1)
print (df)
ID Products Product0 Product1 Product2 Product3 Product4
0 1 A + B + C + D + E A B C D E
1 2 A + F + G A F G
2 3 X + D X D
Also if necessary change column names:
d = lambda x: f'Product{x+1}'
df = (df.join(df['Products'].str.split(r'\s+\+\s+', expand=True)
                            .rename(columns=d)
                            .fillna('')))
print (df)
ID Products Product1 Product2 Product3 Product4 Product5
0 1 A + B + C + D + E A B C D E
1 2 A + F + G A F G
2 3 X + D X D
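Here is a runnable sketch of the same approach on the sample data, with the "Product 1", "Product 2", ... headers from the question. It assumes the delimiter may or may not be surrounded by whitespace, so the pattern uses \s* rather than \s+. The variable names parts and out are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Products': ['A + B + C + D + E', 'A + F + G', 'X + D']})

# split on '+' with optional surrounding whitespace; expand=True returns
# one column per piece, padded with missing values for shorter rows
parts = df['Products'].str.split(r'\s*\+\s*', expand=True)

# 1-based column names, however many columns the longest row produced
parts.columns = [f'Product {i + 1}' for i in range(parts.shape[1])]

out = df.join(parts.fillna(''))
print(out)
```

The number of product columns is driven by the longest row, so nothing has to be known in advance.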

Append two pandas dataframe with different shapes and in for loop using python or pandasql

I have two dataframes such as:
df1:
id A B C D
1 a b c d
1 e f g h
1 i j k l
df2:
id A C D
2 x y z
2 u v w
The final outcome should be:
id A B C D
1 a b c d
1 e f g h
1 i j k l
2 x y z
2 u v w
These tables are generated in a for loop from JSON files, so I have to keep appending them one below another.
Note: The 'id' column of the two dataframes is always different.
My approach:
data is a dataframe in which column 'X' holds the JSON data; it also has an "id" column.
df1=pd.DataFrame()
for i, row1 in data.head(2).iterrows():
    df2= pd.io.json.json_normalize(row1["X"])
    df2.columns = df2.columns.map(lambda x: x.split(".")[-1])
    df2["id"]=[row1["id"] for i in range(df2.shape[0])]
    if len(df1)==0:
        df1=df2.copy()
    df1=pd.concat((df1,df2), ignore_index=True)
Error: AssertionError: Number of manager items must equal union of block items # manager items: 46, # tot_items: 49
How can I solve this using Python or pandasql?
You can use pd.concat to concatenate the two dataframes. Columns missing from one of them are filled with NaN:
>>> pd.concat((df1, df2), ignore_index=True)
id A B C D
0 1 a b c d
1 1 e f g h
2 1 i j k l
3 2 x NaN y z
4 2 u NaN v w
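For the loop in the question, a common pattern is to collect the per-row frames in a list and call pd.concat once at the end. The sketch below assumes that each cell of column 'X' holds a list of flat records. The sample data frame is made up for illustration, and frames, part and result are hypothetical names. If stripping the prefixes with split(".")[-1] produces duplicate column names in your real data, that may be what triggers the AssertionError, but this cannot be confirmed from the question alone.

```python
import pandas as pd

# made-up stand-in for the real `data` frame
data = pd.DataFrame({
    'id': [1, 2],
    'X': [
        [{'A': 'a', 'B': 'b', 'C': 'c', 'D': 'd'},
         {'A': 'e', 'B': 'f', 'C': 'g', 'D': 'h'}],
        [{'A': 'x', 'C': 'y', 'D': 'z'}],
    ],
})

frames = []
for _, row in data.iterrows():
    part = pd.json_normalize(row['X'])   # one row per record
    part.insert(0, 'id', row['id'])      # scalar is broadcast to all rows
    frames.append(part)

# a single concat aligns the columns and fills missing ones with NaN
result = pd.concat(frames, ignore_index=True)
print(result)
```

Concatenating once also avoids copying the accumulated frame on every iteration.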

How to aggregate a column by a value on another column?

Suppose I have the following df.
df = pd.DataFrame({
'A':['x','y','x','y'],
'B':['a','b','a','b'],
'C':[1,10,100,1000],
'D':['w','v','v','w']
})
A B C D
0 x a 1 w
1 y b 10 v
2 x a 100 v
3 y b 1000 w
I want to group by columns A and B, sum column C, and keep the value of D from the row that holds the group's maximum C. Like this:
A B C D
x a 101 v
y b 1010 w
So far, I have this:
df.groupby(['A','B']).agg({'C':sum})
A B C
x a 101
y b 1010
What function do I have to aggregate column D with?
You can use DataFrameGroupBy.idxmax for indices of max values of C with loc:
#unique index
df.reset_index(drop=True, inplace=True)
df1 = df.groupby(['A','B'])['C'].agg(['sum', 'idxmax'])
df1['idxmax'] = df.loc[df1['idxmax'], 'D'].values
df1 = df1.rename(columns={'idxmax':'D','sum':'C'}).reset_index()
Similar solution with map:
df1 = df.groupby(['A','B'])['C'].agg(['sum', 'idxmax']).reset_index()
df1['idxmax'] = df1['idxmax'].map(df['D'])
df1 = df1.rename(columns={'idxmax':'D','sum':'C'})
print (df1)
A B C D
0 x a 101 v
1 y b 1010 w
Alternatively, call set_index('D') before grouping, so that idxmax returns the D label directly:
df.set_index('D').groupby(['A','B']).C.agg(['sum','idxmax']).\
reset_index().rename(columns={'idxmax':'D','sum':'C'})
Out[407]:
A B C D
0 x a 101 v
1 y b 1010 w
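Another option, offered only as a sketch, is to sort by C first and then use named aggregation. groupby keeps the row order within each group, so after sorting the last D in each group is the one that sits on the row with the maximum C. The name out is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'y', 'x', 'y'],
    'B': ['a', 'b', 'a', 'b'],
    'C': [1, 10, 100, 1000],
    'D': ['w', 'v', 'v', 'w'],
})

out = (df.sort_values('C')                       # max C ends up last in each group
         .groupby(['A', 'B'], as_index=False)
         .agg(C=('C', 'sum'), D=('D', 'last')))  # sum of C, D from the max-C row
print(out)
```

If several rows tie for the maximum C, this picks one of the tied rows, so the idxmax variants may be preferable when a specific tie-break matters.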