The following lines take a long time to update since nearly 2.5 lakh (250,000) records are present - pandas

In one dataframe I took the groups whose count is more than one, and I need to update specific column values at those indices. Since there are about 2.5 lakh rows, it is failing with a memory error. Is there any fast solution for it?
gl_no=primary.groupby('GL Account').filter(lambda x:len(x)>1)
primary_index=primary[primary['GL Account'].isin(gl_no['GL Account'])].index
primary.loc[primary_index]['Cost Element']='01'
primary.loc[primary_index]['GL Acc Type']='P'

You can use GroupBy.transform with 'size', compare the result against 1 for a boolean mask, and set the new values by boolean indexing with DataFrame.loc. This avoids materializing a filtered copy of the frame. (Note also that chained indexing like primary.loc[primary_index]['Cost Element'] = '01' assigns to a temporary copy, so the original frame may never be updated; a single primary.loc[rows, column] = value call is needed.)
import pandas as pd

primary = pd.DataFrame({
    'Cost Element': list('abcdef'),
    'GL Acc Type': list('abcdef'),
    'GL Account': list('aadbbc'),
})
print(primary)
  Cost Element GL Acc Type GL Account
0            a           a          a
1            b           b          a
2            c           c          d
3            d           d          b
4            e           e          b
5            f           f          c
mask = primary.groupby('GL Account')['GL Account'].transform('size') > 1
primary.loc[mask, ['Cost Element', 'GL Acc Type']] = ['01', 'P']
print(primary)
  Cost Element GL Acc Type GL Account
0           01           P          a
1           01           P          a
2            c           c          d
3           01           P          b
4           01           P          b
5            f           f          c
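An equivalent mask that skips the groupby entirely is Series.duplicated with keep=False, which flags every row whose 'GL Account' value occurs more than once. A minimal sketch, assuming the size > 1 condition above is exactly what is wanted:
# duplicated(keep=False) marks all rows whose 'GL Account' appears more
# than once - the same rows that transform('size') > 1 selects
mask = primary['GL Account'].duplicated(keep=False)
primary.loc[mask, ['Cost Element', 'GL Acc Type']] = ['01', 'P']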

Related

Map Dictionary of Lists to pd Dataframe Column and Create Repeat Rows Based on n Number of List Contents

I am trying to use the following two components: 1) a dictionary of lists and 2) a dataframe column composed of the dictionary keys. I would like to map the values onto their corresponding key in the existing pandas column and create duplicate rows based on the number of list elements. I would like to maintain this as a df and not convert it to a series.
ex. dictionary
d = {'a': ['i','ii'], 'b': ['iii','iv'], 'c': ['v','vi','vii']}
ex. dataframe columns
  Column1 Column2
0       g       a
1       h       b
2       i       c
desired output:
  Column1 Column2 Column3
0       g       a       i
1       g       a      ii
2       h       b     iii
3       h       b      iv
4       i       c       v
5       i       c      vi
6       i       c     vii
What if another dictionary had to be mapped similarly to these three columns from the output? Say, with the following dictionary:
d2 = {'i': ['A'], 'ii': ['B'], 'iii': ['C','D'], 'iv': ['E'], 'v': ['F'], 'vi': ['G'], 'vii': ['H','I','J']}
What if the dictionary was in df format?
Any help would be much appreciated! Thank you!
Use map to create a new column and then explode the lists into rows (reassigning each step so the exploded frame is kept):
df = df.assign(Column3=df['Column2'].map(d))
df = df.explode('Column3')
df
  Column1 Column2 Column3
0       g       a       i
0       g       a      ii
1       h       b     iii
1       h       b      iv
2       i       c       v
2       i       c      vi
2       i       c     vii
Follow the same pattern to map d2 onto Column3:
df = df.assign(Column4=df['Column3'].map(d2))
df = df.explode('Column4')
df
  Column1 Column2 Column3 Column4
0       g       a       i       A
0       g       a      ii       B
1       h       b     iii       C
1       h       b     iii       D
1       h       b      iv       E
2       i       c       v       F
2       i       c      vi       G
2       i       c     vii       H
2       i       c     vii       I
2       i       c     vii       J
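One detail worth noting: explode keeps the index label of the source row, which is why 0, 1 and 2 repeat in the outputs above. To match the desired output's unique 0..n-1 index, a final reset_index is a reasonable step:
# explode repeats the source row's index label for every list element;
# reset_index(drop=True) restores a unique RangeIndex afterwards
df = df.reset_index(drop=True)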

How to modify groups of a grouped pandas dataframe

I have this dataframe:
s = pd.DataFrame({'A': [*'1112222'], 'B': [*'abcdefg'], 'C': [*'ABCDEFG']})
that is like this:
   A  B  C
0  1  a  A
1  1  b  B
2  1  c  C
3  2  d  D
4  2  e  E
5  2  f  F
6  2  g  G
I want to do a groupby like this:
groups = s.groupby("A")
for example, the group 2 is:
g2 = groups.get_group("2")
that looks like this:
   A  B  C
3  2  d  D
4  2  e  E
5  2  f  F
6  2  g  G
Anyway, I want to do some operation in each group.
Let me show how my final result should be:
   A  B  C        D
1  1  b  B  a=b;A=B
2  1  c  C  a=c;A=C
4  2  e  E  d=e;D=E
5  2  f  F  d=f;D=F
6  2  g  G  d=g;D=G
Actually, I am dropping the first row in each group but combining it with the other rows of the group to create column D.
Any idea how to do this?
Summary of what I want to do in two lines:
I want to do a groupby and, in each group, drop the first row. I also want to add a column to the whole dataframe, built from each group's first row combined with its remaining rows.
What I have tried:
In order to solve this, I am going to create a function:
def func(g):
    first_row_of_group = g.iloc[0]
    g = g.iloc[1:]
    g["C"] = g.apply(lambda row: ";".join([f'{a}={b}' for a, b in zip(row, first_row_of_group)]))
    return g
Then I am going to do this:
groups.apply(lambda g: func(g))
You can apply a custom function to each group that combines the elements of the first row with the remaining rows and then drops the first row:
def remove_first(x):
    first = x.iloc[0]
    x = x.iloc[1:]
    x['D'] = first['B'] + '=' + x['B'] + ';' + first['C'] + '=' + x['C']
    # an equivalent operation:
    # x['D'] = first.iloc[1] + '=' + x.iloc[:, 1] + ';' + first.iloc[2] + '=' + x.iloc[:, 2]
    return x
s = s.groupby('A').apply(remove_first).droplevel(0)
Output:
   A  B  C        D
1  1  b  B  a=b;A=B
2  1  c  C  a=c;A=C
4  2  e  E  d=e;D=E
5  2  f  F  d=f;D=F
6  2  g  G  d=g;D=G
Note: the dataframe shown in your question is constructed from
s = pd.DataFrame({'A': [*'1112222'], 'B': [*'abcdefg'], 'C': [*'ABCDEFG']})
but the raw input you gave differs from it.
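For larger frames, a vectorized sketch that avoids calling a Python function per group: broadcast each group's first row with transform('first') and drop the first rows with duplicated (assuming the same s as above):
# everything except each group's first row
mask = s['A'].duplicated()
# each group's first B/C values, broadcast to every row of that group
first = s.groupby('A')[['B', 'C']].transform('first')
out = s.loc[mask].copy()
out['D'] = first.loc[mask, 'B'] + '=' + out['B'] + ';' + first.loc[mask, 'C'] + '=' + out['C']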

How to create a network graph based on a simple dataframe

I am wondering how I can create an edge list (from, to) based on this type of data. Both columns are inside a pandas dataframe and the type is string.
Name  Co-Workers
A     A,B,C,D
B     A,B,C,D
C     A,B,C,E
D     A,B,D,E
E     C,D,E
I also want to remove self-connections like A-A, B-B, C-C, and so on.
IIUC, you can explode your data and filter it:
df2 = df.copy()
df2['Co-Workers'] = df2['Co-Workers'].str.split(',')
df2 = df2.explode('Co-Workers')
df2 = df2[df2['Name'].ne(df2['Co-Workers'])]
df2
output:
  Name Co-Workers
0    A          B
0    A          C
0    A          D
1    B          A
1    B          C
1    B          D
2    C          A
2    C          B
2    C          E
3    D          A
3    D          B
3    D          E
4    E          C
4    E          D
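If the edge list should carry explicit (from, to) column names, as the question's wording suggests, a rename is a reasonable final touch (the labels here are illustrative, not anything networkx requires):
# rename to the (from, to) pair the question mentions
edges = df2.rename(columns={'Name': 'from', 'Co-Workers': 'to'})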
First, split the column from a string into a list of separate values.
Second, explode the column.
Third, create a directed graph.
Process the data with mozway's code above, and then:
import networkx as nx
from matplotlib.pyplot import figure

G = nx.from_pandas_edgelist(df2, source='Name', target='Co-Workers')
figure(figsize=(10, 8))
nx_graph = nx.compose(nx.DiGraph(), G)
nx.draw_shell(nx_graph, with_labels=True)
Result graph: [shell-layout plot of the directed co-worker graph]

Append two pandas dataframes with different shapes in a for loop using python or pandasql

I have two dataframes such as:
df1:
id A B C D
1 a b c d
1 e f g h
1 i j k l
df2:
id A C D
2 x y z
2 u v w
The final outcome should be:
id A B C D
1 a b c d
1 e f g h
1 i j k l
2 x y z
2 u v w
These tables are generated in a for loop from JSON files, so I have to keep appending them one below another.
Note: the two dataframes' 'id' columns are always different.
My approach:
data is a dataframe in which column 'X' holds JSON data; it also has an 'id' column.
df1 = pd.DataFrame()
for i, row1 in data.head(2).iterrows():
    df2 = pd.io.json.json_normalize(row1["X"])
    df2.columns = df2.columns.map(lambda x: x.split(".")[-1])
    df2["id"] = [row1["id"] for i in range(df2.shape[0])]
    if len(df1) == 0:
        df1 = df2.copy()
    df1 = pd.concat((df1, df2), ignore_index=True)
Error: AssertionError: Number of manager items must equal union of block items # manager items: 46, # tot_items: 49
How to solve this using python or pandas sql.
You can use pd.concat to concatenate the two dataframes, like:
>>> pd.concat((df1, df2), ignore_index=True)
   id  A    B  C  D
0   1  a    b  c  d
1   1  e    f  g  h
2   1  i    j  k  l
3   2  x  NaN  y  z
4   2  u  NaN  v  w
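For the loop in the question, a common pattern is to collect each normalized batch in a list and concatenate once at the end; this sketch assumes data has the 'X' and 'id' columns described above. It also sidesteps two problems in the original loop: the first batch is appended twice (the copy into df1 followed immediately by the concat), and the AssertionError shown is typically triggered by duplicate column names, which the split(".")[-1] mapping can produce.
frames = []
for i, row1 in data.iterrows():
    # json_normalize is a top-level function in modern pandas
    part = pd.json_normalize(row1["X"])
    part.columns = part.columns.map(lambda c: c.split(".")[-1])
    part["id"] = row1["id"]
    frames.append(part)
# a single concat at the end avoids quadratic re-copying inside the loop
df1 = pd.concat(frames, ignore_index=True, sort=False)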

How to aggregate a column by a value on another column?

Suppose I have the following df.
df = pd.DataFrame({
    'A': ['x', 'y', 'x', 'y'],
    'B': ['a', 'b', 'a', 'b'],
    'C': [1, 10, 100, 1000],
    'D': ['w', 'v', 'v', 'w'],
})
   A  B     C  D
0  x  a     1  w
1  y  b    10  v
2  x  a   100  v
3  y  b  1000  w
I want to group by columns A and B, sum column C, and keep the value of D from the row holding the group's maximum C. Like this:
A  B     C  D
x  a   101  v
y  b  1010  w
So far, I have this:
df.groupby(['A','B']).agg({'C': 'sum'})
        C
A B
x a   101
y b  1010
What function do I have to aggregate column D with?
You can use DataFrameGroupBy.idxmax to get the indices of the max values of C, then select from D with loc:
# ensure a unique index first
df.reset_index(drop=True, inplace=True)
df1 = df.groupby(['A','B'])['C'].agg(['sum', 'idxmax'])
df1['idxmax'] = df.loc[df1['idxmax'], 'D'].values
df1 = df1.rename(columns={'idxmax': 'D', 'sum': 'C'}).reset_index()
Similar solution with map:
df1 = df.groupby(['A','B'])['C'].agg(['sum', 'idxmax']).reset_index()
df1['idxmax'] = df1['idxmax'].map(df['D'])
df1 = df1.rename(columns={'idxmax':'D','sum':'C'})
print(df1)
   A  B     C  D
0  x  a   101  v
1  y  b  1010  w
Set the index to D before you group by:
(df.set_index('D')
   .groupby(['A','B'])['C'].agg(['sum', 'idxmax'])
   .reset_index()
   .rename(columns={'idxmax': 'D', 'sum': 'C'}))
Out[407]:
   A  B     C  D
0  x  a   101  v
1  y  b  1010  w
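An equivalent sketch that avoids idxmax altogether: sort by C descending first, after which 'first' on D within each group picks the row with the group's maximum C (ties, if any, resolve arbitrarily):
# after sorting by C descending, 'first' in each group is the D value
# from the row with the maximum C; the sum is order-independent
out = (df.sort_values('C', ascending=False)
         .groupby(['A', 'B'], as_index=False)
         .agg({'C': 'sum', 'D': 'first'}))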