Convert an indexed pandas matrix to a flat dataframe - pandas

Given the dataframe:
df = pd.DataFrame([['foo', 123, 4, 5, 0, 1], ['foo', 123, 4, 0, 9, 1], ['bar', 33, 0, 0, 3, 5]], columns=list('ABCDEF'))
[out]:
A B C D E F
0 foo 123 4 5 0 1
1 foo 123 4 0 9 1
2 bar 33 0 0 3 5
The goal is to sum certain columns ('C', 'D', 'E', 'F') using other columns ('A' and 'B') as keys to achieve:
A B C D E F
0 foo 123 8 5 9 2
2 bar 33 0 0 3 5
I've tried:
df.groupby(['A', 'B']).sum()
[out]:
C D E F
A B
bar 33 0 0 3 5
foo 123 8 5 9 2
How do I change it back to the non-indexed matrix? i.e.
A B C D E F
0 foo 123 8 5 9 2
2 bar 33 0 0 3 5

You need to add .reset_index().
df.groupby(['A','B']).sum().reset_index()
A B C D E F
0 bar 33 0 0 3 5
1 foo 123 8 5 9 2
or
df.set_index(['A','B']).groupby(level=[0,1]).sum().reset_index()
A B C D E F
0 bar 33 0 0 3 5
1 foo 123 8 5 9 2

You can pass the parameter as_index=False to get a DataFrame with the keys as regular columns:
df1 = df.groupby(['A', 'B'], as_index=False).sum()
print (df1)
A B C D E F
0 bar 33 0 0 3 5
1 foo 123 8 5 9 2
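Note that the answers above return the groups sorted by key (bar before foo), while the output requested in the question lists foo first. A minimal sketch, assuming the order of first appearance is what is wanted, is to combine as_index=False with sort=False:

```python
import pandas as pd

df = pd.DataFrame(
    [["foo", 123, 4, 5, 0, 1], ["foo", 123, 4, 0, 9, 1], ["bar", 33, 0, 0, 3, 5]],
    columns=list("ABCDEF"),
)

# as_index=False keeps the keys 'A' and 'B' as ordinary columns;
# sort=False keeps the groups in order of first appearance (foo before bar).
flat = df.groupby(["A", "B"], as_index=False, sort=False).sum()
print(flat)
```

The result gets a fresh 0..n-1 index, so the original row labels (0 and 2 in the question) are not preserved.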

Related

Split and concatenate dataframe

I have a dataframe which looks like this:
>>>df = pd.DataFrame({
'id': [i for i in range(5)],
'1': ['a', 'b', 'c', 'd', 'e'],
'2': ['f', 'g', 'h', 'i', 'g']
})
>>>df
id 1 2
0 0 a f
1 1 b g
2 2 c h
3 3 d i
4 4 e g
I want to convert this dataframe to the following dataframe:
>>>df_concatenated
id val
1 0 a
1 1 b
1 2 c
1 3 d
1 4 e
2 0 f
2 1 g
2 2 h
2 3 i
2 4 g
One way is to use pd.melt:
pd.melt(df, id_vars=['id'], value_vars=['1','2']).set_index('variable',append=True)
variable id value
0 1 0 a
1 1 1 b
2 1 2 c
3 1 3 d
4 1 4 e
5 2 0 f
6 2 1 g
7 2 2 h
8 2 3 i
9 2 4 g
The other is to split with the .iloc accessor and concatenate. Longer, but it works:
res1 = df.iloc[:, :2].copy()      # columns 'id' and '1'
res1.columns = ['id', 'val']
res2 = df.iloc[:, [0, 2]].copy()  # columns 'id' and '2'
res2.columns = ['id', 'val']
res = pd.concat([res1, res2])
res
You can try this:
df = df.rename({"1":"val"},axis=1)
df_temp = df[["id","2"]]
df_temp = df_temp.rename({"2":"val"},axis=1)
df.drop("2",axis=1,inplace=True)
out_df = pd.concat([df,df_temp],axis=0).reset_index(drop=True)
print(out_df)
output:
id val
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 0 f
6 1 g
7 2 h
8 3 i
9 4 g
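For completeness, here is a sketch of the melt route that names the value column val directly and uses the former column name as the index, which seems to be the layout asked for in the question. The names long_df and df_concatenated are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "id": list(range(5)),
    "1": ["a", "b", "c", "d", "e"],
    "2": ["f", "g", "h", "i", "g"],
})

# melt stacks columns '1' and '2' into a single 'val' column;
# the name of the column each value came from is kept in 'variable'.
long_df = pd.melt(df, id_vars="id", value_vars=["1", "2"],
                  var_name="variable", value_name="val")

# Use 'variable' as the index to mirror the requested layout.
df_concatenated = long_df.set_index("variable")
print(df_concatenated)
```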

How to display nth-row and last row

I need to display every nth row plus the last one using pandas. I know that every nth row can be selected with iloc,
for example:
data = {"x1": [1,2,3,4,5,6,7,8,9,10], "x2": ["a","b","c","d","e","f","g","h","i","j"]}
df = pd.DataFrame(data=data)
a = df.iloc[::2]
print(a)
will display
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
But I need it to be:
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
9 10 j
How can this be achieved?
If the DataFrame has the default RangeIndex, take the union of the index labels and select with loc:
a = df.loc[df.index[::2].union([df.index[-1]])]
print(a)
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
9 10 j
Detail:
print(df.index[::2].union([df.index[-1]]))
Int64Index([0, 2, 4, 6, 8, 9], dtype='int64')
Another more general solution:
data = {"x1": [1,2,3,4,5,6,7,8,9,10], "x2": ["a","b","c","d","e","f","g","h","i","j"]}
df = pd.DataFrame(data=data, index=[0]*10)
print (df)
x1 x2
0 1 a
0 2 b
0 3 c
0 4 d
0 5 e
0 6 f
0 7 g
0 8 h
0 9 i
0 10 j
arr = np.arange(len(df.index))
a = df.iloc[np.union1d(arr[::2], [arr[-1]])]
print(a)
x1 x2
0 1 a
0 3 c
0 5 e
0 7 g
0 9 i
0 10 j
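The positional approach above can be wrapped in a small helper. This is only a sketch; the function name every_nth_and_last is made up for illustration:

```python
import numpy as np
import pandas as pd


def every_nth_and_last(frame, step):
    """Return every `step`-th row plus the last row, selected by position."""
    if frame.empty:
        return frame
    positions = np.arange(len(frame))
    # union1d sorts and de-duplicates, so the last row is never repeated
    keep = np.union1d(positions[::step], [positions[-1]])
    return frame.iloc[keep]


df = pd.DataFrame({"x1": range(1, 11), "x2": list("abcdefghij")})
print(every_nth_and_last(df, 2))
print(every_nth_and_last(df, 3))
```

Because it works on positions rather than labels, it does not depend on the index being unique or sorted.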

Element wise multiplication of each row

I have two DataFrame objects and want to multiply each row of the first element-wise by the values of the second:
df_prob_wc.shape # (3505, 13)
df_prob_c.shape # (13, 1)
I thought I could do it with DataFrame.apply()
df_prob_wc.apply(lambda x: x.multiply(df_prob_c), axis=1)
which gives me:
TypeError: ("'int' object is not iterable", 'occurred at index $')
or with
df_prob_wc.apply(lambda x: x * df_prob_c, axis=1)
which gives me:
TypeError: 'int' object is not iterable
But it's not working.
However, I can do this:
df_prob_wc.apply(lambda x: x * np.asarray([1,2,3,4,5,6,7,8,9,10,11,12,13]), axis=1)
What am I doing wrong here?
It seems you need to multiply by a Series, created from df_prob_c with iloc:
df_prob_wc = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df_prob_wc)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df_prob_c = pd.DataFrame([[4,5,6,1,2,3]]).T
# align the index of df_prob_c with the columns of df_prob_wc
df_prob_c.index = df_prob_wc.columns
print (df_prob_c)
0
A 4
B 5
C 6
D 1
E 2
F 3
print (df_prob_wc.shape)
(3, 6)
print (df_prob_c.shape)
(6, 1)
print (df_prob_c.iloc[:,0])
A 4
B 5
C 6
D 1
E 2
F 3
Name: 0, dtype: int64
print (df_prob_wc.mul(df_prob_c.iloc[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
Another solution is to multiply by a numpy array; you only need [:,0] to select the single column:
print (df_prob_wc.mul(df_prob_c.values[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
And another solution with DataFrame.squeeze:
print (df_prob_wc.mul(df_prob_c.squeeze(), axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
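One difference between these variants is worth spelling out: multiplying by a Series aligns on column labels, while multiplying by a numpy array goes purely by position. A small sketch with made-up weights, deliberately listed in a different label order, shows the effect:

```python
import pandas as pd

df_prob_wc = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Hypothetical weights, deliberately not in the same order as the columns
weights = pd.Series({"C": 100, "A": 1, "B": 10})

# Series: each column is multiplied by the weight with the same label
by_label = df_prob_wc.mul(weights, axis=1)

# numpy array: first column times first value, and so on, labels ignored
by_position = df_prob_wc.mul(weights.to_numpy(), axis=1)

print(by_label)
print(by_position)
```

If the labels of the two objects are known to match in order, both give the same result; otherwise the Series form is the safer choice.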

Subsetting index from Pandas DataFrame

I have a DataFrame with columns [A, B, C, D, E, F, G, H].
An index has been made with columns [D, G, H]:
>>> print(dgh_columns)
Index(['D', 'G', 'H'], dtype='object')
How can I retrieve the original DataFrame without the columns D, G, H ?
Is there an index subset operation?
Ideally, this would be:
df[df.index - dgh_columns]
But this doesn't seem to work
I think you can use Index.difference:
df[df.columns.difference(dgh_columns)]
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[7,8,9],
'F':[1,3,5],
'G':[5,3,6],
'H':[7,4,3]})
print (df)
A B C D E F G H
0 1 4 7 1 7 1 5 7
1 2 5 8 3 8 3 3 4
2 3 6 9 5 9 5 6 3
dgh_columns = pd.Index(['D', 'G', 'H'])
print (df[df.columns.difference(dgh_columns)])
A B C E F
0 1 4 7 7 1
1 2 5 8 8 3
2 3 6 9 9 5
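Numpy solution with numpy.setxor1d or numpy.setdiff1d. Note that setxor1d is a symmetric difference, so it only matches setdiff1d here because every label in dgh_columns exists in df.columns: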
dgh_columns = pd.Index(['D', 'G', 'H'])
print (df[np.setxor1d(df.columns, dgh_columns)])
A B C E F
0 1 4 7 7 1
1 2 5 8 8 3
2 3 6 9 9 5
dgh_columns = pd.Index(['D', 'G', 'H'])
print (df[np.setdiff1d(df.columns, dgh_columns)])
A B C E F
0 1 4 7 7 1
1 2 5 8 8 3
2 3 6 9 9 5
Use drop:
df.drop(list('DGH'), axis=1)
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[7,8,9],
'F':[1,3,5],
'G':[5,3,6],
'H':[7,4,3]})
df.drop(list('DGH'), axis=1)
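One detail that the sample above hides, because its columns are already alphabetical: Index.difference returns a sorted result by default, whereas a boolean mask or drop keeps the original column order. A short sketch with deliberately unsorted columns:

```python
import pandas as pd

# Columns deliberately not in alphabetical order
df = pd.DataFrame({"H": [7, 4], "B": [4, 5], "D": [1, 3], "A": [1, 2], "G": [5, 3]})
dgh_columns = pd.Index(["D", "G", "H"])

# Index.difference sorts its result by default: columns come back as A, B
sorted_cols = df[df.columns.difference(dgh_columns)]

# A boolean mask keeps the original column order: B, A
kept_order = df.loc[:, ~df.columns.isin(dgh_columns)]

# drop(columns=...) also keeps the original order
dropped = df.drop(columns=dgh_columns)

print(list(sorted_cols.columns), list(kept_order.columns), list(dropped.columns))
```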

pandas Groupby after groupby

df = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [10,10,11,10,10,15], 'key1':['a','b','a','b','c','c'],'key2':1})
df1 = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [100,100,110,100,100,150], 'key1':['a','c','b','a','a','c'],'key2':1})
dfn = pd.merge(df,df1,on='key2')
dfn_grouped = dfn.groupby('key1_y')
The output of list(dfn_grouped):
[('a', A_x B_x key1_x key2 A_y B_y key1_y
0 1 10 a 1 1 100 a
3 1 10 a 1 1 100 a
... ... ... ...
33 3 15 c 1 1 100 a
34 3 15 c 1 2 100 a),
('b', A_x B_x key1_x key2 A_y B_y key1_y
2 1 10 a 1 3 110 b
8 2 10 b 1 3 110 b
14 3 11 a 1 3 110 b
20 1 10 b 1 3 110 b
26 2 10 c 1 3 110 b
32 3 15 c 1 3 110 b),
('c', A_x B_x key1_x key2 A_y B_y key1_y
1 1 10 a 1 2 100 c
...... ... ....
35 3 15 c 1 3 150 c)]
Now I need to group dfn_grouped by "key1_x" as well and collect the pairs into a dict like A_x:A_y:
key1_y key1_x A_X:A_Y
b a {'10':'110','11':110}
b b {'10':110}
b c {'10':110,'15':110}
// if A_x in dict append the A_y like:
// b e {'10':[11,12]}
Is this what you need?
>> grouped = dfn.groupby(['key1_y','key1_x','A_x'])
>> # collect the A_y values of each (key1_y, key1_x, A_x) group into a list
>> dfg = pd.DataFrame(grouped.apply(lambda x: list(x.A_y))).reset_index()
>> dfg.columns = ['key1_y', 'key1_x', 'A_x', 'dic_values']
>> dfg['dic'] = [{a:b} for a,b in zip(dfg.A_x.values,dfg.dic_values.values)]
>> dfg.drop(['A_x','dic_values'],axis=1,inplace=True)
>> # merge the one-entry dicts of each (key1_y, key1_x) pair into a single dict
>> g_dics = dfg.groupby(['key1_y','key1_x']).apply(lambda x: {k: v for d in x.dic for k, v in d.items()})
>> pd.DataFrame(g_dics).reset_index()
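If the chained groupby/apply calls are hard to follow, the same idea can be written as a plain loop over the groups. This is a sketch, not the only way to do it: the column name "A_x:A_y" is taken from the question, and every A_x is mapped to a list of the A_y values paired with it:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3], "B": [10, 10, 11, 10, 10, 15],
                   "key1": ["a", "b", "a", "b", "c", "c"], "key2": 1})
df1 = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3], "B": [100, 100, 110, 100, 100, 150],
                    "key1": ["a", "c", "b", "a", "a", "c"], "key2": 1})
dfn = pd.merge(df, df1, on="key2")

rows = []
for (key1_y, key1_x), group in dfn.groupby(["key1_y", "key1_x"]):
    mapping = {}
    # append A_y to the list stored under A_x, creating the list on first use
    for a_x, a_y in zip(group["A_x"], group["A_y"]):
        mapping.setdefault(int(a_x), []).append(int(a_y))
    rows.append({"key1_y": key1_y, "key1_x": key1_x, "A_x:A_y": mapping})

result = pd.DataFrame(rows)
print(result)
```

To build the dict from the B columns instead, swap "A_x" and "A_y" for "B_x" and "B_y" in the inner loop.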