How to display nth-row and last row - pandas

I need to display every nth row plus the last row using pandas. I know that every nth row can be selected with iloc,
for example:
import pandas as pd
data = {"x1": [1,2,3,4,5,6,7,8,9,10], "x2": ["a","b","c","d","e","f","g","h","i","j"]}
df = pd.DataFrame(data=data)
a = df.iloc[::2]
print(a)
will display
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
But I need it to be:
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
9 10 j
How can this be achieved?

If the index is the default RangeIndex, take the union of the selected index labels with the last label and select with loc:
a = df.loc[df.index[::2].union([df.index[-1]])]
print(a)
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
9 10 j
Detail:
print(df.index[::2].union([df.index[-1]]))
Int64Index([0, 2, 4, 6, 8, 9], dtype='int64')
Another, more general solution selects by position, so it also works when index labels are duplicated:
import numpy as np
data = {"x1": [1,2,3,4,5,6,7,8,9,10], "x2": ["a","b","c","d","e","f","g","h","i","j"]}
df = pd.DataFrame(data=data, index=[0]*10)
print (df)
x1 x2
0 1 a
0 2 b
0 3 c
0 4 d
0 5 e
0 6 f
0 7 g
0 8 h
0 9 i
0 10 j
arr = np.arange(len(df.index))
a = df.iloc[np.union1d(arr[::2], [arr[-1]])]
print(a)
x1 x2
0 1 a
0 3 c
0 5 e
0 7 g
0 9 i
0 10 j
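The positional approach can also be written with a boolean mask over row positions, which avoids building a union of arrays. The following is a minimal sketch; the helper name `every_nth_and_last` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def every_nth_and_last(df, step):
    """Return every `step`-th row by position, plus the last row."""
    if df.empty:
        return df
    mask = np.zeros(len(df), dtype=bool)
    mask[::step] = True   # mark every step-th position
    mask[-1] = True       # always keep the final row
    return df[mask]


df = pd.DataFrame({"x1": range(1, 11), "x2": list("abcdefghij")})
result = every_nth_and_last(df, 2)
print(result)
```

Because the mask is positional, the last row is never duplicated when it already falls on a step, and duplicate index labels do not matter.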

Related

Split and concatenate dataframe

I have a dataframe that looks like this:
>>>df = pd.DataFrame({
'id': [i for i in range(5)],
'1': ['a', 'b', 'c', 'd', 'e'],
'2': ['f', 'g', 'h', 'i', 'g']
})
>>>df
id 1 2
0 0 a f
1 1 b g
2 2 c h
3 3 d i
4 4 e g
I want to convert this dataframe to the following dataframe:
>>>df_concatenated
id val
1 0 a
1 1 b
1 2 c
1 3 d
1 4 e
2 0 f
2 1 g
2 2 h
2 3 i
2 4 g
One way is to use pd.melt:
pd.melt(df, id_vars=['id'], value_vars=['1','2']).set_index('variable',append=True)
variable id value
0 1 0 a
1 1 1 b
2 1 2 c
3 1 3 d
4 1 4 e
5 2 0 f
6 2 1 g
7 2 2 h
8 2 3 i
9 2 4 g
The other is to split the columns with the .iloc accessor and concatenate the pieces. It is longer, but it works:
res1=df.iloc[:,[0,2]]
res1.columns=['id','val']
res=df.iloc[:,:2]
res.columns=['id','val']
# column '1' first, then column '2', to match the requested order
res2=pd.concat([res,res1])
res2
You can try this:
df = df.rename({"1":"val"},axis=1)
df_temp = df[["id","2"]]
df_temp = df_temp.rename({"2":"val"},axis=1)
df.drop("2",axis=1,inplace=True)
out_df = pd.concat([df,df_temp],axis=0).reset_index(drop=True)
print(out_df)
output:
id val
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 0 f
6 1 g
7 2 h
8 3 i
9 4 g
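For comparison, here is a minimal sketch of the melt route that names the value column `val` directly and uses the original column label as the index, which is close to the layout requested in the question. The index name `source` is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(5),
    "1": list("abcde"),
    "2": list("fghig"),
})

# melt stacks columns '1' and '2' into one 'val' column;
# 'source' records which original column each value came from
out = (
    df.melt(id_vars="id", var_name="source", value_name="val")
      .set_index("source")
)
print(out)
```

This scales to any number of value columns without writing one slice per column.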

Pandas dataframe rename column

I split a dataframe into two parts and changed their column names separately. Here's what I did:
df1 = df[df['colname']==0]
df2 = df[df['colname']==1]
df1.columns = [ 'a'+ x for x in df1.columns]
df2.columns = [ 'b'+ x for x in df2.columns]
It turned out that the columns of df2 start with 'ba' rather than 'b'. What happened?
I cannot reproduce your problem; it works fine for me.
An alternative is to use add_prefix instead of a list comprehension:
df = pd.DataFrame({'colname':[0,1,0,0,0,1],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
C D E F colname
0 7 1 5 a 0
1 8 3 3 a 1
2 9 5 6 a 0
3 4 7 9 b 0
4 2 1 2 b 0
5 3 0 4 b 1
df1 = df[df['colname']==0].add_prefix('a')
df2 = df[df['colname']==1].add_prefix('b')
print (df1)
aC aD aE aF acolname
0 7 1 5 a 0
2 9 5 6 a 0
3 4 7 9 b 0
4 2 1 2 b 0
print (df2)
bC bD bE bF bcolname
1 8 3 3 a 1
5 3 0 4 b 1
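A small check, as a sketch on made-up data, that add_prefix returns a new frame and leaves the labels of the source frame untouched, so a second prefix cannot stack on top of the first:

```python
import pandas as pd

df = pd.DataFrame({"colname": [0, 1, 0, 1], "C": [7, 8, 9, 4]})

# each call builds a new DataFrame with relabelled columns
df1 = df[df["colname"] == 0].add_prefix("a")
df2 = df[df["colname"] == 1].add_prefix("b")

print(list(df.columns))    # original labels are unchanged
print(list(df1.columns))
print(list(df2.columns))
```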

Convert an indexed pandas matrix to a flat dataframe

Given the dataframe:
df = pd.DataFrame([['foo', 123, 4, 5, 0, 1], ['foo', 123, 4, 0, 9, 1], ['bar', 33, 0, 0, 3, 5]], columns=list('ABCDEF'))
[out]:
A B C D E F
0 foo 123 4 5 0 1
1 foo 123 4 0 9 1
2 bar 33 0 0 3 5
The goal is to sum certain columns ('C', 'D', 'E', 'F') using other columns ('A' and 'B') as keys to achieve:
A B C D E F
0 foo 123 8 5 9 2
2 bar 33 0 0 3 5
I've tried:
df.groupby(['A', 'B']).sum()
[out]:
C D E F
A B
bar 33 0 0 3 5
foo 123 8 5 9 2
How do I change it back to the non-indexed matrix? i.e.
A B C D E F
0 foo 123 8 5 9 2
2 bar 33 0 0 3 5
You need to add .reset_index().
df.groupby(['A','B']).sum().reset_index()
A B C D E F
0 bar 33 0 0 3 5
1 foo 123 8 5 9 2
or
df.set_index(['A','B']).groupby(level=[0,1]).sum().reset_index()
A B C D E F
0 bar 33 0 0 3 5
1 foo 123 8 5 9 2
You can pass the parameter as_index=False so that groupby returns a flat DataFrame:
df1 = df.groupby(['A', 'B'], as_index=False).sum()
print (df1)
A B C D E F
0 bar 33 0 0 3 5
1 foo 123 8 5 9 2
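The outputs above list bar before foo because groupby sorts the group keys by default, while the question shows foo first. A minimal sketch, using the question's data, that keeps groups in order of first appearance by also passing sort=False:

```python
import pandas as pd

df = pd.DataFrame(
    [["foo", 123, 4, 5, 0, 1],
     ["foo", 123, 4, 0, 9, 1],
     ["bar", 33, 0, 0, 3, 5]],
    columns=list("ABCDEF"),
)

# as_index=False keeps 'A' and 'B' as ordinary columns;
# sort=False keeps groups in the order they first appear
out = df.groupby(["A", "B"], as_index=False, sort=False).sum()
print(out)
```

The result has a fresh 0..n-1 index, so the row labels differ from the 0 and 2 shown in the question.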

Element wise multiplication of each row

I have two DataFrame objects, and I want to multiply each row of the first element-wise by the values in the second:
df_prob_wc.shape # (3505, 13)
df_prob_c.shape # (13, 1)
I thought I could do it with DataFrame.apply()
df_prob_wc.apply(lambda x: x.multiply(df_prob_c), axis=1)
which gives me:
TypeError: ("'int' object is not iterable", 'occurred at index $')
or with
df_prob_wc.apply(lambda x: x * df_prob_c, axis=1)
which gives me:
TypeError: 'int' object is not iterable
Neither attempt works.
However, this does work:
df_prob_wc.apply(lambda x: x * np.asarray([1,2,3,4,5,6,7,8,9,10,11,12,13]), axis=1)
What am I doing wrong here?
It seems you need to multiply by a Series, selected from df_prob_c with iloc:
df_prob_wc = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df_prob_wc)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df_prob_c = pd.DataFrame([[4,5,6,1,2,3]]).T
# align the index of df_prob_c with the columns of df_prob_wc
df_prob_c.index = df_prob_wc.columns
print (df_prob_c)
0
A 4
B 5
C 6
D 1
E 2
F 3
print (df_prob_wc.shape)
(3, 6)
print (df_prob_c.shape)
(6, 1)
print (df_prob_c.iloc[:,0])
A 4
B 5
C 6
D 1
E 2
F 3
Name: 0, dtype: int64
print (df_prob_wc.mul(df_prob_c.iloc[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
Another solution is to multiply by a NumPy array; [:,0] is only needed to select the first column:
print (df_prob_wc.mul(df_prob_c.values[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
And another solution with DataFrame.squeeze:
print (df_prob_wc.mul(df_prob_c.squeeze(), axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
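To tie this back to the shapes in the question, here is a minimal sketch with a (5, 13) stand-in for the (3505, 13) frame and random data, checking that multiplying by the squeezed column matches plain NumPy broadcasting. The variable names `weights` and `expected` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_prob_wc = pd.DataFrame(rng.random((5, 13)))   # stand-in for the (3505, 13) frame
df_prob_c = pd.DataFrame(rng.random((13, 1)))    # 13 rows, one column

# collapse the single column into a Series of length 13
weights = df_prob_c.squeeze("columns")

# axis=1 aligns the Series index with the columns of df_prob_wc
result = df_prob_wc.mul(weights, axis=1)

# the same computation with raw arrays: (5, 13) * (13,)
expected = df_prob_wc.to_numpy() * df_prob_c.to_numpy().ravel()
print(np.allclose(result.to_numpy(), expected))
```

Note that mul aligns on labels, so the Series index has to match the column labels of the first frame; with mismatched labels the result would contain NaN.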

Subsetting index from Pandas DataFrame

I have a DataFrame with columns [A, B, C, D, E, F, G, H].
An index has been made with columns [D, G, H]:
>>> print(dgh_columns)
Index(['D', 'G', 'H'], dtype='object')
How can I retrieve the original DataFrame without the columns D, G, H ?
Is there an index subset operation?
Ideally, this would be:
df[df.index - dgh_columns]
But this doesn't seem to work.
I think you can use Index.difference:
df[df.columns.difference(dgh_columns)]
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[7,8,9],
'F':[1,3,5],
'G':[5,3,6],
'H':[7,4,3]})
print (df)
A B C D E F G H
0 1 4 7 1 7 1 5 7
1 2 5 8 3 8 3 3 4
2 3 6 9 5 9 5 6 3
dgh_columns = pd.Index(['D', 'G', 'H'])
print (df[df.columns.difference(dgh_columns)])
A B C E F
0 1 4 7 7 1
1 2 5 8 8 3
2 3 6 9 9 5
NumPy solutions with numpy.setxor1d or numpy.setdiff1d (setxor1d gives the same result only when every name in dgh_columns is present in df.columns):
dgh_columns = pd.Index(['D', 'G', 'H'])
print (df[np.setxor1d(df.columns, dgh_columns)])
A B C E F
0 1 4 7 7 1
1 2 5 8 8 3
2 3 6 9 9 5
dgh_columns = pd.Index(['D', 'G', 'H'])
print (df[np.setdiff1d(df.columns, dgh_columns)])
A B C E F
0 1 4 7 7 1
1 2 5 8 8 3
2 3 6 9 9 5
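Index.difference and the NumPy set functions return sorted labels, so the original column order is lost when the columns are not already sorted. A minimal sketch of an order-preserving alternative using a boolean mask from Index.isin, with deliberately unsorted columns:

```python
import pandas as pd

# columns deliberately in reverse alphabetical order
df = pd.DataFrame({c: [1, 2, 3] for c in "HGFEDCBA"})
dgh_columns = pd.Index(["D", "G", "H"])

# keep every column whose label is not in dgh_columns, in original order
kept = df.loc[:, ~df.columns.isin(dgh_columns)]

print(list(kept.columns))                          # original order
print(list(df.columns.difference(dgh_columns)))   # sorted order
```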
Use drop:
df.drop(list('DGH'), axis=1)
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[7,8,9],
'F':[1,3,5],
'G':[5,3,6],
'H':[7,4,3]})
df.drop(list('DGH'), axis=1)