Getting the specific index number of every group - pandas

In this sample dataframe df:
import pandas as pd
import numpy as np
i = ['dog', 'cat', 'elephant'] * 3
df = pd.DataFrame(np.random.randn(9, 4), index=i,
columns=list('ABCD')).sort_index()
What is the quickest way to get the 2nd row of each animal as a dataframe?

You're looking for nth. If an animal has only a single row, no row is returned for that animal.
pandas.core.groupby.GroupBy.nth(n, dropna=None)
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
df.groupby(level=0).nth(1)
                 A         B         C         D
cat      -2.189615 -0.527398  0.786284  1.442453
dog       2.190704  0.607252  0.071074 -1.622508
elephant -2.536345  0.228888  0.716221  0.472490
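A minimal sketch of the single-row case mentioned above, using a small made-up frame (the animal "emu" and the values are illustrative, not from the question). The group with only one row simply does not appear in the result:

```python
import pandas as pd

# Illustrative data: "emu" has only one row, the other animals have two.
sample = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5]},
    index=["cat", "cat", "dog", "dog", "emu"],
)

# nth(1) takes the row at position 1 (the second row) of each group;
# groups that are too short contribute nothing.
second = sample.groupby(level=0).nth(1)
print(second)
```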

You can group the data by index and take the row at position 1 (the second row) of each group:
new_df = df.groupby(level=0).apply(lambda x: x.iloc[1, :])
                 A         B         C         D
cat       0.089608 -1.181394 -0.149988 -1.634295
dog       0.002782  1.620430  0.622397  0.058401
elephant  1.022441 -2.185710  0.854900  0.979411
If you expect any group in your dataframe to have only a single row, you can build that condition in:
new_df = df.groupby(level=0).apply(lambda x: x.iloc[1, :] if len(x) > 1 else None).dropna()
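As an alternative sketch that avoids apply altogether, one could number the rows within each group with cumcount and keep those at position 1. This is a different technique from the one above, and the frame below is made up for illustration; single-row groups drop out without a special case:

```python
import pandas as pd

# Illustrative data: "emu" has a single row.
sample = pd.DataFrame(
    {"A": [10, 20, 30, 40, 50]},
    index=["cat", "cat", "dog", "dog", "emu"],
)

# cumcount numbers rows 0, 1, 2, ... inside each group.
position = sample.groupby(level=0).cumcount()

# Use a plain boolean array so the duplicated index labels do not matter.
second_rows = sample[(position == 1).to_numpy()]
print(second_rows)
```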

Related

Drop duplicates and combine IDs where those duplication exist [duplicate]

I tried to use groupby to group rows with multiple values.
col val
A Cat
A Tiger
B Ball
B Bat
import pandas as pd
df = pd.read_csv("Inputfile.txt", sep='\t')
group = df.groupby(['col'])['val'].sum()
I got
A CatTiger
B BallBat
I want to introduce a delimiter, so that my output looks like
A Cat-Tiger
B Ball-Bat
I tried,
group = df.groupby(['col'])['val'].sum().apply(lambda x: '-'.join(x))
this yielded,
A C-a-t-T-i-g-e-r
B B-a-l-l-B-a-t
What is the issue here?
Thanks,
AP
Alternatively you can do it this way:
In [48]: df.groupby('col')['val'].agg('-'.join)
Out[48]:
col
A Cat-Tiger
B Ball-Bat
Name: val, dtype: object
UPDATE: answering question from the comment:
In [2]: df
Out[2]:
col val
0 A Cat
1 A Tiger
2 A Panda
3 B Ball
4 B Bat
5 B Mouse
6 B Egg
In [3]: df.groupby('col')['val'].agg('-'.join)
Out[3]:
col
A Cat-Tiger-Panda
B Ball-Bat-Mouse-Egg
Name: val, dtype: object
Finally, to convert the index (or MultiIndex) to columns:
df1 = df.groupby('col')['val'].agg('-'.join).reset_index(name='new')
Just try:
group = df.groupby(['col'])['val'].apply(lambda x: '-'.join(x))
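To spell out why the original attempt split the words into letters: by the time the lambda ran, sum() had already concatenated each group into a single string, and str.join iterates over a string character by character. The short sketch below rebuilds the question's data inline (instead of reading Inputfile.txt) and contrasts the two cases:

```python
import pandas as pd

df = pd.DataFrame({
    "col": ["A", "A", "B", "B"],
    "val": ["Cat", "Tiger", "Ball", "Bat"],
})

# Joining an already concatenated string iterates over its characters.
per_char = "-".join("CatTiger")

# Joining the group's values directly iterates over the strings themselves.
joined = df.groupby("col")["val"].agg("-".join)

print(per_char)
print(joined)
```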

pandas return auxiliary column from groupby and max

I have a pandas DataFrame with 3 columns, A, B, and V.
I want a DataFrame with A as the index and one column, which contains the B for the maximum V.
I can easily create a df with A and the maximum V using groupby, and then perform some machinations to extract the corresponding B, but that seems like the wrong idea.
I've been playing with combinations of groupby and agg with no joy.
Sample Data:
A,B,V
MHQ,Q,0.5192
MMO,Q,0.4461
MTR,Q,0.5385
MVM,Q,0.351
NCR,Q,0.0704
MHQ,E,0.5435
MMO,E,0.4533
MTR,E,-0.6716
MVM,E,0.3684
NCR,E,-0.0278
MHQ,U,0.2712
MMO,U,0.1923
MTR,U,0.3833
MVM,U,0.1355
NCR,U,0.1058
A = [1,1,1,2,2,2,3,3,3,4,4,4]
B = [1,2,3,4,5,6,7,8,9,10,11,12]
V = [21,22,23,24,25,26,27,28,29,30,31,32]
df = pd.DataFrame({'A': A, 'B': B, 'V': V})
res = df.groupby('A').apply(
lambda x: x[x['V']==x['V'].max()]).set_index('A')['B'].to_frame()
res
B
A
1 3
2 6
3 9
4 12
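A possible alternative, sketched here under the assumption that one row per group is enough, is to look up the row label of each group's maximum with idxmax and select those rows. Unlike the filter above, idxmax returns only the first maximum, so ties within a group yield a single row:

```python
import pandas as pd

A = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
V = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
df = pd.DataFrame({"A": A, "B": B, "V": V})

# Row label of the maximum V within each A.
idx = df.groupby("A")["V"].idxmax()

# Pick those rows and keep B, indexed by A.
res = df.loc[idx, ["A", "B"]].set_index("A")
print(res)
```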

saving dataframe groupby rows to exactly two lines

I have a dataframe and I want to group the rows by a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group into two lines. If the group size is even, say 2n, then n rows go in one line and the remaining n in the second line. If it is odd (2n+1), either n+1 and n or n and n+1 will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
# stable sort so rows keep their original order within each id
df = df.sort_values('id', kind='mergesort')
# next 3 lines: for each group, find the split point
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform('mean')
# False marks the first half of each group and True the second half,
# so the halves come out in their original order after grouping
df['separate_column'] = df['range_idx'] >= df['mean_rank_group']
# group by id and the helper column, then drop the helper
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
    columns='separate_column')
This approach is a bit convoluted, but it does the job:
def func(s: pd.Series):
    # split the group's values into two halves and join each with spaces
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = df_new.drop(labels="name", axis=1).stack().reset_index().drop(labels=["level_1"], axis=1).rename(columns={0: "name"}).set_index("id")
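For comparison, here is a shorter sketch of the same idea that marks each row as belonging to the first or second half of its group using cumcount and the group size. The helper column name half is hypothetical, and the names are written without the stray space that appears in the question's CSV:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "name": ["A", "B", "C", "D", "E", "F", "ds", "G", "dsds"],
})

g = df.groupby("id", sort=False)
pos = g.cumcount()                    # 0, 1, 2, ... within each id
size = g["name"].transform("size")    # group size repeated on every row

# 0 for the first half (the larger one when the size is odd), 1 for the second.
half = (pos >= (size + 1) // 2).astype(int)

out = (
    df.assign(half=half)
      .groupby(["id", "half"], sort=False)["name"]
      .agg(" ".join)
      .reset_index()
      .drop(columns="half")
)
print(out)
```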

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, I do the following (i.e. remove C first, normalise the data, and add the column back).
df_new = df.drop('C', axis=1)
df_concept = df[['C']]
from sklearn import preprocessing
x = df_new.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_concept
However, I am sure that there is an easier way of doing this in pandas (given the names of the columns that I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to pick the numeric columns, apply min-max normalisation (subtract each column's minimum and divide by its range), and assign the result back to those columns only:
import numpy as np
df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print (df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
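As an illustration of the df[columns].apply(func) pattern, the sketch below standardises the numeric columns to z-scores instead of min-max scaling. The choice of z-score is only an example of "any other function", not something the question asked for:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [100, 300, 500],
    "C": list("abc"),
})

# Select the numeric column names; C is left untouched.
cols = df.select_dtypes("number").columns

# apply calls the function once per column (each column arrives as a Series).
df[cols] = df[cols].apply(lambda s: (s - s.mean()) / s.std())
print(df)
```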

Rearrange rows of pandas dataframe based on list and keeping the order

import numpy as np
import pandas as pd
df = pd.DataFrame(data={'result':[-6.77,6.11,5.67,-7.679,-0.0930,4.342]}\
,index=['A','B','C','D','E','F'])
new_order = np.array([1,2,2,0,1,0])
The new_order numpy array assigns each row to one of three groups [0,1 or 2]. I would like to rearrange the rows of df so that those rows in group 0 appear first, followed by 1, and finally 2. Within each of the three groups the initial ordering should remain unchanged.
At the start the df is arranged as follows:
result
A -6.770
B 6.110
C 5.670
D -7.679
E -0.093
F 4.342
Here is the desired output given the above input data.
result
D -7.679
F 4.342
A -6.770
E -0.093
B 6.110
C 5.670
You could use argsort with kind='mergesort' to get sorted row indices that keep the original order within each group (mergesort is a stable sort), and then index into the dataframe with those for the desired output, like so -
df.iloc[new_order.argsort(kind='mergesort')]
Sample run -
In [2]: df
Out[2]:
result
A -6.770
B 6.110
C 5.670
D -7.679
E -0.093
F 4.342
In [3]: df.iloc[new_order.argsort(kind='mergesort')]
Out[3]:
result
D -7.679
F 4.342
A -6.770
E -0.093
B 6.110
C 5.670
Pure pandas:
df.set_index(new_order, append=True) \
.sort_index(level=1) \
.reset_index(1, drop=True)
Explanation:
Append new_order to the index:
set_index(new_order, append=True)
Sort by that new index level:
sort_index(level=1)
Drop the index level that was added:
reset_index(1, drop=True)
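Another pandas-only sketch is to attach the group labels as a temporary column and sort by it with a stable algorithm. The column name _group is hypothetical. One caveat worth checking for your own data: as far as I can tell, sort_index(level=1) with its default sort_remaining=True breaks ties by the remaining index level, so it preserves the original order here because the labels A to F are already sorted, whereas a stable sort on a column does not depend on that:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={"result": [-6.77, 6.11, 5.67, -7.679, -0.0930, 4.342]},
    index=["A", "B", "C", "D", "E", "F"],
)
new_order = np.array([1, 2, 2, 0, 1, 0])

out = (
    df.assign(_group=new_order)                 # temporary helper column
      .sort_values("_group", kind="mergesort")  # stable: ties keep input order
      .drop(columns="_group")
)
print(out)
```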