Rearrange rows of pandas dataframe based on list and keeping the order - pandas

import numpy as np
import pandas as pd
df = pd.DataFrame(data={'result':[-6.77,6.11,5.67,-7.679,-0.0930,4.342]}\
,index=['A','B','C','D','E','F'])
new_order = np.array([1,2,2,0,1,0])
The new_order numpy array assigns each row to one of three groups [0,1 or 2]. I would like to rearrange the rows of df so that those rows in group 0 appear first, followed by 1, and finally 2. Within each of the three groups the initial ordering should remain unchanged.
At the start the df is arranged as follows:
result
A -6.770
B 6.110
C 5.670
D -7.679
E -0.093
F 4.342
Here is the desired output given the above input data.
result
D -7.679
F 4.342
A -6.770
E -0.093
B 6.110
C 5.670

You could use argsort with kind='mergesort', which is a stable sort, to get row positions that keep the original order within each group, and then simply index into the dataframe with those positions for the desired output, like so -
df.iloc[new_order.argsort(kind='mergesort')]
Sample run -
In [2]: df
Out[2]:
result
A -6.770
B 6.110
C 5.670
D -7.679
E -0.093
F 4.342
In [3]: df.iloc[new_order.argsort(kind='mergesort')]
Out[3]:
result
D -7.679
F 4.342
A -6.770
E -0.093
B 6.110
C 5.670
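To make the role of the stable sort explicit, here is a minimal sketch on the same data. It assumes a NumPy version that accepts kind='stable' (equivalent to 'mergesort' for this purpose); the names positions and reordered are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'result': [-6.77, 6.11, 5.67, -7.679, -0.0930, 4.342]},
    index=['A', 'B', 'C', 'D', 'E', 'F'],
)
new_order = np.array([1, 2, 2, 0, 1, 0])

# A stable sort keeps equal keys in their original relative order,
# so rows within each group are not shuffled.
positions = np.argsort(new_order, kind='stable')

# positions are integer row positions, hence iloc rather than loc.
reordered = df.iloc[positions]
print(reordered)
```

With the default kind='quicksort' the order of rows inside a group is not guaranteed, which is why the kind argument matters here.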

pure pandas
df.set_index(new_order, append=True) \
.sort_index(level=1) \
.reset_index(1, drop=True)
explanation
append new_order to the index
set_index(new_order, append=True)
use that new index level and sort by it
sort_index(level=1)
drop the index level I added
reset_index(1, drop=True)
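One caveat, as far as I understand the defaults: sort_index(level=1) also sorts the remaining index levels, so ties are ordered by label, which only matches the original order here because the labels A to F are already sorted. A possible alternative sketch that does not depend on the labels uses a temporary column and a stable sort; the column name _group is a hypothetical helper, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'result': [-6.77, 6.11, 5.67, -7.679, -0.0930, 4.342]},
    index=['A', 'B', 'C', 'D', 'E', 'F'],
)
new_order = np.array([1, 2, 2, 0, 1, 0])

reordered = (
    df.assign(_group=new_order)                  # attach the group id as a temporary column
      .sort_values('_group', kind='mergesort')   # stable sort keeps within-group order
      .drop(columns='_group')                    # remove the helper column again
)
print(reordered)
```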

Related

saving dataframe groupby rows to exactly two lines

I got a dataframe and I want to groupby the rows based on a specific column. Number of rows in each group will be at least 4 and at most 50. I want to save one column from the group into two lines. If the groupsize is even, let us say 2n, then n rows in one line and the remaining n in the second line. If it is odd, n+1 and n or n and n+1 will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
columns='separate_column')
This approach is a bit convoluted, but it does the job:
def func(s: pd.Series):
    mid = max(s.shape[0]//2 ,1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"]= df_new["name"].apply(lambda x: x[0])
df_new["name2"]= df_new["name"].apply(lambda x: x[1])
df = df_new.drop(labels="name", axis=1).stack().reset_index().drop(labels = ["level_1"], axis=1).rename(columns={0:"name"}).set_index("id")
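For comparison, here is a shorter sketch of the same idea: split each group's names at the midpoint and emit two rows per group. The helper two_lines and the variable names are hypothetical, and the sample data is the question's with the stray space before "dsds" removed.

```python
import pandas as pd
from io import StringIO

data = """id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2,dsds
"""
df = pd.read_csv(StringIO(data))

def two_lines(names):
    """Join the names into two strings; the first gets the extra item if the count is odd."""
    values = list(names)
    half = (len(values) + 1) // 2
    return [' '.join(values[:half]), ' '.join(values[half:])]

# One (id, line) pair per output row, in the original group order.
rows = [
    (key, line)
    for key, group in df.groupby('id', sort=False)
    for line in two_lines(group['name'])
]
result = pd.DataFrame(rows, columns=['id', 'name'])
print(result)
```

Since every group has at least 4 rows according to the question, neither line is ever empty.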

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data and add the column back).
df_new = df.drop('C', axis=1)
df_concept = df[['C']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_concept
However, I am sure that there is an easier way of doing this in pandas (given the names of the columns that I do not need to normalise, normalise the rest directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to get a DataFrame with only the numeric columns, apply min-max scaling (subtract the column minimum and divide by the difference between maximum and minimum), and then assign back only the normalized columns:
import numpy as np
df1 = df.select_dtypes(np.number)
df[df1.columns]=(df1-df1.min())/(df1.max()-df1.min())
print (df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
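As a small sketch of that df[columns] = df[columns].apply(func) pattern, the same min-max scaling can be written as a function applied per column. The names min_max and num_cols are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [100, 300, 500],
    'C': list('abc'),
})

def min_max(col):
    """Scale one numeric column to the range [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

# Pick the numeric columns automatically; column C is left untouched.
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = df[num_cols].apply(min_max)
print(df)
```

Note that a constant column would give a zero denominator and therefore NaN values, so that case may need separate handling.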

Getting the specific index number of every group

In this sample dataframe df:
import pandas as pd
import numpy as np
i = ['dog', 'cat', 'elephant'] * 3
df = pd.DataFrame(np.random.randn(9, 4), index=i,
columns=list('ABCD')).sort_index()
What is the quickest way to get the 2nd row of each animal as a dataframe?
You're looking for nth. If an animal has only a single row, no result will be returned.
pandas.core.groupby.GroupBy.nth(n, dropna=None)
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
df.groupby(level=0).nth(1)
A B C D
cat -2.189615 -0.527398 0.786284 1.442453
dog 2.190704 0.607252 0.071074 -1.622508
elephant -2.536345 0.228888 0.716221 0.472490
You can group the data by index and get elements at index 1 (second row) for each group
new_df = df.groupby(level=0).apply(lambda x: x.iloc[1, :])
A B C D
cat 0.089608 -1.181394 -0.149988 -1.634295
dog 0.002782 1.620430 0.622397 0.058401
elephant 1.022441 -2.185710 0.854900 0.979411
If some groups in your dataframe may have only a single row, you can build in that condition:
new_df = df.groupby(level=0).apply(lambda x: x.iloc[1, :] if len(x) > 1 else None).dropna()
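Another way to sketch this, without apply, is a boolean mask built from cumcount, which numbers the rows inside each group starting at 0. The example below uses np.arange instead of random data so the result is reproducible, and skips sort_index because cumcount follows the existing row order; both are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

i = ['dog', 'cat', 'elephant'] * 3
df = pd.DataFrame(np.arange(36).reshape(9, 4), index=i, columns=list('ABCD'))

# cumcount() == 1 marks the second row of every group;
# groups with a single row simply have no row marked.
is_second = (df.groupby(level=0).cumcount() == 1).to_numpy()
second = df[is_second]
print(second)
```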

pandas: Assign values to a slice a MultiIndex by range of secondary index

I have a problem with assigning a series like object to a slice of a Pandas dataframe.
Maybe I'm not using the DataFrame the way it is intended, so some enlightenment would be greatly appreciated.
I've already read the following articles:
pandas: slice a MultiIndex by range of secondary index
Returning a view versus a copy
As far as I understand, invoking the slice with a single .loc call ensures that I am not getting a copy of the data. The original dataframe does get altered, but instead of the expected data I get NaN values.
See the appended code snippet.
Do I have to iterate over the desired section of the dataframe for each single value I want to change and use the .set_value(row_idx,col_idx,val) method?
kind regards and thanks in advance
Markus
In [1]: import pandas as pd
In [2]: mindex = pd.MultiIndex.from_product([['one','two'],['first','second']])
In [3]: dfmi = pd.DataFrame([list('abcd'),list('efgh'),list('ijkl'),list('mnop')],
...: index = mindex, columns=(['X','Y','Z','Q']))
In [4]: print(dfmi)
X Y Z Q
one first a b c d
second e f g h
two first i j k l
second m n o p
In [5]: dfmi.loc[('two',slice('first','second')),'X']
Out[5]:
two first i
second m
Name: X, dtype: object
In [6]: substitute = pd.Series(data=["ab","cd"], index= mindex.levels[1])
...: print(substitute)
first ab
second cd
dtype: object
In [7]: dfmi.loc[('two',slice('first','second')),'X'] = substitute
In [8]: print(dfmi)
X Y Z Q
one first a b c d
second e f g h
two first NaN j k l
second NaN n o p
What's happening is that substitute has an index, which determines the location of its values, and dfmi.loc[('two',slice('first','second')),'X'] also specifies a location.
During the assignment pandas tries to align the two indexes, and since they do not match (they would if substitute also had a matching MultiIndex), the result of the alignment is all NaN, which is what gets inserted.
A solution could be to get rid of the index of substitute since the location of where you want to insert the values is already specified in the loc:
dfmi.loc[('two',slice('first','second')),'X'] = substitute.values
or even simpler, insert the values directly:
dfmi.loc[('two',slice('first','second')),'X'] = ["ab","cd"]
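To illustrate the alignment point, here is a sketch in which the replacement Series is given the same MultiIndex as the target slice, so the assignment lines up by label. The names target and aligned are hypothetical.

```python
import pandas as pd

mindex = pd.MultiIndex.from_product([['one', 'two'], ['first', 'second']])
dfmi = pd.DataFrame(
    [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
    index=mindex, columns=['X', 'Y', 'Z', 'Q'],
)

target = ('two', slice('first', 'second'))

# Reuse the index of the selected slice so the labels match exactly.
aligned = pd.Series(['ab', 'cd'], index=dfmi.loc[target, 'X'].index)
dfmi.loc[target, 'X'] = aligned
print(dfmi)
```

This keeps the label-based safety of alignment, whereas passing .values or a plain list relies purely on position.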
Can you try this:
dfmi.loc['two']['X']=substitute

plotting categorical data pandas/bokeh

I have a pandas dataframe which gives the pass/fail result of each student in every subject. I want to generate a plot which shows the number of passes and fails for every subject. I tried the groupby method, but I can only get a plot for a single subject. I want a plot which has the subject names on the x-axis and the number of passes and fails on the y-axis. Here is the sample dataframe.
10IS665 10ISL67 10ISL68
F F P
F F P
P P F
p P P
p P P
p F F
Create some test data first:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100, 4) > 0.3, columns=list("ABCD"))
df = df.replace([True, False], ["P", "F"])
Apply value_counts() to every column of the data, and transpose the result:
df_count = df.apply(pd.Series.value_counts).T
Then call plot() with kind="bar":
df_count.plot(kind="bar")
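The same counting step can be sketched on the sample data from the question. That data mixes "p" and "P", so this version upper-cases the values first, which is an assumption about what was intended; the name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    '10IS665': ['F', 'F', 'P', 'p', 'p', 'p'],
    '10ISL67': ['F', 'F', 'P', 'P', 'P', 'F'],
    '10ISL68': ['P', 'P', 'F', 'P', 'P', 'F'],
})

# Normalise the case, count P/F per subject, then transpose so that
# each row is a subject and each column is a result.
counts = df.apply(lambda col: col.str.upper().value_counts()).T
print(counts)
```

Calling counts.plot(kind="bar") on this frame should then give one group of bars per subject, as in the answer above.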