Sort DataFrame asc/desc based on column value - pandas

I have this DataFrame
df = pd.DataFrame({'A': [100, 100, 300, 200, 200, 200], 'B': [60, 55, 12, 32, 15, 44], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})
and I want to sort it by columns "A" and "B". "A" is always ascending. I also want ascending for "B" if "C == x", else descending for "B" if "C == y". So it would end up like this
df_sorted = pd.DataFrame({'A': [100, 100, 200, 200, 200, 300], 'B': [55, 60, 44, 32, 15, 12], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})

I would filter each DataFrame into two Dataframe based on the value of C:
df_x = df.loc[df['C'] == 'x']
df_y = df.loc[df['C'] == 'y']
and then use "sort_values" like so:
df_x.sort_values(by=['A', 'B'], inplace=True)
sorting df_y will be different since you want one column ascending and the other descending, since "sort_values" is stable we can do it like so
df_y.sort_values(by=['A'], inplace=True)
df_y.sort_values(by=['b'], inplace=True, ascending=False)
You can then merge the DataFrames back and sort again by A and the order will remain.

You can set up a temporary column to invert the values of "B" when "C" equals "x", sort, and drop the column:
(df.assign(B2=df['B']*df['C'].eq('x').mul(2).sub(1))
.sort_values(by=['A', 'B2'])
.drop('B2', axis=1)
)

def function1(dd:pd.DataFrame):
return dd.sort_values(['A','B']) if dd.name=='x' else dd.sort_values(['A','B'],ascending=[True,False])
df.groupby('C').apply(function1).reset_index(drop=True)
A B C
0 100 55 x
1 100 60 x
2 200 44 y
3 200 32 y
4 200 15 y
5 300 12 y

Related

how to groupby but counsecutively only

i have a dataframe like this:
data = {'costs': [150, 400, 300, 500, 350], 'month':[1, 2, 2, 1, 1]}
df = pd.DataFrame(data)
i want to use groupby(['month']).sum() but first row not to be
cobmined with fourth and fifth rows so the result of costs would be
like this
list(df['costs'])= [150, 700, 850]
Try:
x = (
df.groupby((df.month != df.month.shift(1)).cumsum())
.agg({"costs": "sum", "month": "first"})
.reset_index(drop=True)
)
print(x)
Prints:
costs month
0 150 1
1 700 2
2 850 1

Calculate the difference between all rows and a specific row in the dataframe

This is a similar question to this thread.
Lets consider df as:
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
How can you calculate the difference between all rows and the row at Nth index in a group (lowest index for EACH group) for column "B", and put it in column "D"? I want to calculate mean square displacement for my data and I want to calculate the difference of values in a column in each group with the first appeared row in that group.
I tried:
df['D'] = df.groupby(["A"])['B'].sub(df.groupby(['A'])["B"].iloc[0])
Group = df.groupby(["A"])
However using .sub and groupby raise the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
the desired result would be like this:
A B C D
0 a 2 3 0 *lowest index in group "a"
1 b 5 6 0 *lowest index in group "b"
2 c 8 9 0 *lowest index in group "c"
3 a 0 0 -2
4 a 8 7 6
5 c 2 1 -6
I guess this answer could be enough of a hint for you:
import pandas as pd
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
print("df:")
print(df)
print()
groupA = df.groupby(['A'])
print("groupA:")
print(groupA.groups)
print()
print("lowest indices for each group from columnA:")
lowest_indices = dict()
for k, v in groupA.groups.items():
lowest_indices[k] = v[0]
print(lowest_indices)
print()
columnB = df['B']
print("columnB:")
print(columnB)
print()
df['D'] = df['B']
for i in range(len(df)):
group_at_i = df['A'].iloc[i]
lowest_index_of_that = lowest_indices[group_at_i]
b_element_at_that_index = df['B'].iloc[lowest_index_of_that]
the_difference = df['B'].iloc[i] - b_element_at_that_index
df.loc[i, 'D'] = the_difference
print("df:")
print(df)

Efficient column MultiIndex ordering

I have this dataframe :
df = pandas.DataFrame({'A' : [2000, 2000, 2000, 2000, 2000, 2000],
'B' : ["A+", 'B+', "A+", "B+", "A+", "B+"],
'C' : ["M", "M", "M", "F", "F", "F"],
'D' : [1, 5, 3, 4, 2, 6],
'Value' : [11, 12, 13, 14, 15, 16] }).set_index((['A', 'B', 'C', 'D']))
df = df.unstack(['C', 'D']).fillna(0)
And I'm wondering is there is a more elegant way to order the columns MultiIndex that the following code :
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
Especially I feel like the last line with the two transpose si far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for both lines and columns).
You can use sort index on both levels:
out = df.sort_index(level=[0,1],axis=1,ascending=[True, False])
I can use
axis=1
And therefore the last line become
df = df.sort_values(axis = 1, by = ["C", "D"], ascending = [True, False])

Generating boolean dataframe based on contents in series and dataframe

I have:
df = pd.DataFrame(
[
[22, 33, 44],
[55, 11, 22],
[33, 55, 11],
],
index=["abc", "def", "ghi"],
columns=list("abc")
) # size(3,3)
and:
unique = pd.Series([11, 22, 33, 44, 55]) # size(1,5)
then I create a new df based on unique and df, so that:
df_new = pd.DataFrame(index=unique, columns=df.columns) # size(5,3)
From this newly created df, I'd like to create a new boolean df based on unique and df, so that the end result is:
df_new = pd.DataFrame(
[
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
],
index=unique,
columns=df.columns
)
This new df is either true or false depending on whether the value is present in the original dataframe or not. For example, the first column has three values: [22, 55, 33]. In a df with dimensions (5,3), this first column would be: [0, 1, 1, 0, 1] i.e. [0, 22, 33, 0 , 55]
I tried filter2 = unique.isin(df) but this doesn't work, also notnull. I tried applying a filter but the dimensions returned were incorrect. How can I do this?
Use DataFrame.stack with DataFrame.reset_index, DataFrame.pivot, then check if not missing values by DataFrame.notna, cast to integers for True->1 and False->0 mapping and last remove index and columns names by DataFrame.rename_axis:
df_new = (df.stack()
.reset_index(name='v')
.pivot('v','level_1','level_0')
.notna()
.astype(int)
.rename_axis(index=None, columns=None))
print (df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
Helper Series is not necessary, but if there is more values or is necessary change order by helper Series use add DataFrame.reindex:
#added 66
unique = pd.Series([11, 22, 33, 44, 55,66])
df_new = (df.stack()
.reset_index(name='v')
.pivot('v','level_1','level_0')
.reindex(unique)
.notna()
.astype(int)
.rename_axis(index=None, columns=None))
print (df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
66 0 0 0

How to subtract one row to other rows in a grouped by dataframe?

I've got this data frame with some 'init' values ('value', 'value2') that I want to subtract to the mid term value 'mid' and final value 'final' once I've grouped by ID.
import pandas as pd
df = pd.DataFrame({
'value': [100, 120, 130, 200, 190,210],
'value2': [2100, 2120, 2130, 2200, 2190,2210],
'ID': [1, 1, 1, 2, 2, 2],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
My attempt was tho extract the index where I found 'init', 'mid' and 'final' and subtract from 'mid' and 'final' the value of 'init' once I've grouped the value by 'ID':
group = df.groupby('ID')
group['diff_1_f'] = group['value'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]]
group['diff_2_f'] = group['value2'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_1_m'] = group['value'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_2_m'] = group['value2'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
But of course it doesn't work. How can I obtain the following result:
df = pd.DataFrame({
'diff_value': [20, 30, -10,10],
'diff_value2': [20, 30, -10,10],
'ID': [ 1, 1, 2, 2],
'state': ['mid', 'final', 'mid', 'final'],
})
Also in it's grouped form.
Use:
#columns names in list for subtract
cols = ['value', 'value2']
#new columns names created by join
new = [c + '_diff' for c in cols]
#filter rows with init
m = df['state'].ne('init')
#add init rows to new columns by join and filter no init rows
df1 = df.join(df[~m].set_index('ID')[cols], lsuffix='_diff', on='ID')[m]
#subtract with numpy array by .values for prevent index alignment
df1[new] = df1[new].sub(df1[cols].values)
#remove helper columns
df1 = df1.drop(cols, axis=1)
print (df1)
value_diff value2_diff ID state
1 20 20 1 mid
2 30 30 1 final
4 -10 -10 2 mid
5 10 10 2 final