I'm dealing with ranked ordered list data at massive scale: I need to compare how individuals rank institutions/programs across periods, and I need help figuring out the most efficient way to handle this.
A ranked ordered list (ROL) is a report by an individual in which they rank programs at institutions from most preferred to least preferred (0 being the most preferred).
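For example, a single individual's ROL in one period can be thought of as a mapping from rank to an (institution, program) pair. A tiny illustration (the values match individual 1, period 2 in the fake data further down):
# one individual's ranked ordered list for one period: rank -> (institution, program)
rol = {
    0: (100, 101),  # rank 0: most preferred
    1: (100, 102),  # rank 1: second choice
}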
Operations: I need to run multiple comparisons between ROLs, such as whether the order changes, whether new institutions or programs are added, and more that I'm not detailing here.
I started using dictionaries because I'm familiar with them, but for a subsample my code takes 28 hours to run. I need to speed this up a lot, and I'm particularly looking for advice on the most efficient way to work with this type of data.
Below is a fake data set on which I'm running the code.
import pandas as pd
import numpy as np
# generate fake data frame
df = pd.DataFrame([[1, 1, 0, 100, 101], [1, 2, 0, 100, 101], [1, 2, 1, 100, 102], [2, 1, 0, 100, 101], [2, 2, 0, 100, 101], [2, 2, 1, 200, 202], [3, 1, 0, 100, 101], [3, 1, 1, 200, 201], [3, 2, 0, 100, 101], [3, 2, 1, 200, 201], [4, 1, 0, 100, 101], [4, 1, 1, 200, 201], [4, 2, 0, 200, 201], [4, 2, 1, 100, 101] ], columns=['id_individual', 'period', 'rank', 'id_institution', 'id_program'])
df['change_app'] = False
df['change_order'] = False
df['add_newinst'] = False
df['add_newprog'] = False
for indiv in df['id_individual'].unique():
    # recover the ranking of this individual in each period
    r_pre = df.loc[(df['id_individual'] == indiv) & (df['period'] == 1)]
    r_post = df.loc[(df['id_individual'] == indiv) & (df['period'] == 2)]
    # empty dicts to store rank -> (institution, program)
    rank_pre = {}
    rank_post = {}
    # extract institution and program for each rank and store them in the dicts
    for i in range(len(r_pre)):
        rank_pre[i] = (r_pre.loc[r_pre['rank'] == i, 'id_institution'].values[0],
                       r_pre.loc[r_pre['rank'] == i, 'id_program'].values[0])
    for i in range(len(r_post)):
        rank_post[i] = (r_post.loc[r_post['rank'] == i, 'id_institution'].values[0],
                        r_post.loc[r_post['rank'] == i, 'id_program'].values[0])
    # if the dictionaries differ, classify the change
    if rank_pre != rank_post:
        mask = df['id_individual'] == indiv
        # the application changed in some way
        df.loc[mask, 'change_app'] = True
        # pure reorder: same (institution, program) pairs and same length
        df.loc[mask, 'change_order'] = (set(rank_pre.values()) == set(rank_post.values())) \
                                       & (len(rank_pre) == len(rank_post))
        # sets of (institution, program) pairs and of institutions alone
        programs_pre = set(rank_pre.values())
        programs_post = set(rank_post.values())
        inst_pre = {x[0] for x in rank_pre.values()}
        inst_post = {x[0] for x in rank_post.values()}
        # added institution: inst_post has an element that is not in inst_pre
        df.loc[mask, 'add_newinst'] = len(inst_post - inst_pre) > 0
        # added program: programs_post has an element that is not in programs_pre
        df.loc[mask, 'add_newprog'] = len(programs_post - programs_pre) > 0
df.head(14)
Expected Output:
id_individual period rank id_institution id_program change_app change_order add_newinst add_newprog
0 1 1 0 100 101 True False False True
1 1 2 0 100 101 True False False True
2 1 2 1 100 102 True False False True
3 2 1 0 100 101 True False True True
4 2 2 0 100 101 True False True True
5 2 2 1 200 202 True False True True
6 3 1 0 100 101 False False False False
7 3 1 1 200 201 False False False False
8 3 2 0 100 101 False False False False
9 3 2 1 200 201 False False False False
10 4 1 0 100 101 True True False False
11 4 1 1 200 201 True True False False
12 4 2 0 200 201 True True False False
13 4 2 1 100 101 True True False False
I tried: performing operations over ranked ordered lists from individuals using pandas/dictionaries.
I expected: low computing time.
For 500,000 individuals, comparing ranked ordered lists takes around 20 hours.
Making some pivot tables that we can apply vectorized operations to should perform far faster than any manual loop...
# Test Data
df = pd.DataFrame([[1, 1, 0, 100, 101], [1, 2, 0, 100, 101], [1, 2, 1, 100, 102], [2, 1, 0, 100, 101], [2, 2, 0, 100, 101], [2, 2, 1, 200, 202], [3, 1, 0, 100, 101], [3, 1, 1, 200, 201], [3, 2, 0, 100, 101], [3, 2, 1, 200, 201], [4, 1, 0, 100, 101], [4, 1, 1, 200, 201], [4, 2, 0, 200, 201], [4, 2, 1, 100, 101] ], columns=['id_individual', 'period', 'rank', 'id_institution', 'id_program'])
# Pivot Config, we'll use this twice~
config = {
'index': 'id_individual',
'columns': 'period',
'values': ['id_institution', 'id_program']
}
# First Pivot Table will be to look at unique values by period:
unique = df.pivot_table(**config, aggfunc=set)
# To look for changes, it'll help if there aren't missing values
# So, let's do a reindex trick to extract those missing values:
names = ['id_individual', 'period', 'rank']
df = (df.set_index(names)
.reindex(pd.MultiIndex.from_product(df[names].apply(set), names=names))
.reset_index())
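# (For this sample frame, the reindex adds the missing (id_individual, period, rank)
#  combinations, e.g. (1, 1, 1) and (2, 1, 1), as rows of NaN, so every
#  individual/period pair now covers the same set of ranks.)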
# Now, we can make the Pivot Table of changes:
changes = df.pivot_table(**config, aggfunc=tuple)
# You're looking for a transform, so we'll set the index we're using
# to facilitate this. We'll also drop the NaN values we created,
# and convert back to integers.
df = df.set_index('id_individual').dropna().astype(int)
# Let's take the cross section of each period; looking at changes.
p1_c, p2_c = [changes.xs(x, level='period', axis=1) for x in (1,2)]
# It was changed if either column had any change:
was_changed = p1_c.ne(p2_c).any(axis=1)
df['change_app'] = was_changed
# Let's take the cross section of each period; looking at unique values.
p1_u, p2_u = [unique.xs(x, level='period', axis=1) for x in (1,2)]
# First, are they the same?
same_vals = p1_u.eq(p2_u).all(axis=1)
# If they're the same, and were changed, it was just an order change:
df['change_order'] = was_changed & same_vals
# We take advantage of set logic here, new things have been added
# if the first is a proper subset (<) of the second:
df['add_newinst'] = p1_u.id_institution.lt(p2_u.id_institution)
df['add_newprog'] = p1_u.id_program.lt(p2_u.id_program)
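# (Side note: with set-valued cells, .lt applies Python's proper-subset test
#  elementwise, e.g. {100} < {100, 200} is True, so these flags fire when the
#  post-period set strictly contains the pre-period set.)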
# Reset the index back to where we started:
df = df.reset_index()
print(df)
Output:
id_individual period rank id_institution id_program change_app change_order add_newinst add_newprog
0 1 1 0 100 101 True False False True
1 1 2 0 100 101 True False False True
2 1 2 1 100 102 True False False True
3 2 1 0 100 101 True False True True
4 2 2 0 100 101 True False True True
5 2 2 1 200 202 True False True True
6 3 1 0 100 101 False False False False
7 3 1 1 200 201 False False False False
8 3 2 0 100 101 False False False False
9 3 2 1 200 201 False False False False
10 4 1 0 100 101 True True False False
11 4 1 1 200 201 True True False False
12 4 2 0 200 201 True True False False
13 4 2 1 100 101 True True False False
~Large Frame Test - 200k IDs~
import pandas as pd
import numpy as np
from time import time
df = pd.DataFrame([[1, 1, 0, 100, 101], [1, 2, 0, 100, 101], [1, 2, 1, 100, 102], [2, 1, 0, 100, 101], [2, 2, 0, 100, 101], [2, 2, 1, 200, 202], [3, 1, 0, 100, 101], [3, 1, 1, 200, 201], [3, 2, 0, 100, 101], [3, 2, 1, 200, 201], [4, 1, 0, 100, 101], [4, 1, 1, 200, 201], [4, 2, 0, 200, 201], [4, 2, 1, 100, 101] ], columns=['id_individual', 'period', 'rank', 'id_institution', 'id_program'])
# 700k rows, 200k individuals:
df = pd.concat([df.assign(id_individual=df.id_individual.add(4*x)) for x in range(50000)], ignore_index=True)
start = time()
config = {
'index': 'id_individual',
'columns': 'period',
'values': ['id_institution', 'id_program']
}
unique = df.pivot_table(**config, aggfunc=set)
names = ['id_individual', 'period', 'rank']
df = (df.set_index(names)
.reindex(pd.MultiIndex.from_product(df[names].apply(set), names=names))
.reset_index())
changes = df.pivot_table(**config, aggfunc=tuple)
df = df.set_index('id_individual').dropna().astype(int)
p1_c, p2_c = [changes.xs(x, level='period', axis=1) for x in (1,2)]
was_changed = p1_c.ne(p2_c).any(axis=1)
df['change_app'] = was_changed
p1_u, p2_u = [unique.xs(x, level='period', axis=1) for x in (1,2)]
same_vals = p1_u.eq(p2_u).all(axis=1)
df['change_order'] = was_changed & same_vals
df['add_newinst'] = p1_u.id_institution.lt(p2_u.id_institution)
df['add_newprog'] = p1_u.id_program.lt(p2_u.id_program)
df = df.reset_index()
print('Time Taken:', round(time()-start, 2), 'seconds')
# Output:
Time Taken: 12.87 seconds
I have this DataFrame
df = pd.DataFrame({'A': [100, 100, 300, 200, 200, 200], 'B': [60, 55, 12, 32, 15, 44], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})
and I want to sort it by columns "A" and "B". "A" is always ascending. I also want "B" ascending where "C" == "x" and "B" descending where "C" == "y". So it would end up like this:
df_sorted = pd.DataFrame({'A': [100, 100, 200, 200, 200, 300], 'B': [55, 60, 44, 32, 15, 12], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})
I would split the DataFrame into two DataFrames based on the value of C (adding .copy() so the in-place sorts below don't trigger a SettingWithCopyWarning):
df_x = df.loc[df['C'] == 'x'].copy()
df_y = df.loc[df['C'] == 'y'].copy()
and then use "sort_values" like so:
df_x.sort_values(by=['A', 'B'], inplace=True)
Sorting df_y is different since you want one column ascending and the other descending; we can do it in two passes:
df_y.sort_values(by=['A'], inplace=True)
df_y.sort_values(by=['B'], inplace=True, ascending=False)
You can then concatenate the DataFrames back together and sort again by "A", using a stable sort (kind='stable' or 'mergesort'): a stable sort preserves the existing order of ties, so the "B"-descending order survives within each value of "A". A complete sketch is below.
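Putting it together, a minimal sketch of this split/sort/concat approach, assuming the df from the question:
import pandas as pd

df = pd.DataFrame({'A': [100, 100, 300, 200, 200, 200],
                   'B': [60, 55, 12, 32, 15, 44],
                   'C': ['x', 'x', 'y', 'y', 'y', 'y']})

# split by C and sort each piece with its own ordering for B
df_x = df.loc[df['C'] == 'x'].sort_values(by=['A', 'B'])
df_y = df.loc[df['C'] == 'y'].sort_values(by='B', ascending=False)

# concatenate, then re-sort by A with a stable sort so the
# within-A order of each piece is preserved
df_sorted = (pd.concat([df_x, df_y])
               .sort_values(by='A', kind='stable')
               .reset_index(drop=True))
print(df_sorted)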
You can set up a temporary column that negates the values of "B" where "C" equals "y", sort, and then drop the helper column:
(df.assign(B2=df['B']*df['C'].eq('x').mul(2).sub(1))
.sort_values(by=['A', 'B2'])
.drop('B2', axis=1)
)
Or sort within each group using groupby.apply, choosing the ordering from the group key:
def function1(dd: pd.DataFrame):
    # dd.name is the group key ('x' or 'y'), so pick the sort order per group
    return dd.sort_values(['A', 'B']) if dd.name == 'x' else dd.sort_values(['A', 'B'], ascending=[True, False])

df.groupby('C').apply(function1).reset_index(drop=True)
A B C
0 100 55 x
1 100 60 x
2 200 44 y
3 200 32 y
4 200 15 y
5 300 12 y
I have:
df = pd.DataFrame(
[
[22, 33, 44],
[55, 11, 22],
[33, 55, 11],
],
index=["abc", "def", "ghi"],
columns=list("abc")
) # size(3,3)
and:
unique = pd.Series([11, 22, 33, 44, 55]) # size(1,5)
then I create a new df based on unique and df, so that:
df_new = pd.DataFrame(index=unique, columns=df.columns) # size(5,3)
From this newly created df, I'd like to create a new boolean df based on unique and df, so that the end result is:
df_new = pd.DataFrame(
[
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
],
index=unique,
columns=df.columns
)
This new df is true or false depending on whether each value of unique is present in the corresponding column of the original DataFrame. For example, the first column of df has three values: [22, 55, 33]. In the (5, 3) result, this first column would be [0, 1, 1, 0, 1], i.e. flags for [0, 22, 33, 0, 55].
I tried filter2 = unique.isin(df), but this doesn't work; neither does notnull. I also tried applying a filter, but the dimensions returned were incorrect. How can I do this?
Use DataFrame.stack with DataFrame.reset_index and DataFrame.pivot, then check for non-missing values with DataFrame.notna, cast to integers to map True->1 and False->0, and finally remove the index and column names with DataFrame.rename_axis:
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print(df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
The helper Series is not necessary, but if it contains extra values, or you need it to control the order, add DataFrame.reindex:
# added 66, which appears nowhere in df
unique = pd.Series([11, 22, 33, 44, 55, 66])
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .reindex(unique)
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print(df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
66 0 0 0
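For reference, the isin idea from the question also works if it is applied column by column; a minimal, self-contained sketch using the df and unique from the question:
import pandas as pd

df = pd.DataFrame([[22, 33, 44], [55, 11, 22], [33, 55, 11]],
                  index=["abc", "def", "ghi"], columns=list("abc"))
unique = pd.Series([11, 22, 33, 44, 55])

# for each column of df, mark which of the unique values appear anywhere in it
df_new = df.apply(lambda col: unique.isin(col)).astype(int)
df_new.index = unique.values
print(df_new)
This reproduces the first 0/1 frame above.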
I've got this data frame with some 'init' values ('value', 'value2') that I want to subtract from the mid-term value 'mid' and the final value 'final', once I've grouped by ID.
import pandas as pd
df = pd.DataFrame({
'value': [100, 120, 130, 200, 190,210],
'value2': [2100, 2120, 2130, 2200, 2190,2210],
'ID': [1, 1, 1, 2, 2, 2],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
My attempt was to extract the index where I found 'init', 'mid' and 'final', and to subtract the 'init' value from 'mid' and 'final' once I've grouped by 'ID':
group = df.groupby('ID')
group['diff_1_f'] = group['value'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]]
group['diff_2_f'] = group['value2'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_1_m'] = group['value'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_2_m'] = group['value2'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
But of course it doesn't work. How can I obtain the following result:
df = pd.DataFrame({
'diff_value': [20, 30, -10,10],
'diff_value2': [20, 30, -10,10],
'ID': [ 1, 1, 2, 2],
'state': ['mid', 'final', 'mid', 'final'],
})
Also in its grouped form.
Use:
# columns to subtract
cols = ['value', 'value2']
# names of the new difference columns
new = [c + '_diff' for c in cols]
# boolean mask: True for the non-'init' rows
m = df['state'].ne('init')
# join each row with its ID's 'init' values and keep only the non-'init' rows;
# lsuffix renames the original columns to value_diff / value2_diff
df1 = df.join(df[~m].set_index('ID')[cols], lsuffix='_diff', on='ID')[m]
# subtract using the numpy array (.values) to prevent index alignment
df1[new] = df1[new].sub(df1[cols].values)
# remove the helper 'init' columns
df1 = df1.drop(cols, axis=1)
print(df1)
value_diff value2_diff ID state
1 20 20 1 mid
2 30 30 1 final
4 -10 -10 2 mid
5 10 10 2 final
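A groupby-based alternative, offered only as a sketch: it assumes each ID's 'init' row is the first row of its group, which holds for the sample data.
import pandas as pd

df = pd.DataFrame({
    'value': [100, 120, 130, 200, 190, 210],
    'value2': [2100, 2120, 2130, 2200, 2190, 2210],
    'ID': [1, 1, 1, 2, 2, 2],
    'state': ['init', 'mid', 'final', 'init', 'mid', 'final'],
})

cols = ['value', 'value2']
# broadcast each ID's first ('init') row over the whole group and subtract it
diffs = df[cols].sub(df.groupby('ID')[cols].transform('first')).add_suffix('_diff')
out = df[['ID', 'state']].join(diffs).query("state != 'init'")
print(out)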