Matches not found by pd.DataFrame.merge - pandas

I've got three pd.DataFrames:
df1 = pd.DataFrame({'var1': {0: 2210, 1: 2210, 2: 2210, 3: 2210, 4: 2210, 5: 2210, 6: 2210, 7: 2210, 8: 2210, 9: 2210, 10: 2210, 11: 2210, 12: 2210, 13: 2210, 14: 2210, 15: 2210, 16: 2210, 17: 2210, 18: 2210, 19: 2210, 20: 2210, 21: 2210}, 'var2': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1, 21: 2}, 'var3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var4': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var5': {0: '121160', 1: '20066', 2: ' 58621', 3: ' 201084', 4: ' 100180', 5: ' 74230', 6: ' 27789', 7: ' 66975', 8: ' 57410', 9: ' 49413', 10: ' 57112', 11: ' 19188', 12: ' 61366', 13: ' 27341', 14: ' 59859', 15: ' 173954', 16: ' 205651', 17: ' 54861', 18: ' 165809', 19: ' 60252', 20: ' 182156', 21: ' 82403'}})
df2 = pd.DataFrame({'var1': {349176: 2210, 349225: 2210, 349913: 2210, 350247: 2210, 350342: 2210, 350518: 2210}, 'var2': {349176: 2, 349225: 1, 349913: 1, 350247: 2, 350342: 1, 350518: 2}, 'var5': {349176: 58786.0, 349225: 37572.0, 349913: 103955.0, 350247: 19197.0, 350342: 14664.0, 350518: 75773.0}, 'var3': {349176: 19, 349225: 22, 349913: 56, 350247: 75, 350342: 80, 350518: 95}, 'var4': {349176: 8, 349225: 52, 349913: 42, 350247: 0, 350342: 50, 350518: 17}})
df3 = pd.DataFrame({'var1': {349175: 2210, 349224: 2210, 349912: 2210, 350246: 2210, 350341: 2210, 350517: 2210, 350521: 2210}, 'var2': {349175: 2, 349224: 1, 349912: 1, 350246: 2, 350341: 1, 350517: 2, 350521: 1}, 'var5': {349175: 19188.0, 349224: 205651.0, 349912: 59859.0, 350246: 27341.0, 350341: 165809.0, 350517: 19197.0, 350521: 61366.0}, 'var6': {349175: 19, 349224: 22, 349912: 56, 350246: 75, 350341: 80, 350517: 95, 350521: 95}, 'var7': {349175: 8, 349224: 52, 349912: 42, 350246: 0, 350341: 50, 350517: 17, 350521: 40}})
I need to stack df1 and df2 together, then left-join the result with df3 on multiple columns: var1, var2, var5.
So I wrote:
pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])
but it doesn't find all the matching rows. Switching to an outer join shows, for example, two rows with the same values of var1, var2 and var5 (the 11th and the 28th) that haven't been joined:
pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'outer', on = ['var1', 'var2', 'var5'])
I'm struggling to find a reason for that behaviour. I thought maybe data types are different within joining columns, but no - they are the same. I'm relatively new to Pandas, so maybe I'm missing something obvious here? What is the reason for that (unexpected) behaviour?

pd.concat([df1, df2], axis = 0).dtypes
results in
var1 int64
var2 int64
var3 int64
var4 int64
var5 object
dtype: object
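As you can see, after the concat var5 is an object column, because df1 stores it as strings (most of them with a leading space) while df2 stores it as floats. If you merge at this point you will get no matches, since var5 in df3 is a float.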
Here is what I would recommend:
df1['var5'] = df1['var5'].astype(float)
df2['var5'] = df2['var5'].astype(float)
df3['var5'] = df3['var5'].astype(float)
pd.concat([df1, df2], axis = 0).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])
This will produce the following DataFrame:
var1 var2 var3 var4 var5 var6 var7
0 2210 1 0 0 121160.0 NaN NaN
1 2210 2 0 0 20066.0 NaN NaN
2 2210 1 0 0 58621.0 NaN NaN
3 2210 2 0 0 201084.0 NaN NaN
4 2210 1 0 0 100180.0 NaN NaN
5 2210 2 0 0 74230.0 NaN NaN
6 2210 1 0 0 27789.0 NaN NaN
7 2210 2 0 0 66975.0 NaN NaN
8 2210 1 0 0 57410.0 NaN NaN
9 2210 2 0 0 49413.0 NaN NaN
10 2210 1 0 0 57112.0 NaN NaN
11 2210 2 0 0 19188.0 19.0 8.0
12 2210 1 0 0 61366.0 95.0 40.0
13 2210 2 0 0 27341.0 75.0 0.0
14 2210 1 0 0 59859.0 56.0 42.0
15 2210 2 0 0 173954.0 NaN NaN
16 2210 1 0 0 205651.0 22.0 52.0
17 2210 2 0 0 54861.0 NaN NaN
18 2210 1 0 0 165809.0 80.0 50.0
19 2210 2 0 0 60252.0 NaN NaN
20 2210 1 0 0 182156.0 NaN NaN
21 2210 2 0 0 82403.0 NaN NaN
22 2210 2 19 8 58786.0 NaN NaN
23 2210 1 22 52 37572.0 NaN NaN
24 2210 1 56 42 103955.0 NaN NaN
25 2210 2 75 0 19197.0 95.0 17.0
26 2210 1 80 50 14664.0 NaN NaN
27 2210 2 95 17 75773.0 NaN NaN

When I ran your code on my computer, then used df#.dtypes to get the types, the dtype of the var5 column in df1 is object, whereas in df2 and df3 it's float64. The concat runs fine with this (and after the concat, the dtype is object), but when I tried to run the merge (outer or left), I got a ValueError:
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
I'd suggest double checking the types again (I know you already checked that). If they really are the same on your computer, I'm not sure what's going on.
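One way to double-check is to print the dtype of each key column in every frame and to flag the rows that fail to match. The snippet below is a minimal sketch on small made-up frames (`left`, `right` and their values are hypothetical, not the data from the question). It uses `pd.to_numeric` to normalise a string key and `indicator=True` to show which rows matched.

```python
import pandas as pd

# Hypothetical miniature of the situation: the key is stored as strings
# (one with a leading space) on the left and as floats on the right.
left = pd.DataFrame({"var1": [2210, 2210], "var2": [2, 1], "var5": [" 19188", "121160"]})
right = pd.DataFrame({"var1": [2210], "var2": [2], "var5": [19188.0], "var6": [19]})

keys = ["var1", "var2", "var5"]

# Compare the key dtypes side by side; a row where they differ is a suspect.
dtype_report = pd.DataFrame({"left": left[keys].dtypes, "right": right[keys].dtypes})
print(dtype_report)

# Strip whitespace and convert the string key to float;
# values that cannot be parsed become NaN.
left["var5"] = pd.to_numeric(left["var5"].str.strip(), errors="coerce").astype(float)

# indicator=True adds a _merge column: 'both' for matched rows,
# 'left_only' for rows without a partner in `right`.
merged = left.merge(right, how="left", on=keys, indicator=True)
print(merged)
```

Filtering `merged` on `_merge == 'left_only'` then lists exactly the rows that found no partner, which is usually quicker than comparing an outer join by eye.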

Related

How to keep the number and names of columns in training and test dataset equal after one hot encoding?

Shape of the original dataset is 82580×30 with multiple string columns. Example dataset:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
After modeling, I tried to test on a completely different test set:
df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
transformer = make_column_transformer(
(OneHotEncoder(sparse=False), ['Nationality']),
remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
As can be seen, the test dataset has some features that were not present in the original training set, and many features of the training set are not present in the test set. If I only use .values of X_train, y_train, X_test, y_test, I can run everything from logistic regression to a neural net with >99% accuracy, but that feels like cheating and does not work with decision trees. How do we deal with this?
I would like to contribute 2 inputs:
(1) the categories in the test set should be a subset of those in the training set, so the unknown Nationality 'CAN' is not allowed. Either include 'CAN' in the training data, or replace it in the test data with a known category such as 'GBR'.
(2) you should not call fit_transform() separately on the training and test sets. The right way is to fit on the training set only, then transform both the training set and the test set with that fitted transformer. To illustrate:
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
####transformed = transformer.fit_transform(df) #delete this
transformer.fit(df) #use this instead
transformed = transformer.transform(df) #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)
For the second part, note that I have replaced 'CAN' with 'GBR' and only use the previously fitted transformer to transform the test set:
df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough') #do not repeat, use the previous fitted model
####transformed = transformer.fit_transform(df) #delete this, NO fitting on test set
transformed = transformer.transform(df) #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)
So the number of columns (14) is the same for both the training set and the test set.

multiple nested groupby in pandas

Here is my pandas dataframe:
df = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12',6: '2016-10-12',7: '2016-10-12',8: '2016-10-12',9: '2016-10-12'}, 'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H',8: 'I', 9:'J'}, 'Sector': {0: 0,1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6:0, 7:0, 8:1, 9:1}, 'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6:2,7:2,8:3,9:3}, 'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6:0, 7:23, 8:5, 9:5}})
I want to add the following columns:
'Date_Range_Avg': average of 'Range' grouped by Date
'Date_Sector_Range_Avg': average of 'Range' grouped by Date and Sector
'Date_Segment_Range_Avg': average of 'Range' grouped by Date and Segment
This would be the output:
res = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12',6: '2016-10-12',7: '2016-10-12',8: '2016-10-12',9: '2016-10-12'}, 'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H',8: 'I', 9:'J'}, 'Sector': {0: 0,1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6:0, 7:0, 8:1, 9:1}, 'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6:2,7:2,8:3,9:3}, 'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6:0, 7:23, 8:5, 9:5}, 'Date_Range_Avg':{0: 1.6, 1: 1.6, 2: 1.6, 3: 1.6, 4: 1.6, 5: 7.8, 6: 7.8, 7: 7.8, 8:7.8, 9: 7.8}, 'Date_Sector_Range_Avg':{0: 2.5, 1: 2.5, 2: 1, 3: 1, 4: 1, 5: 9.67, 6: 9.67, 7: 9.67, 8: 5, 9: 5}, 'Date_Segment_Range_Avg':{0: 5, 1: 0.75, 2: 0.75, 3: 0.75, 4: 0.75, 5: 6, 6: 11.5, 7: 11.5, 8: 5, 9: 5}})
Note I have rounded some of the values - but this rounding is not essential for the question I have (please feel free to not round)
I'm aware that I can do each of these groupings separately but it strikes me as inefficient (my dataset contains millions of rows)
Essentially, I would like to first do a grouping by Date and then re-use it to do the two more fine-grained groupings by Date and Segment and by Date and Sector.
How to do this?
My initial hunch is to go like this:
day_groups = df.groupby("Date")
df['Date_Range_Avg'] = day_groups['Range'].transform('mean')
and then to re-use day_groups to do the 2 more fine-grained groupbys like this:
df['Date_Sector_Range_Avg'] = day_groups.groupby('Sector')['Range'].transform('mean')
Which doesn't work as you get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
groupby runs really fast when the aggregate function is vectorized. If you are worried about performance, try it out first to see if it's the real bottleneck in your program.
You can create temporary data frames holding the result of each groupby, then successively merge them with df:
group_bys = {
"Date_Range_Avg": ["Date"],
"Date_Sector_Range_Avg": ["Date", "Sector"],
"Date_Segment_Range_Avg": ["Date", "Segment"]
}
tmp = [
df.groupby(columns)["Range"].mean().to_frame(key)
for key, columns in group_bys.items()
]
result = df
for t in tmp:
    result = result.merge(t, left_on=t.index.names, right_index=True)
Result:
Date Stock Sector Segment Range Date_Range_Avg Date_Sector_Range_Avg Date_Segment_Range_Avg
0 2016-10-11 A 0 0 5 1.6 2.500000 5.00
1 2016-10-11 B 0 1 0 1.6 2.500000 0.75
2 2016-10-11 C 1 1 1 1.6 1.000000 0.75
3 2016-10-11 D 1 1 0 1.6 1.000000 0.75
4 2016-10-11 E 1 1 2 1.6 1.000000 0.75
5 2016-10-12 F 0 1 6 7.8 9.666667 6.00
6 2016-10-12 G 0 2 0 7.8 9.666667 11.50
7 2016-10-12 H 0 2 23 7.8 9.666667 11.50
8 2016-10-12 I 1 3 5 7.8 5.000000 5.00
9 2016-10-12 J 1 3 5 7.8 5.000000 5.00
Another option is to use transform, and avoid the multiple merges:
# reusing your code
group_bys = {
"Date_Range_Avg": ["Date"],
"Date_Sector_Range_Avg": ["Date", "Sector"],
"Date_Segment_Range_Avg": ["Date", "Segment"]
}
tmp = {key : df.groupby(columns)["Range"].transform('mean')
for key, columns in group_bys.items()
}
df.assign(**tmp)

Split dataframe column by content

How can I separate this data column by 'A','B' ...?
The first column as an index must be retained.
df = pd.DataFrame(data)
df = df[['seconds', 'marker', 'data1', 'data2', 'data3']]
seconds,marker,data1,data2,data3
00001,A,3,3,0,42,0
00002,B,3,3,0,34556,0
00003,C,3,3,0,42,0
00004,A,3,3,0,1833,0
00004,B,3,3,0,6569,0
00005,C,3,3,0,2454,0
00006,C,3,3,0,3256,0
00007,C,3,3,0,5423,0
00008,A,3,3,0,569,0
You can just get the unique values in the letter column (that's what I called it). And then filter the DataFrame containing all values using these unique values.
I am storing the newly created DataFrames in a dictionary here, but you could also store them in a list or whatever. I've used the input you have provided but have given the first 2 columns the names index and letter as they were unnamed in your .csv.
import pandas as pd
df = pd.DataFrame({
'index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8},
'letter': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'C', 7: 'C', 8: 'A'},
'seconds': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3},
'marker': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3},
'data1': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'data2': {0: 42, 1: 34556, 2: 42, 3: 1833, 4: 6569, 5: 2454, 6: 3256, 7: 5423, 8: 569},
'data3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0}
})
# get unique values
unique_values = df["letter"].unique()
# filter "big" dataframe using one of the unique value at a time
split_dfs = {value: df[df["letter"] == value] for value in unique_values}
print(split_dfs["A"])
print(split_dfs["B"])
print(split_dfs["C"])
Expected output:
index letter seconds marker data1 data2 data3
0 1 A 3 3 0 42 0
3 4 A 3 3 0 1833 0
8 8 A 3 3 0 569 0
index letter seconds marker data1 data2 data3
1 2 B 3 3 0 34556 0
4 4 B 3 3 0 6569 0
index letter seconds marker data1 data2 data3
2 3 C 3 3 0 42 0
5 5 C 3 3 0 2454 0
6 6 C 3 3 0 3256 0
7 7 C 3 3 0 5423 0
As you can see the index is preserved.
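The same split can also be written with groupby, which yields one sub-frame per distinct value without filtering the frame once per letter. This is a minimal sketch, assuming the column names index and letter used above and a cut-down, made-up sample of the data:

```python
import pandas as pd

# Cut-down, made-up sample with the same column names as above.
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 4],
    "letter": ["A", "B", "C", "A", "B"],
    "data2": [42, 34556, 42, 1833, 6569],
})

# Iterating over a groupby yields (key, sub-frame) pairs; each sub-frame
# keeps the row labels of the original frame.
split_dfs = {letter: group for letter, group in df.groupby("letter")}

print(split_dfs["A"])
```

The dictionary has the same shape as the one built from the unique values, so the rest of the code can stay unchanged.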

APOPT solver finds different solution every time

I'm using gekko to solve a MINLP problem. I'm using the APOPT solver since it is the only one that can provide integer solutions, which are strictly needed in my case.
The issue I have is that every time I run the solver I get a different solution, even for very small cases, so I can't be sure about the optimality of the solutions. The final objective value differs considerably from one solution to another.
I've noticed that only 1 iteration takes place, and I don't know if it should be like this. Also, it runs for less than 1 second when it could run longer and find better solutions.
Below is the output of one of these runs:
--------- APM Model Size ------------
Each time step contains
Objects : 86
Constants : 0
Variables : 606
Intermediates: 0
Connections : 806
Equations : 1046
Residuals : 1046
Number of state variables: 606
Number of total equations: - 172
Number of slack variables: - 40
---------------------------------------
Degrees of freedom : 394
----------------------------------------------
Steady State Optimization with APOPT Solver
----------------------------------------------
Iter: 1 I: 0 Tm: 0.38 NLPi: 28 Dpth: 0 Lvs: 0 Obj: 1.27E+09 Gap: 0.00E+00
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.38800000000628643 sec
Objective : 1271480486.0000000
Successful solution
---------------------------------------------------
I would appreciate any advice on how to tweak the solver so I can have consistent solutions. Thanks in advance
EDIT
I'm adding the code (reduced version). Basically, what I'm trying to do is find the best locations for n warehouses based on the demand of m cities, each represented by its corresponding hexagon in the h3 library.
from gekko import GEKKO
from random import randint
import pandas as pd
n_warehouses = 6
df_aggreg = pd.DataFrame.from_dict({'hex_id_basic': {32: 0, 12: 1, 2: 2, 3: 3, 22: 4, 24: 5, 38: 6, 8: 7, 19: 8, 27: 9, 21: 10, 25: 11, 28: 12, 26: 13, 29: 14, 30: 15, 31: 16, 1: 17, 20: 18, 18: 19, 17: 20, 33: 21, 35: 22, 14: 23, 13: 24, 11: 25, 10: 26, 36: 27, 6: 28, 5: 29, 4: 30, 16: 31, 34: 32, 0: 33, 23: 34, 15: 35, 9: 36, 37: 37, 7: 38, 39: 39}, 'value': {32: 1808, 12: 847, 2: 847, 3: 847, 22: 847, 24: 847, 38: 847, 8: 847, 19: 847, 27: 847, 21: 452, 25: 452, 28: 452, 26: 452, 29: 452, 30: 452, 31: 452, 1: 452, 20: 452, 18: 452, 17: 452, 33: 452, 35: 452, 14: 452, 13: 452, 11: 452, 10: 452, 36: 452, 6: 452, 5: 452, 4: 452, 16: 452, 34: 169, 0: 169, 23: 84, 15: 84, 9: 84, 37: 84, 7: 84, 39: 84}})
distance_matrix = [[0, 278320, 302712, 117682, 283287, 225645, 303065, 258900, 252165, 453768, 389125, 305694, 377415, 445329, 176671, 16098, 95378, 272352, 153948, 247011, 153063, 175620, 184734, 253592, 235204, 275271, 204377, 140151, 270207, 254950, 364642, 92383, 239928, 300635, 394936, 291140, 293377, 205417, 253778, 313127], [278320, 0, 144398, 187013, 244571, 107924, 257813, 279533, 334530, 195669, 205157, 84028, 109937, 187707, 427560, 265135, 188779, 292672, 307462, 280094, 408319, 281660, 131476, 440829, 46693, 204251, 233509, 147273, 263519, 267077, 185835, 310005, 254863, 323378, 119817, 407585, 280337, 171576, 280105, 261055], [302712, 144398, 0, 185183, 377007, 234161, 392715, 170354, 237346, 307577, 347861, 228110, 138929, 301200, 476655, 286696, 244493, 179154, 254571, 176769, 387398, 378208, 238178, 372115, 163476, 79788, 152001, 222356, 146928, 158337, 62584, 289416, 375428, 203381, 217686, 310387, 155900, 289533, 173468, 397435], [117682, 187013, 185183, 0, 291864, 179791, 311714, 170740, 192926, 379238, 347869, 240050, 272172, 370968, 293195, 101616, 87586, 185977, 129253, 162328, 222594, 233040, 150108, 262697, 153805, 162481, 109099, 101869, 173375, 162759, 246974, 123207, 265470, 218641, 306739, 255114, 197339, 195959, 167223, 320151], [283287, 244571, 377007, 291864, 0, 142846, 20212, 455483, 484791, 287216, 161617, 183085, 340463, 280631, 326676, 282534, 212575, 470828, 409823, 449527, 434084, 135211, 148574, 531347, 213849, 411617, 394742, 190552, 450777, 445332, 427673, 367985, 53066, 504293, 288485, 544788, 473256, 96874, 453151, 29897], [225645, 107924, 234161, 179791, 142846, 0, 158910, 325969, 365598, 232092, 169623, 84652, 214438, 223750, 344437, 216987, 130954, 340949, 308549, 322092, 374352, 178834, 41429, 440518, 71333, 272261, 268233, 88085, 317511, 314784, 285789, 286639, 147154, 374185, 190689, 432588, 338741, 64095, 324564, 164244], [303065, 257813, 392715, 311714, 20212, 158910, 0, 474649, 504627, 286841, 156847, 192075, 
350549, 280649, 341260, 302511, 232768, 489976, 430017, 468858, 453490, 151205, 167295, 551499, 229266, 429328, 414077, 210234, 469552, 464393, 442147, 388101, 68990, 523448, 294341, 564878, 491892, 116277, 472394, 10359], [258900, 279533, 170354, 170740, 455483, 325969, 474649, 0, 68930, 467349, 478557, 358082, 308273, 460170, 428178, 244879, 258325, 15394, 131661, 13671, 271974, 403214, 307634, 213649, 272011, 90917, 61801, 266220, 26455, 12593, 202481, 195575, 434164, 48812, 379747, 140257, 39308, 358628, 6107, 482200], [252165, 334530, 237346, 192926, 484791, 365598, 504627, 68930, 0, 526826, 526392, 408559, 373685, 519349, 407413, 240840, 276462, 66186, 101574, 60620, 226811, 414422, 341075, 147167, 320118, 158889, 101274, 294543, 95127, 79800, 271409, 171262, 456912, 68984, 441013, 74111, 100302, 388712, 64684, 513007], [453768, 195669, 307577, 379238, 287216, 232092, 286841, 467349, 526826, 0, 136244, 148461, 178102, 8441, 573471, 443171, 358499, 479086, 502816, 469837, 595669, 392502, 273266, 636429, 225844, 383011, 426871, 313638, 447989, 454756, 317090, 500323, 328239, 507228, 90088, 600610, 461075, 278672, 468809, 282694], [389125, 205157, 347861, 347869, 161617, 169623, 156847, 478557, 526392, 136244, 0, 121715, 255074, 131581, 475633, 382466, 297470, 492617, 477097, 477118, 540843, 284992, 206940, 609805, 206849, 408806, 425937, 256909, 465228, 466469, 379467, 455895, 209773, 524589, 174173, 596319, 483539, 187375, 478271, 151054], [305694, 84028, 228110, 240050, 183085, 84652, 192075, 358082, 408559, 148461, 121715, 0, 158796, 140049, 428839, 295540, 210329, 371886, 368057, 357279, 450043, 256851, 125251, 502181, 89028, 287097, 307444, 165570, 343975, 345858, 265337, 357240, 207855, 403529, 110982, 479964, 361969, 139306, 358064, 193009], [377415, 109937, 138929, 272172, 340463, 214438, 350549, 308273, 373685, 178102, 255074, 158796, 0, 173125, 534732, 362967, 293628, 317765, 374520, 313604, 493385, 391108, 241276, 501197, 156178, 218494, 280742, 
254733, 285601, 295994, 139047, 393079, 358593, 342270, 90310, 447455, 294687, 278510, 310960, 351759], [445329, 187707, 301200, 370968, 280631, 223750, 280649, 460170, 519349, 8441, 131581, 140049, 173125, 0, 565355, 434730, 350062, 472002, 494725, 462533, 587263, 384786, 264901, 628392, 217517, 376220, 419266, 305200, 440993, 447576, 311937, 491975, 321105, 500318, 84177, 593082, 454296, 270683, 461573, 276700], [176671, 427560, 476655, 293195, 326676, 344437, 341260, 428178, 407413, 573471, 475633, 428839, 534732, 565355, 0, 192659, 241685, 440309, 306114, 415366, 213722, 192267, 305547, 353506, 381057, 451860, 378110, 280381, 442537, 425977, 539074, 236503, 273614, 465255, 533298, 426186, 464854, 295413, 422602, 351173], [16098, 265135, 286696, 101616, 282534, 216987, 302511, 244879, 240840, 443171, 382466, 295540, 362967, 434730, 192659, 0, 86037, 258560, 144482, 233219, 158401, 180937, 176621, 250690, 222751, 259395, 189468, 130132, 255508, 240568, 348587, 87437, 241145, 287389, 382525, 282778, 278823, 201018, 239867, 312431], [95378, 188779, 244493, 87586, 212575, 130954, 232768, 258325, 276462, 358499, 297470, 210329, 293628, 350062, 241685, 86037, 0, 273554, 197249, 249773, 243741, 146171, 90840, 321549, 143532, 241689, 196668, 44890, 260330, 250266, 306721, 160762, 180583, 306135, 302150, 332965, 284225, 121772, 254773, 241969], [272352, 292672, 179154, 185977, 470828, 340949, 489976, 15394, 66186, 479086, 492617, 371886, 317765, 472002, 440309, 258560, 273554, 0, 140627, 25344, 279225, 418280, 322922, 213131, 286282, 99366, 77144, 281612, 32238, 26277, 207287, 206362, 449552, 33473, 391005, 133626, 34477, 373979, 19000, 497504], [153948, 307462, 254571, 129253, 409823, 308549, 430017, 131661, 101574, 502816, 477097, 368057, 374520, 494725, 306114, 144482, 197249, 140627, 0, 117996, 141824, 323726, 276426, 134201, 280163, 191685, 105340, 226127, 152834, 133897, 305377, 69716, 375475, 161033, 424673, 138867, 170911, 317471, 125583, 439203], [247011, 
280094, 176769, 162328, 449527, 322092, 468858, 13671, 60620, 469837, 477118, 357279, 313604, 462533, 415366, 233219, 249773, 25344, 117996, 0, 258348, 393955, 302253, 202640, 270296, 98356, 54784, 259602, 38118, 20016, 211973, 182205, 426898, 56395, 382945, 133870, 52974, 352657, 7588, 476592], [153063, 408319, 387398, 222594, 434084, 374352, 453490, 271974, 226811, 595669, 540843, 450043, 493385, 587263, 213722, 158401, 243741, 279225, 141824, 258348, 0, 314904, 334581, 140210, 370743, 331489, 244983, 286515, 294300, 275249, 442428, 100328, 387881, 294243, 528076, 223497, 311245, 358451, 265867, 463708], [175620, 281660, 378208, 233040, 135211, 178834, 151205, 403214, 414422, 392502, 284992, 256851, 391108, 384786, 192267, 180937, 146171, 418280, 323726, 393955, 314904, 0, 150414, 428871, 237099, 386147, 341933, 156045, 406310, 395685, 438153, 267684, 82364, 450314, 367218, 462586, 430241, 117601, 399353, 161423], [184734, 131476, 238178, 150108, 148574, 41429, 167295, 307634, 341075, 273266, 206940, 125251, 241276, 264901, 305547, 176621, 90840, 322922, 276426, 302253, 334581, 150414, 0, 406344, 86738, 264273, 247556, 50786, 302265, 297224, 294689, 249321, 137664, 356394, 227861, 405170, 324686, 53055, 305544, 174613], [253592, 440829, 372115, 262697, 531347, 440518, 551499, 213649, 147167, 636429, 609805, 502181, 501197, 628392, 353506, 250690, 321549, 213131, 134201, 202640, 140210, 428871, 406344, 0, 414364, 298374, 221122, 355576, 240067, 222654, 413797, 163640, 491733, 212409, 556786, 103608, 247427, 443322, 208464, 561173], [235204, 46693, 163476, 153805, 213849, 71333, 229266, 272011, 320118, 225844, 206849, 89028, 156178, 217517, 381057, 222751, 143532, 286282, 280163, 270296, 370743, 237099, 86738, 414364, 0, 208101, 219280, 101025, 259858, 260053, 214484, 274518, 217001, 318654, 159780, 391112, 279361, 131548, 271564, 233971], [275271, 204251, 79788, 162481, 411617, 272261, 429328, 90917, 158889, 383011, 408806, 287097, 218494, 376220, 451860, 259395, 
241689, 99366, 191685, 98356, 331489, 386147, 264273, 298374, 208101, 0, 86592, 233845, 67147, 79218, 115829, 239530, 400051, 124610, 293938, 231168, 78076, 317319, 94401, 435547], [204377, 233509, 152001, 109099, 394742, 268233, 414077, 61801, 101274, 426871, 425937, 307444, 280742, 419266, 378110, 189468, 196668, 77144, 105340, 54784, 244983, 341933, 247556, 221122, 219280, 86592, 0, 204896, 66374, 53769, 200202, 155140, 372506, 110258, 342408, 174094, 90128, 297872, 58728, 421827], [140151, 147273, 222356, 101869, 190552, 88085, 210234, 266220, 294543, 313638, 256909, 165570, 254733, 305200, 280381, 130132, 44890, 281612, 226127, 259602, 286515, 156045, 50786, 355576, 101025, 233845, 204896, 0, 263729, 256622, 282929, 198993, 168664, 314976, 258111, 356656, 286974, 94168, 263557, 218490], [270207, 263519, 146928, 173375, 450777, 317511, 469552, 26455, 95127, 447989, 465228, 343975, 285601, 440993, 442537, 255508, 260330, 32238, 152834, 38118, 294300, 406310, 302265, 240067, 259858, 67147, 66374, 263729, 0, 19120, 176392, 213505, 432376, 59886, 359568, 165204, 23979, 354255, 31668, 476705], [254950, 267077, 158337, 162759, 445332, 314784, 464393, 12593, 79800, 454756, 466469, 345858, 295994, 447576, 425977, 240568, 250266, 26277, 133897, 20016, 275249, 395685, 297224, 222654, 260053, 79218, 53769, 256622, 19120, 0, 192090, 195450, 424902, 59414, 367175, 152072, 38895, 348532, 15183, 471835], [364642, 185835, 62584, 246974, 427673, 285789, 442147, 202481, 271409, 317090, 379467, 265337, 139047, 311937, 539074, 348587, 306721, 207287, 305377, 211973, 442428, 438153, 294689, 413797, 214484, 115829, 200202, 282929, 176392, 192090, 0, 346396, 430496, 224040, 228276, 340846, 177161, 344487, 206946, 445989], [92383, 310005, 289416, 123207, 367985, 286639, 388101, 195575, 171262, 500323, 455895, 357240, 393079, 491975, 236503, 87437, 160762, 206362, 69716, 182205, 100328, 267684, 249321, 163640, 274518, 239530, 155140, 198993, 213505, 195450, 346396, 0, 328145, 229377, 
429805, 200778, 233973, 281930, 189695, 397850], [239928, 254863, 375428, 265470, 53066, 147154, 68990, 434164, 456912, 328239, 209773, 207855, 358593, 321105, 273614, 241145, 180583, 449552, 375475, 426898, 387881, 82364, 137664, 491733, 217001, 400051, 372506, 168664, 432376, 424902, 430496, 328145, 0, 482764, 317893, 512855, 455638, 86038, 431229, 79283], [300635, 323378, 203381, 218641, 504293, 374185, 523448, 48812, 68984, 507228, 524589, 403529, 342270, 500318, 465255, 287389, 306135, 33473, 161033, 56395, 294243, 450314, 356394, 212409, 318654, 124610, 110258, 314976, 59886, 59414, 224040, 229377, 482764, 0, 418433, 121305, 47807, 407440, 51547, 530978], [394936, 119817, 217686, 306739, 288485, 190689, 294341, 379747, 441013, 90088, 174173, 110982, 90310, 84177, 533298, 382525, 302150, 391005, 424673, 382945, 528076, 367218, 227861, 556786, 159780, 293938, 342408, 258111, 359568, 367175, 228276, 429805, 317893, 418433, 0, 515064, 371904, 249637, 381509, 293301], [291140, 407585, 310387, 255114, 544788, 432588, 564878, 140257, 74111, 600610, 596319, 479964, 447455, 593082, 426186, 282778, 332965, 133626, 138867, 133870, 223497, 462586, 405170, 103608, 391112, 231168, 174094, 356656, 165204, 152072, 340846, 200778, 512855, 121305, 515064, 0, 164624, 450245, 136922, 573716], [293377, 280337, 155900, 197339, 473256, 338741, 491892, 39308, 100302, 461075, 483539, 361969, 294687, 454296, 464854, 278823, 284225, 34477, 170911, 52974, 311245, 430241, 324686, 247427, 279361, 78076, 90128, 286974, 23979, 38895, 177161, 233973, 455638, 47807, 371904, 164624, 0, 376916, 45412, 498907], [205417, 171576, 289533, 195959, 96874, 64095, 116277, 358628, 388712, 278672, 187375, 139306, 278510, 270683, 295413, 201018, 121772, 373979, 317471, 352657, 358451, 117601, 53055, 443322, 131548, 317319, 297872, 94168, 354255, 348532, 344487, 281930, 86038, 407440, 249637, 450245, 376916, 0, 356278, 124356], [253778, 280105, 173468, 167223, 453151, 324564, 472394, 6107, 64684, 468809, 
478271, 358064, 310960, 461573, 422602, 239867, 254773, 19000, 125583, 7588, 265867, 399353, 305544, 208464, 271564, 94401, 58728, 263557, 31668, 15183, 206946, 189695, 431229, 51547, 381509, 136922, 45412, 356278, 0, 480029], [313127, 261055, 397435, 320151, 29897, 164244, 10359, 482200, 513007, 282694, 151054, 193009, 351759, 276700, 351173, 312431, 241969, 497504, 439203, 476592, 463708, 161423, 174613, 561173, 233971, 435547, 421827, 218490, 476705, 471835, 445989, 397850, 79283, 530978, 293301, 573716, 498907, 124356, 480029, 0]]
# Initialize Model
m = GEKKO(remote=False)
m.options.SOLVER = 1
m.options.IMODE = 3
# VARIABLES
print("Creating variables...")
# warehouse locations
warehouses_to_hexagon_vars = []
for dks_id in range(n_warehouses):
    warehouses_to_hexagon_vars.append([m.Var(value=0, lb=0, ub=1, integer=True, name=f"dk_{dks_id}_{hex_id}")
                                       for hex_id in df_aggreg["hex_id_basic"].unique()])
# hexagon assigned to warehouse
hexagon_to_warehouse_vars = []
for hex_id in df_aggreg["hex_id_basic"].unique():
    hexagon_to_warehouse_vars.append([m.Var(value=0, lb=0, ub=1, integer=True, name=f"hx_{hex_id}_{dks_id}")
                                      for dks_id in range(n_warehouses)])
# CONSTRAINTS
# A warehouse located only in one hexagon
print("Creating constraints...")
for dks_vars in warehouses_to_hexagon_vars:
    m.Equation(m.sum(dks_vars) == 1)
# One hexagon contains at most one warehouse
for hex_id in df_aggreg["hex_id_basic"].unique():
    m.Equation(m.sum([dks_vars[hex_id] for dks_vars in warehouses_to_hexagon_vars]) <= 1)
# One hexagon assigned only to one warehouse
for hex_vars in hexagon_to_warehouse_vars:
    m.Equation(m.sum(hex_vars) == 1)
# WARM START
for dks_id in range(n_warehouses):
    warehouses_to_hexagon_vars[dks_id][randint(0, len(warehouses_to_hexagon_vars[dks_id]) - 1)].value = 1
for hex_id in range(len(hexagon_to_warehouse_vars)):
    hexagon_to_warehouse_vars[hex_id][randint(0, n_warehouses - 1)].value = 1
# Set objective function
print("Creating objective function...")
for wh_id in range(n_warehouses):
    for hex_id_1 in df_aggreg["hex_id_basic"].unique():
        for hex_id_2 in df_aggreg["hex_id_basic"].unique():
            distance_hexagon_to_warehouse = int(distance_matrix[hex_id_1][hex_id_2])
            demand_hexagon = df_aggreg.loc[df_aggreg["hex_id_basic"] == hex_id_2, "value"].values[0]
            m.Obj(warehouses_to_hexagon_vars[wh_id][hex_id_1] *
                  hexagon_to_warehouse_vars[hex_id_2][wh_id] *
                  distance_hexagon_to_warehouse *
                  demand_hexagon)
# Solve simulation
m.solve()
I've implemented a parallel solution in which I run the same model n times, each with a different initial solution. Once all models have been solved, I take the solution with the lowest objective value.
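The multi-start idea described above can be sketched roughly as follows. This is only an illustration under loud assumptions: `solve_from_start` and `multi_start` are hypothetical names, and the toy objective with a simple descent stands in for building and solving one model instance; it is not the warehouse model or the poster's actual implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def solve_from_start(seed):
    """Hypothetical stand-in for one solve started from a random initial solution."""
    rng = random.Random(seed)
    x = rng.randint(-10, 10)  # random initial solution for this run

    def cost(v):
        # Toy objective (assumption), minimised at v == 3.
        return (v - 3) ** 2

    # Simple descent: move to the best neighbour until no neighbour improves.
    while True:
        best = min((x - 1, x, x + 1), key=cost)
        if best == x:
            break
        x = best
    return cost(x), x


def multi_start(n_runs):
    # Run the independent solves concurrently, then keep the lowest objective.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(solve_from_start, range(n_runs)))
    return min(results)


best_objective, best_solution = multi_start(8)
print(best_objective, best_solution)
```

In a real setting each run would build and solve its own model, and only the (objective, solution) pairs would be compared at the end.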

Apply np.average in pandas pivot aggfunc

I am trying to calculate weighted average prices using a pandas pivot table.
I have tried passing a dictionary to aggfunc with the entry below.
It does not work, although I expected it to calculate the correct weighted average:
'Price': lambda x: np.average(x, weights=df['Balance'])
I have also tried using a manual groupby:
df.groupby('Product').agg({
    'Balance': sum,
    'Price': lambda x: np.average(x, weights='Balance'),
    'Value': sum
})
This also yields the error:
TypeError: Axis must be specified when shapes of a and weights differ.
Here is sample data
import pandas as pd
import numpy as np
price_dict = {
    'Product': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A',
                5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B',
                10: 'C', 11: 'C', 12: 'C', 13: 'C', 14: 'C'},
    'Balance': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50,
                5: 60, 6: 70, 7: 80, 8: 90, 9: 100,
                10: 110, 11: 120, 12: 130, 13: 140, 14: 150},
    'Price': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5,
              5: 6, 6: 7, 7: 8, 8: 9, 9: 10,
              10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
    'Value': {0: 10, 1: 40, 2: 90, 3: 160, 4: 250,
              5: 360, 6: 490, 7: 640, 8: 810, 9: 1000,
              10: 1210, 11: 1440, 12: 1690, 13: 1960, 14: 2250},
}
Passing a dict to aggfunc with np.mean gives only the unweighted average price:
df = pd.DataFrame(price_dict)
df.pivot_table(
    index='Product',
    aggfunc={
        'Balance': sum,
        'Price': np.mean,
        'Value': sum
    }
)
Output:
         Balance  Price  Value
Product
A            150      3    550
B            400      8   3300
C            650     13   8550
The expected output is:
         Balance  Price  Value
Product
A            150   3.66    550
B            400   8.25   3300
C            650  13.15   8550
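The function passed to agg receives only the Price values of each group, so it cannot see that group's Balance values. One way around this is groupby with apply, which passes each group's whole sub-DataFrame:
df.groupby('Product').apply(lambda x: pd.Series(
    {'Balance': x['Balance'].sum(),
     'Price': np.average(x['Price'], weights=x['Balance']),
     'Value': x['Value'].sum()}))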
Out[57]:
         Balance      Price   Value
Product
A          150.0   3.666667   550.0
B          400.0   8.250000  3300.0
C          650.0  13.153846  8550.0
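An alternative that keeps the dict-style aggregation is to look up the matching weights through the group's index. The sketch below rebuilds the sample data in compact form and uses a helper named `weighted_price` (a name chosen here for illustration); it assumes the DataFrame index is unique, since the lookup relies on `df.loc[x.index, 'Balance']`.

```python
import numpy as np
import pandas as pd

# Same values as the sample data above, built compactly.
df = pd.DataFrame({
    'Product': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
    'Balance': range(10, 160, 10),
    'Price': range(1, 16),
})
df['Value'] = df['Balance'] * df['Price']


def weighted_price(x):
    # x is the Price Series of one group; its index selects the
    # Balance rows that belong to the same group.
    return np.average(x, weights=df.loc[x.index, 'Balance'])


result = df.groupby('Product').agg({
    'Balance': 'sum',
    'Price': weighted_price,
    'Value': 'sum',
})
print(result)
```

Because the helper closes over `df`, it has to be applied to the same DataFrame whose index it looks up.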