CPLEX: lowest possible gap isn't necessarily 0.00%? - optimization

This is a follow-up question to this question: Interpretation of GAP in CPLEX
I used the following expression at the beginning of my optimization (minimization) problem:
execute gapTermination {
cplex.epgap = 0.00; // result at gap of 0%
}
This is a part of the engine log:
Nodes Cuts/
Node Left Objective IInf Best Integer Best Bound ItCnt Gap
0 0 560.7929 100 560.7929 115
0 0 742.1396 57 Cuts: 121 214
0 0 744.3119 61 Cuts: 10 226
0 0 747.2193 61 Cuts: 10 233
0 0 747.2797 61 MCF: 1 234
* 0+ 0 916.3811 747.2797 18.45%
0 2 747.2797 61 916.3811 747.2797 234 18.45%
Elapsed time = 0.13 sec. (49.77 ticks, tree = 0.00 MB, solutions = 1)
* 916 755 integral 0 778.9609 753.8931 7249 3.22%
* 4739 1918 integral 0 771.9166 759.5332 25884 1.60%
Cover cuts applied: 5
Implied bound cuts applied: 8
Flow cuts applied: 27
Mixed integer rounding cuts applied: 36
Multi commodity flow cuts applied: 1
Gomory fractional cuts applied: 22
Root node processing (before b&c):
Real time = 0.11 sec. (49.41 ticks)
Parallel b&c, 16 threads:
Real time = 0.38 sec. (202.30 ticks)
Sync time (average) = 0.07 sec.
Wait time (average) = 0.07 sec.
------------
Total (root+branch&cut) = 0.49 sec. (251.71 ticks)
As you can see, the "optimal" solution seems to have been found, but it still has a gap of 1.60%.
How should I interpret this? My thought is that I found the optimal integer solution (no integer solution left that is better), but the optimum of the relaxation achieves an even better objective value, namely 1.60% lower (minimization problem).
If my thought is correct, it would mean that a 0.00% gap can only be achieved if the optimum of the relaxed problem (usually non-integer) happens to coincide with an integer solution.
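For reference, this is the standard CPLEX definition rather than something stated in the log itself: the relative gap is |best integer - best bound| / |best integer| (plus a tiny tolerance in the denominator). The two gaps above then work out as:
(916.3811 - 747.2797) / 916.3811 ≈ 0.1845 -> 18.45%
(771.9166 - 759.5332) / 771.9166 ≈ 0.0160 -> 1.60%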
I'd really appreciate if someone could help me out here. Thanks in advance.

Related

Sklearn only predicts one class while dataset is fairly balanced (±80/20 split)

I am trying to come up with a way to determine the most influential factors behind a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label-encoded the data to make it look like this:
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0 0
After that I removed NaNs and cleaned it up some more (removing capitalization, punctuation, etc.).
After that, I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
y = df['DefaultInd']
X = df.drop(['CustomerID','DefaultInd'],axis=1)
X = X.astype(float)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
precision recall f1-score support
0 0.83 1.00 0.91 1073
1 0.00 0.00 0.00 213
accuracy 0.83 1286
macro avg 0.42 0.50 0.45 1286
weighted avg 0.70 0.83 0.76 1286
As you can see, the "1" class does not get predicted a single time. I am wondering whether this behaviour is to be expected (I think it is not). I tried to use class_weight='balanced' (see the sketch below), but that resulted in an average f1-score of 0.59 (instead of 0.76).
I feel like I am missing something. Or is this kind of behaviour expected, and should I rebalance the dataset before fitting? The split is not that skewed (±80/20), so I would not expect this big of a problem.
Any help would be more than appreciated :)
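For clarity, the class_weight attempt mentioned above looks roughly like this (a minimal sketch restating that attempt; the predict_proba check at the end is just an extra diagnostic, not part of the original run):
from sklearn.linear_model import LogisticRegression

# 'balanced' reweights the classes inversely to their frequency (~80/20 here);
# max_iter raised as a precaution against non-convergence warnings
logreg_bal = LogisticRegression(class_weight='balanced', max_iter=1000)
logreg_bal.fit(X_train, y_train)
print(classification_report(y_test, logreg_bal.predict(X_test)))

# How close does the unweighted model get to ever predicting class 1?
proba_1 = logreg.predict_proba(X_test)[:, 1]
print(proba_1.max(), (proba_1 > 0.5).sum())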

nonlinear ODEs optimization with leastsq

[UPDATED] I'm working on optimizing a system of nonlinear ODEs and fitting it to experimental data. I have a system of 5 model ODEs that must be fitted with 17 parameters. My approach is to calculate the differences between the solved ODEs and the experimental data (function Differences), then use the leastsq solver to minimize the differences and find the optimal parameters, as in the code below:
//RHSs of ODEs to be fitted:
function dx=model3(t,x,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H)
X=x(1);
S=x(2);
A=x(3);
DO=x(4);
V=x(5);
qs=((q_Smax*S/(S+Ks))*Kia/(Kia+A));
qsof=(p_Amax*qs/(qs+Kap));
qsox=(qs-qsof)*DO/(DO+Ko);
qsa=(q_Amax*A/(A+Ksa))*(Kis/(qs+Kis));
pa=qsof*Yas;
qa=pa-qsa;
qo=(qsox-qm)*Yos+qsa*Yoa;
u=(qsox-qm)*Yem+qsof*Yxsof+qsa*Yxa;
dx(1)=u*X-F*X/V;
dx(2)=(F*(Sf-S)/V)-qs*X;
dx(3)=qsa*X-(F*A/V);
dx(4)=200*(100-DO)-qo*X*H;
dx(5)=F;
endfunction
//experimental data:
//Dat=fscanfMat('dane_exper_III_etap.txt');
Dat = [
0 30 1.4 24.1 99 6884.754
1 35 0.2 23.2 89 6959.754
2 40 0.1 21.6 80 7034.754
3 52 0.1 19.5 67 7109.754
4 61 0.1 18.7 70 7184.754
5 66 0.1 16.4 79 7259.754
6 71 0.1 15 94 7334.754
7 74 0 14.3 100 7409.754
8 76 0 13.8 100 7484.754
9 78 0 13.4 100 7559.754
9.5 79 0 13.2 100 7597.254
10 79 0 13.5 100 7634.754]
t=Dat(:,1);
x_exp(:,1)=Dat(:,2);
x_exp(:,2)=Dat(:,3);
x_exp(:,3)=Dat(:,4);
x_exp(:,4)=Dat(:,5);
x_exp(:,5)=Dat(:,6);
global MYDATA;
MYDATA.t=t;
MYDATA.x_exp=x_exp;
MYDATA.funeval=0;
//calculating differences between calculated values and experimental data:
function f=Differences(k)
global MYDATA
t=MYDATA.t;
x_exp=MYDATA.x_exp;
Kap=k(1); //g/L
Ksa=k(2); //g/L
Ko=k(3); //g/L
Ks=k(4); //g/L
Kia=k(5); //g/L
Kis=k(6); //g/L
p_Amax=k(7); //g/(g*h)
q_Amax=k(8); //g/(g*h)
qm=k(9);
q_Smax=k(10);
Yas=k(11); //g/g
Yoa=k(12);
Yxa=k(13);
Yem=k(14);
Yos=k(15);
Yxsof=k(16);
H=k(17);
x0=x_exp(1,:);
t0=0;
F=75;
Sf=500;
%ODEOPTIONS=[1,0,0,%inf,0,2,10000,12,5,0,-1,-1]
x_calc=ode('rk',x0',t0,t,list(model3,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H));
diffmat=x_calc'-x_exp;
//column vector of differences (concatenates 4 columns of the difference matrix)
f=diffmat(:);
MYDATA.funeval=MYDATA.funeval+1;
endfunction
// Initial guess
Kap=0.3; //g/L
Ksa=0.05; //g/L
Ko=0.1; //g/L
Ks=0.5; //g/L
Kia=0.5; //g/L
Kis=0.05; //g/L
p_Amax=0.4; //g/(g*h)
q_Amax=0.8; //g/(g*h)
qm=0.2;
q_Smax=0.6;
Yas=0.5; //g/g
Yoa=0.5;
Yxa=0.5;
Yem=0.5;
Yos=1.5;
Yxsof=0.22;
H=1000;
y0=[Kap;Ksa;Ko;Ks;Kia;Kis;p_Amax;q_Amax;qm;q_Smax;Yas;Yoa;Yxa;Yem;Yos;Yxsof;H];
yinf=[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100];
ysup=[%inf,%inf,%inf,%inf,%inf,%inf,3,3,3,3,3,3,3,3,3,3,10000];
[fopt,xopt,gopt]=leastsq(Differences,'b',yinf,ysup,y0);
Now result is:
0.2994018
0.0508325
0.0999987
0.4994088
0.5081272
0.
0.4004560
0.7050746
0.2774195
0.6068328
0.5
0.4926150
0.4053860
0.5255006
1.5018725
0.2193901
1000.0000
33591.642
Running this script causes such an error:
lsoda-- caution... t (=r1) and h (=r2) are
such that t + h = t at next step
(h = pas). integration continues
where r1 is : 0.5658105345269D+01 and r2 : 0.1884898700920D-17
lsoda-- previous message precedent given i1 times
will no more be repeated
where i1 is : 10
lsoda-- at t (=r1), mxstep (=i1) steps
needed before reaching tout
where i1 is : 500000
where r1 is : 0.5658105345270D+01
Excessive work done on this call (perhaps wrong jacobian type).
at line 27 of function Differences
I understand that the problem is in the ODE-solving step. I have tried changing mxstep, as well as the solver type to 'adams', 'rk', and 'stiff'; none of this solved the problem. Using the 'fix' method in ode I get this error:
ode: rksimp exit with state 3.
Please advise how to solve this?
P.S. Experimental data in file 'dane_exper_III_etap.txt':
0 30 1.4 24.1 99 6884.754
1 35 0.2 23.2 89 6959.754
2 40 0.1 21.6 80 7034.754
3 52 0.1 19.5 67 7109.754
4 61 0.1 18.7 70 7184.754
5 66 0.1 16.4 79 7259.754
6 71 0.1 15 94 7334.754
7 74 0 14.3 100 7409.754
8 76 0 13.8 100 7484.754
9 78 0 13.4 100 7559.754
9.5 79 0 13.2 100 7597.254
10 79 0 13.5 100 7634.754
In Scilab, leastsq (based on optim) is quite weak and does not have global convergence properties, unlike ipopt, which is available as an ATOMS module. Install it like this:
--> atomsInstall sci_ipopt
I have modified your script in the following ways:
keep the classical choice of ODE solver for this kind of biological kinetics, i.e. "stiff" (which uses a BDF method). The Runge-Kutta solver you were using is quite weak, as it is an explicit method suited only to gentle (non-stiff) ODEs.
use ipopt instead of leastsq.
wrap the computation of the residual in a try/catch/end block in order to catch failing calls to the ode solver.
use some weights for the residual; you should play with them in order to improve the fit.
use a strictly positive lower bound instead of 0, as very low values of some parameters make the ode solver fail.
add a drawing callback that also saves the current value of the parameters, in case you stop the optimization with Ctrl-C.
//RHSs of ODEs to be fitted:
function dx=model3(t,x,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H)
X=x(1);
S=x(2);
A=x(3);
DO=x(4);
V=x(5);
qs=((q_Smax*S/(S+Ks))*Kia/(Kia+A));
qsof=(p_Amax*qs/(qs+Kap));
qsox=(qs-qsof)*DO/(DO+Ko);
qsa=(q_Amax*A/(A+Ksa))*(Kis/(qs+Kis));
pa=qsof*Yas;
qa=pa-qsa;
qo=(qsox-qm)*Yos+qsa*Yoa;
u=(qsox-qm)*Yem+qsof*Yxsof+qsa*Yxa;
dx(1)=u*X-F*X/V;
dx(2)=(F*(Sf-S)/V)-qs*X;
dx(3)=qsa*X-(F*A/V);
dx(4)=200*(100-DO)-qo*X*H;
dx(5)=F;
endfunction
//calculating differences between calculated values and experimental data:
function [f,x_calc]=Differences(k, t, x_exp)
global MYDATA // needed so that the MYDATA.funeval update below affects the global counter
Kap=k(1); //g/L
Ksa=k(2); //g/L
Ko=k(3); //g/L
Ks=k(4); //g/L
Kia=k(5); //g/L
Kis=k(6); //g/L
p_Amax=k(7); //g/(g*h)
q_Amax=k(8); //g/(g*h)
qm=k(9);
q_Smax=k(10);
Yas=k(11); //g/g
Yoa=k(12);
Yxa=k(13);
Yem=k(14);
Yos=k(15);
Yxsof=k(16);
H=k(17);
x0=x_exp(1,:);
t0=0;
F=75;
Sf=500;
[x_calc]=ode("stiff",x0',t0,t,list(model3,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H));
diffmat=(x_calc'-x_exp)*residual_weight;
//column vector of differences (concatenates 4 columns of the difference matrix)
f=diffmat(:);
MYDATA.funeval=MYDATA.funeval+1;
endfunction
function [f,g]=normdiff2(k,new_k,t,x_exp)
try
res = Differences(k,t,x_exp)
if argn(1) == 2
JacRes = numderivative(list(Differences,t,x_exp),k)
g = 2*JacRes'*res;
end
f = sum(res.*res)
catch
f=%inf;
g=%inf*ones(k);
end
endfunction
function out=callback(param)
global MYDATA
if isfield(param,"x")
k = param.x;
MYDATA.k = k;
[f,x_calc]=Differences(k,t,x_exp)
plot_weight = diag(1./max(x_exp,'r'));
drawlater
clf
plot(t,x_exp*plot_weight,'-o')
plot(t,x_calc'*plot_weight,'-x')
legend X S A DO V
drawnow
end
out = %t;
endfunction
//experimental data:
//Dat=fscanfMat('dane_exper_III_etap.txt');
Dat = [
0 30 1.4 24.1 99 6884.754
1 35 0.2 23.2 89 6959.754
2 40 0.1 21.6 80 7034.754
3 52 0.1 19.5 67 7109.754
4 61 0.1 18.7 70 7184.754
5 66 0.1 16.4 79 7259.754
6 71 0.1 15 94 7334.754
7 74 0 14.3 100 7409.754
8 76 0 13.8 100 7484.754
9 78 0 13.4 100 7559.754
9.5 79 0 13.2 100 7597.254
10 79 0 13.5 100 7634.754]
t=Dat(:,1);
x_exp(:,1)=Dat(:,2);
x_exp(:,2)=Dat(:,3);
x_exp(:,3)=Dat(:,4);
x_exp(:,4)=Dat(:,5);
x_exp(:,5)=Dat(:,6);
global MYDATA;
MYDATA.funeval=0;
// Initial guess
Kap=0.3; //g/L
Ksa=0.05; //g/L
Ko=0.1; //g/L
Ks=0.5; //g/L
Kia=0.5; //g/L
Kis=0.05; //g/L
p_Amax=0.4; //g/(g*h)
q_Amax=0.8; //g/(g*h)
qm=0.2;
q_Smax=0.6;
Yas=0.5; //g/g
Yoa=0.5;
Yxa=0.5;
Yem=0.5;
Yos=1.5;
Yxsof=0.22;
H=100;
k0 = [Kap;Ksa;Ko;Ks;Kia;Kis;p_Amax;q_Amax;qm;q_Smax;Yas;Yoa;Yxa;Yem;Yos;Yxsof;H];
residual_weight = diag(1./[79,1.4, 24.1, 100, 7634.754]);
BIG = 1000;
SMALL = 1e-3;
problem = struct();
problem.f = list(normdiff2,t,x_exp);
problem.x0 = k0;
problem.x_lower = [SMALL*ones(16,1);100];
problem.x_upper = [BIG,BIG,BIG,BIG,BIG,BIG,3,3,3,3,3,3,3,3,3,3,10000]';
problem.int_cb = callback;
problem.params = struct("max_iter",200);
//
k = ipopt(problem)
Here is a plot of the results after 100 iterations (you can change this value in the ipopt options). However, don't expect normal termination, as:
it is almost certain that your parameter set is not identifiable;
the finite-difference gradient with ODEs is very inaccurate.
Hope it will help you a little bit...

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?
Something you could do is use the skiprows parameter in read_csv, which accepts a list-like argument of row indices to discard (and thus, implicitly, which rows to keep). So you could create an np.arange with a length equal to the number of rows to read, and remove every 500th element from it using np.delete, so that only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = int(9.5e6)
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter in read_csv:
import numpy as np
import pandas as pd

# https://stackoverflow.com/q/845058
def file_len(fname):
with open(fname) as f:
for i, l in enumerate(f):
pass
return i + 1
len_of_file = file_len('test.csv')
print (len_of_file)
skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0,len_of_file,500))
print (skipped)
df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, sort it, select every 500th index value, take the set difference with all row indices, and pass that again to skiprows:
def file_len(fname):
with open(fname) as f:
for i, l in enumerate(f):
pass
return i + 1
len_of_file = file_len('test.csv')
df1 = pd.read_csv('test.csv',
usecols=['tpep_pickup_datetime'],
parse_dates=['tpep_pickup_datetime'])
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
.iloc[np.arange(0,len_of_file,500)].index)
skipped = np.setdiff1d(np.arange(len_of_file), sorted_idx)
print (skipped)
df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
You can also use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
This skips every row whose index is not a multiple of N, i.e. it keeps only every Nth row (plus the header at index 0).
source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
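For example, a minimal sketch (the file name my_file.csv is just a placeholder):
import pandas as pd

N = 500
# rows whose index is not a multiple of N are skipped;
# index 0 (the header) and rows N, 2N, ... are kept
df = pd.read_csv('my_file.csv', skiprows=lambda i: i % N != 0)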
You can use the csv module, which returns an iterator, together with itertools.cycle to select every nth row.
import csv
from itertools import cycle
source_file='D:/a.txt'
cycle_size=500
chooser = (x == 0 for x in cycle(range(cycle_size)))
with open(source_file) as f1:
rdr = csv.reader(f1)
data = [row for pick, row in zip(chooser, rdr) if pick]
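If a DataFrame is needed afterwards, one way (assuming the file's first line is a header row, which is also the first row the generator picks up) would be:
import pandas as pd

df = pd.DataFrame(data[1:], columns=data[0])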

Predictive Maintenance - How to use Bayesian Optimization with objective function and Logistic Regression with Gradient Descent together?

I'm trying to reproduce the problem shown in arimo.com
This is an example of how to build a preventive-maintenance Machine Learning model for Hard Drive failures. The section I really don't understand is how to use Bayesian Optimization with a custom objective function and Logistic Regression with Gradient Descent together. What are the hyper-parameters to be optimized? What is the flow of the problem?
As described in our previous post, Bayesian Optimization [6] is used to find the best hyperparameter values. The objective function to be optimized in the hyperparameter tuning is the following score, measured on the validation set:
S = alpha * fnr + (1 - alpha) * fpr
where fpr and fnr are the False Positive and False Negative rates obtained on the validation set. Our goal is to keep the False Positive rate low, therefore we use alpha = 0.2. Since the validation set is highly unbalanced, we found out that standard scores like Precision, F1-score, etc. do not work well. In fact, using this custom score is crucial for the model to obtain a good performance generally.
Note that we only use the above score when running Bayesian Optimization. To train logistic regression models, we use Gradient Descent with the usual ridge loss function.
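In code, that score could be computed on the validation set roughly like this (a minimal sketch, not taken from the arimo post; it assumes binary 0/1 labels and uses sklearn's confusion_matrix):
from sklearn.metrics import confusion_matrix

def custom_score(y_true, y_pred, alpha=0.2):
    # fnr = FN / (FN + TP), fpr = FP / (FP + TN)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fnr = fn / (fn + tp)
    fpr = fp / (fp + tn)
    return alpha * fnr + (1 - alpha) * fpr
With alpha = 0.2, false positives are weighted four times as heavily as false negatives, which matches the stated goal of keeping the False Positive rate low.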
My dataframe before features selection:
index date serial_number model capacity_bytes failure Read Error Rate Reallocated Sectors Count Power-On Hours (POH) Temperature Current Pending Sector Count age yesterday_temperature yesterday_age yesterday_reallocated_sectors_count yesterday_read_error_rate yesterday_current_pending_sector_count yesterday_power_on_hours tomorrow_failure
0 77947 2013-04-11 MJ0331YNG69A0A Hitachi HDS5C3030ALA630 3000592982016 0 0 0 4909 29 0 36348284.0 29.0 20799895.0 0.0 0.0 0.0 4885.0 0.0
1 79327 2013-04-11 MJ1311YNG7EWXA Hitachi HDS5C3030ALA630 3000592982016 0 0 0 8831 24 0 36829839.0 24.0 21280074.0 0.0 0.0 0.0 8807.0 0.0
2 79592 2013-04-11 MJ1311YNG2ZD9A Hitachi HDS5C3030ALA630 3000592982016 0 0 0 13732 26 0 36924206.0 26.0 21374176.0 0.0 0.0 0.0 13708.0 0.0
3 80715 2013-04-11 MJ1311YNG2ZDBA Hitachi HDS5C3030ALA630 3000592982016 0 0 0 12745 27 0 37313742.0 27.0 21762591.0 0.0 0.0 0.0 12721.0 0.0
4 79958 2013-04-11 MJ1323YNG1EK0C Hitachi HDS5C3030ALA630 3000592982016 0 524289 0 13922 27 0 37050016.0 27.0 21499620.0 0.0 0.0 0.0 13898.0 0.0

Rounding off a list of numbers to a user-defined step while preserving their sum

I've been reading a lot of posts about rounding off numbers, but I couldn't manage to do what I want:
I have got a list of positive floats.
The unsigned integer roundOffStep to use is user-defined. I have no control over it.
I want to be able to do the most accurate rounding while preserving the sum of those numbers, or at least while keeping the new sum below the original sum.
How would I do that? I am terrible with algorithms, so this is way too tricky for me.
Thx.
EDIT : Adding a Test case :
FLOATS
29.20
18.25
14.60
8.76
2.19
sum = 73;
Let's say roundOffStep = 5;
ROUNDED FLOATS
30
15
15
10
0
sum = 70 < 73 OK
1. Round all numbers to the nearest multiple of roundOffStep normally.
2. If the new sum is lower than the original sum, you're done.
3. For each number, calculate rounded_number - original_number. Sort this list of differences in decreasing order so that you can find the numbers with the largest difference.
4. Pick the number that gives the largest difference rounded_number - original_number, and subtract roundOffStep from that number.
5. Repeat step 4 (picking the next largest difference each time) until the new sum is less than the original.
This process should ensure that the rounded numbers are as close as possible to the originals, without going over the original sum.
Example, with roundOffStep = 5:
Original Numbers | Rounded | Difference
----------------------+------------+--------------
29.20 | 30 | 0.80
18.25 | 20 | 1.75
14.60 | 15 | 0.40
8.76 | 10 | 1.24
2.19 | 0 | -2.19
----------------------+------------+--------------
Sum: 73 | 75 |
The sum is too large, so we pick the number giving the largest difference (18.25 which was rounded to 20) and subtract 5 to give 15. Now the sum is 70, so we're done.
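Putting the steps together, a minimal Python sketch (it recomputes the differences after each subtraction instead of sorting them once, which amounts to the same greedy choice):
def round_preserving_sum(values, step):
    # Step 1: round each value to the nearest multiple of step
    rounded = [round(v / step) * step for v in values]
    # Steps 2-5: while the rounded sum exceeds the original sum,
    # lower the entry whose rounding increased the total the most
    while sum(rounded) > sum(values):
        i = max(range(len(values)), key=lambda j: rounded[j] - values[j])
        rounded[i] -= step
    return rounded

print(round_preserving_sum([29.20, 18.25, 14.60, 8.76, 2.19], 5))
# [30, 15, 15, 10, 0] -> sum 70 <= 73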