I am trying to identify the most influential factors behind a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label encoded the data to make it look like this:
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0
After that, I removed NaNs and cleaned the data up some more (removing capitalization, punctuation, etc.).
After that, I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
y = df['DefaultInd']
X = df.drop(['CustomerID','DefaultInd'],axis=1)
X = X.astype(float)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
precision recall f1-score support
0 0.83 1.00 0.91 1073
1 0.00 0.00 0.00 213
accuracy 0.83 1286
macro avg 0.42 0.50 0.45 1286
weighted avg 0.70 0.83 0.76 1286
As you can see, the "1" class is never predicted, not even once. I am wondering whether this behaviour is to be expected (I think it is not). I tried using class_weight='balanced', but that resulted in a weighted-average f1 score of 0.59 (instead of 0.76).
I feel like I am missing something. Or is this kind of behaviour expected, and should I rebalance the dataset before fitting? The split (±80/20) does not seem that skewed to me, so I would not expect this big of a problem.
Any help would be more than appreciated :)
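For reference, here is a minimal, library-free sketch of what class_weight='balanced' computes internally; the class counts below only mirror the ~83.5/16.5 split from the question, and the function name is my own:

```python
# class_weight='balanced' assigns each class i the weight
# n_samples / (n_classes * count_i), so errors on the minority
# class cost proportionally more during fitting.
from collections import Counter

def balanced_weights(y):
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Roughly the 83.5/16.5 split from the question:
y = [0] * 835 + [1] * 165
print(balanced_weights(y))  # class 1 weighs ~3.03, class 0 ~0.60
```

With these weights the model trades some accuracy on the majority class for recall on the minority class, which is why the weighted-average f1 can drop even though the minority class is finally being predicted.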
[UPDATED] I'm working on fitting a nonlinear ODE system to experimental data. I have a system of 5 model ODEs which must be optimized over 17 parameters. My approach is to compute the differences between the solved ODEs and the experimental data (function Differences), then use the leastsq solver to minimize those differences and find the optimal parameters, as in the code below:
//RHSs of ODEs to be fitted:
function dx=model3(t,x,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H)
X=x(1);
S=x(2);
A=x(3);
DO=x(4);
V=x(5);
qs=((q_Smax*S/(S+Ks))*Kia/(Kia+A));
qsof=(p_Amax*qs/(qs+Kap));
qsox=(qs-qsof)*DO/(DO+Ko);
qsa=(q_Amax*A/(A+Ksa))*(Kis/(qs+Kis));
pa=qsof*Yas;
qa=pa-qsa;
qo=(qsox-qm)*Yos+qsa*Yoa;
u=(qsox-qm)*Yem+qsof*Yxsof+qsa*Yxa;
dx(1)=u*X-F*X/V;
dx(2)=(F*(Sf-S)/V)-qs*X;
dx(3)=qsa*X-(F*A/V);
dx(4)=200*(100-DO)-qo*X*H;
dx(5)=F;
endfunction
//experimental data:
//Dat=fscanfMat('dane_exper_III_etap.txt');
Dat = [
0 30 1.4 24.1 99 6884.754
1 35 0.2 23.2 89 6959.754
2 40 0.1 21.6 80 7034.754
3 52 0.1 19.5 67 7109.754
4 61 0.1 18.7 70 7184.754
5 66 0.1 16.4 79 7259.754
6 71 0.1 15 94 7334.754
7 74 0 14.3 100 7409.754
8 76 0 13.8 100 7484.754
9 78 0 13.4 100 7559.754
9.5 79 0 13.2 100 7597.254
10 79 0 13.5 100 7634.754]
t=Dat(:,1);
x_exp(:,1)=Dat(:,2);
x_exp(:,2)=Dat(:,3);
x_exp(:,3)=Dat(:,4);
x_exp(:,4)=Dat(:,5);
x_exp(:,5)=Dat(:,6);
global MYDATA;
MYDATA.t=t;
MYDATA.x_exp=x_exp;
MYDATA.funeval=0;
//calculating differences between calculated values and experimental data:
function f=Differences(k)
global MYDATA
t=MYDATA.t;
x_exp=MYDATA.x_exp;
Kap=k(1); //g/L
Ksa=k(2); //g/L
Ko=k(3); //g/L
Ks=k(4); //g/L
Kia=k(5); //g/L
Kis=k(6); //g/L
p_Amax=k(7); //g/(g*h)
q_Amax=k(8); //g/(g*h)
qm=k(9);
q_Smax=k(10);
Yas=k(11); //g/g
Yoa=k(12);
Yxa=k(13);
Yem=k(14);
Yos=k(15);
Yxsof=k(16);
H=k(17);
x0=x_exp(1,:);
t0=0;
F=75;
Sf=500;
%ODEOPTIONS=[1,0,0,%inf,0,2,10000,12,5,0,-1,-1]
x_calc=ode('rk',x0',t0,t,list(model3,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H));
diffmat=x_calc'-x_exp;
//column vector of differences (concatenates 4 columns of the difference matrix)
f=diffmat(:);
MYDATA.funeval=MYDATA.funeval+1;
endfunction
// Initial guess
Kap=0.3; //g/L
Ksa=0.05; //g/L
Ko=0.1; //g/L
Ks=0.5; //g/L
Kia=0.5; //g/L
Kis=0.05; //g/L
p_Amax=0.4; //g/(g*h)
q_Amax=0.8; //g/(g*h)
qm=0.2;
q_Smax=0.6;
Yas=0.5; //g/g
Yoa=0.5;
Yxa=0.5;
Yem=0.5;
Yos=1.5;
Yxsof=0.22;
H=1000;
y0=[Kap;Ksa;Ko;Ks;Kia;Kis;p_Amax;q_Amax;qm;q_Smax;Yas;Yoa;Yxa;Yem;Yos;Yxsof;H];
yinf=[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100];
ysup=[%inf,%inf,%inf,%inf,%inf,%inf,3,3,3,3,3,3,3,3,3,3,10000];
[fopt,xopt,gopt]=leastsq(Differences,'b',yinf,ysup,y0);
Now result is:
0.2994018
0.0508325
0.0999987
0.4994088
0.5081272
0.
0.4004560
0.7050746
0.2774195
0.6068328
0.5
0.4926150
0.4053860
0.5255006
1.5018725
0.2193901
1000.0000
33591.642
Running this script produces the following error:
lsoda-- caution... t (=r1) and h (=r2) are
such that t + h = t at next step
(h = pas). integration continues
where r1 is : 0.5658105345269D+01 and r2 : 0.1884898700920D-17
lsoda-- previous message precedent given i1 times
will no more be repeated
where i1 is : 10
lsoda-- at t (=r1), mxstep (=i1) steps
needed before reaching tout
where i1 is : 500000
where r1 is : 0.5658105345270D+01
Excessive work done on this call (perhaps wrong jacobian type).
at line 27 of function Differences
I understand that the problem is at the ODE-solving step. Thus, I have tried changing mxstep, as well as the solver type to 'adams', 'rk', and 'stiff'; none of this solved the problem. Using the 'fix' method in ode I get this error:
ode: rksimp exit with state 3.
Please advise how to solve this?
P.S. Experimental data in file 'dane_exper_III_etap.txt':
0 30 1.4 24.1 99 6884.754
1 35 0.2 23.2 89 6959.754
2 40 0.1 21.6 80 7034.754
3 52 0.1 19.5 67 7109.754
4 61 0.1 18.7 70 7184.754
5 66 0.1 16.4 79 7259.754
6 71 0.1 15 94 7334.754
7 74 0 14.3 100 7409.754
8 76 0 13.8 100 7484.754
9 78 0 13.4 100 7559.754
9.5 79 0 13.2 100 7597.254
10 79 0 13.5 100 7634.754
In Scilab, leastsq (which is based on optim) is rather weak and has no global convergence properties, unlike ipopt, which is available as an ATOMS module. Install it like this:
--> atomsInstall("sci_ipopt")
I have modified your script in the following way:
- keep the classical choice of ode solver for this kind of biological kinetics, i.e. "stiff" (which uses the BDF method). The Runge-Kutta solver you were using is a poor fit, as it is an explicit method suited only to gentle, non-stiff ODEs.
- use ipopt instead of leastsq.
- use a try/catch/end block around the computation of the residual, in order to catch failing calls to the ode solver.
- use some weights for the residual. You should play with them in order to improve the fit.
- use a strictly positive lower bound instead of 0, as very low values of some parameters make the ode solver fail.
- add a drawing callback that also saves the current value of the parameters, in case you stop the optimization with Ctrl-C.
//RHSs of ODEs to be fitted:
function dx=model3(t,x,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H)
X=x(1);
S=x(2);
A=x(3);
DO=x(4);
V=x(5);
qs=((q_Smax*S/(S+Ks))*Kia/(Kia+A));
qsof=(p_Amax*qs/(qs+Kap));
qsox=(qs-qsof)*DO/(DO+Ko);
qsa=(q_Amax*A/(A+Ksa))*(Kis/(qs+Kis));
pa=qsof*Yas;
qa=pa-qsa;
qo=(qsox-qm)*Yos+qsa*Yoa;
u=(qsox-qm)*Yem+qsof*Yxsof+qsa*Yxa;
dx(1)=u*X-F*X/V;
dx(2)=(F*(Sf-S)/V)-qs*X;
dx(3)=qsa*X-(F*A/V);
dx(4)=200*(100-DO)-qo*X*H;
dx(5)=F;
endfunction
//calculating differences between calculated values and experimental data:
function [f,x_calc]=Differences(k, t, x_exp)
Kap=k(1); //g/L
Ksa=k(2); //g/L
Ko=k(3); //g/L
Ks=k(4); //g/L
Kia=k(5); //g/L
Kis=k(6); //g/L
p_Amax=k(7); //g/(g*h)
q_Amax=k(8); //g/(g*h)
qm=k(9);
q_Smax=k(10);
Yas=k(11); //g/g
Yoa=k(12);
Yxa=k(13);
Yem=k(14);
Yos=k(15);
Yxsof=k(16);
H=k(17);
x0=x_exp(1,:);
t0=0;
F=75;
Sf=500;
[x_calc]=ode("stiff",x0',t0,t,list(model3,Kap,Ksa,Ko,Ks,Kia,Kis,p_Amax,q_Amax,qm,q_Smax,Yas,Yoa,Yxa,Yem,Yos,Yxsof,H));
diffmat=(x_calc'-x_exp)*residual_weight;
//column vector of differences (concatenates 4 columns of the difference matrix)
f=diffmat(:);
MYDATA.funeval=MYDATA.funeval+1;
endfunction
function [f,g]=normdiff2(k,new_k,t,x_exp)
try
res = Differences(k,t,x_exp)
if argn(1) == 2
JacRes = numderivative(list(Differences,t,x_exp),k)
g = 2*JacRes'*res;
end
f = sum(res.*res)
catch
f=%inf;
g=%inf*ones(k);
end
endfunction
function out=callback(param)
global MYDATA
if isfield(param,"x")
k = param.x;
MYDATA.k = k;
[f,x_calc]=Differences(k,t,x_exp)
plot_weight = diag(1./max(x_exp,'r'));
drawlater
clf
plot(t,x_exp*plot_weight,'-o')
plot(t,x_calc'*plot_weight,'-x')
legend X S A DO V
drawnow
end
out = %t;
endfunction
//experimental data:
//Dat=fscanfMat('dane_exper_III_etap.txt');
Dat = [
0 30 1.4 24.1 99 6884.754
1 35 0.2 23.2 89 6959.754
2 40 0.1 21.6 80 7034.754
3 52 0.1 19.5 67 7109.754
4 61 0.1 18.7 70 7184.754
5 66 0.1 16.4 79 7259.754
6 71 0.1 15 94 7334.754
7 74 0 14.3 100 7409.754
8 76 0 13.8 100 7484.754
9 78 0 13.4 100 7559.754
9.5 79 0 13.2 100 7597.254
10 79 0 13.5 100 7634.754]
t=Dat(:,1);
x_exp(:,1)=Dat(:,2);
x_exp(:,2)=Dat(:,3);
x_exp(:,3)=Dat(:,4);
x_exp(:,4)=Dat(:,5);
x_exp(:,5)=Dat(:,6);
global MYDATA;
MYDATA.funeval=0;
// Initial guess
Kap=0.3; //g/L
Ksa=0.05; //g/L
Ko=0.1; //g/L
Ks=0.5; //g/L
Kia=0.5; //g/L
Kis=0.05; //g/L
p_Amax=0.4; //g/(g*h)
q_Amax=0.8; //g/(g*h)
qm=0.2;
q_Smax=0.6;
Yas=0.5; //g/g
Yoa=0.5;
Yxa=0.5;
Yem=0.5;
Yos=1.5;
Yxsof=0.22;
H=100;
k0 = [Kap;Ksa;Ko;Ks;Kia;Kis;p_Amax;q_Amax;qm;q_Smax;Yas;Yoa;Yxa;Yem;Yos;Yxsof;H];
residual_weight = diag(1./[79,1.4, 24.1, 100, 7634.754]);
BIG = 1000;
SMALL = 1e-3;
problem = struct();
problem.f = list(normdiff2,t,x_exp);
problem.x0 = k0;
problem.x_lower = [SMALL*ones(16,1);100];
problem.x_upper = [BIG,BIG,BIG,BIG,BIG,BIG,3,3,3,3,3,3,3,3,3,3,10000]';
problem.int_cb = callback;
problem.params = struct("max_iter",200);
//
k = ipopt(problem)
Here is a plot of the results after 100 iterations (you can change this value in the ipopt options). However, don't expect normal termination, as:
- it is almost certain that your parameter set is not identifiable;
- finite-difference gradients with ODEs are very inaccurate.
Hope it will help you a little bit...
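For cross-checking outside Scilab, the same recipe (a stiff BDF solver plus a residual guarded against solver failures, fed to a bounded least-squares routine) can be sketched in Python with SciPy. The one-parameter decay ODE below is a toy stand-in, not the 5-state model from the question:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

t_obs = np.linspace(0, 5, 20)
x_obs = np.exp(-1.3 * t_obs)          # synthetic data, true rate k = 1.3

def residual(p):
    try:
        # BDF is an implicit method, appropriate for stiff kinetics
        sol = solve_ivp(lambda t, x: -p[0] * x, (0, 5), [1.0],
                        t_eval=t_obs, method='BDF')
        if not sol.success:
            return np.full_like(t_obs, 1e6)   # penalize failed solves
        return sol.y[0] - x_obs
    except Exception:
        return np.full_like(t_obs, 1e6)       # same guard as try/catch/end

# strictly positive lower bound, as in the Scilab version
fit = least_squares(residual, x0=[0.5], bounds=(1e-3, 10.0))
print(fit.x)  # should be close to [1.3]
```

The 1e6 penalty plays the role of the %inf returned in the Scilab catch branch; a finite penalty is friendlier to trust-region solvers.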
I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read everything first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using.pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is use the skiprows parameter of read_csv, which accepts a list-like argument of row indices to discard (and thus, by complement, to select). So you could create an np.arange with length equal to the number of rows, and remove every 500th element from it using np.delete; that way only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = int(9.5e6)
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))  # un-skip rows 0, 500, 1000, ...
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using.pd.read_csv() or some other method), without having to read first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:
#https://stackoverflow.com/q/845058
import numpy as np
import pandas as pd

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
len_of_file = file_len('test.csv')
print (len_of_file)
skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0,len_of_file,500))
print (skipped)
df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, then sort it, select every 500th index value, take the set difference, and pass that again to skiprows:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
len_of_file = file_len('test.csv')
df1 = pd.read_csv('test.csv',
usecols=['tpep_pickup_datetime'],
parse_dates=['tpep_pickup_datetime'])
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
              .iloc[np.arange(0, len(df1), 500)].index)
# +1 shifts frame indices to file line numbers (line 0 is the header)
skipped = np.setdiff1d(np.arange(1, len_of_file), sorted_idx + 1)
print (skipped)
df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
to keep only every Nth row (the lambda is truthy, and the row is therefore skipped, whenever i is not a multiple of N).
source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
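A quick sanity check of the lambda approach on an in-memory toy CSV (note that the header is line 0, and 0 % N == 0 is falsy, so the header line is conveniently kept):

```python
import io
import pandas as pd

# 10 data lines; with N=5 the lambda keeps lines 0 (header), 5 and 10
buf = io.StringIO("a,b\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 11)))
df = pd.read_csv(buf, skiprows=lambda i: i % 5)
print(df.to_dict('list'))  # {'a': [5, 10], 'b': [50, 100]}
```

Unlike the array-based approaches, this never needs the file length up front.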
You can use the csv module, which returns an iterator, together with itertools.cycle to select every nth row.
import csv
from itertools import cycle
source_file='D:/a.txt'
cycle_size=500
chooser = (x == 0 for x in cycle(range(cycle_size)))
with open(source_file) as f1:
rdr = csv.reader(f1)
data = [row for pick, row in zip(chooser, rdr) if pick]
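A variant of the same idea, arguably simpler: itertools.islice with a step argument picks every nth row straight off the csv reader, with no flag generator needed (the helper name is my own):

```python
import csv
from itertools import islice

def every_nth_row(path, n):
    """Yield rows 0, n, 2n, ... of a CSV file, reading lazily."""
    with open(path, newline='') as f:
        yield from islice(csv.reader(f), 0, None, n)
```

Like the cycle-based version, this streams the file and never materializes the skipped rows.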