Can anyone help meevaluate testing set data in Weka - testing

I got one training dataset and one testing dataset. I am using weka explorer, trying to create a model with Random forest (algorithm). After creating model when I use my testing set data to implement it by (supply test set/ re-evaluate on current dataset) tab, it showing some thing like that.
What am I doing wrong?
Training Model:
=== Evaluation on training set ===
Time taken to test model on training data: 0.24 seconds
=== Summary ===
Correctly Classified Instances 5243 98.9245 %
Incorrectly Classified Instances 57 1.0755 %
Kappa statistic 0.9439
Mean absolute error 0.0453
Root mean squared error 0.1137
Relative absolute error 23.2184 %
Root relative squared error 36.4074 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 59.3019 %
Total Number of Instances 5300
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.996 0.067 0.992 0.996 0.994 0.944 0.999 1.000 0
0.933 0.004 0.968 0.933 0.950 0.944 0.999 0.990 1
Weighted Avg. 0.989 0.060 0.989 0.989 0.989 0.944 0.999 0.999
=== Confusion Matrix ===
a b <-- classified as
4702 18 | a = 0
39 541 | b = 1
Model Implement on my testing dataset:
=== Evaluation on test set ===
Time taken to test model on supplied test set: 0.22 seconds
=== Summary ===
Total Number of Instances 0
Ignored Class Unknown Instances 4000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 0.000 0.000 0.000 0.000 ? ? 0
0.000 0.000 0.000 0.000 0.000 0.000 ? ? 1
Weighted Avg. NaN NaN NaN NaN NaN NaN NaN NaN
=== Confusion Matrix ===
a b <-- classified as
0 0 | a = 0
0 0 | b = 1

Your test data set does not appear to have labels.
You can only evaluate your prediction quality using labeled data.

Related

Negative binomial , Poisson-gamma mixture winbugs

Winbugs trap error
model
{
for (i in 1:5323) {
Y[i] ~ dpois(mu[i]) # NB model as a Poisson-gamma mixture
mu[i] ~ dgamma(b[i], a[i]) # NB model as a poisson-gamma mixture
a[i] <- b[i] / Emu[i]
b[i] <- B * X[i]
Emu[i] <- beta0 * pow(X[i], beta1) # model equation
}
# Priors
beta0 ~ dunif(0,10) # parameter
beta1 ~ dunif(0,10) # parameter
B ~ dunif(0,10) # over-dispersion parameter
}
X[] Y[]
1.5 0
2.9 0
1.49 0
0.39 0
3.89 0
2.03 0
0.91 0
0.89 0
0.97 0
2.16 0
0.04 0
1.12 1s
2.26 0
3.6 1
1.94 0
0.41 1
2 0
0.9 0
0.9 0
0.9 0
0.1 0
0.88 1
0.91 0
6.84 2
3.14 3
End ```
This is just a sample of the data, the model question is coming from Ezra Hauer 8.3.2, the art of regression of road safety, the model is providing an **error undefined real result. **
The aim of model is to fully Bayesian and a one step model and not use empirical bayes.
The results should be similar to MLE where beta0 is 1.65, beta1 0.871, overdispersion is 0.531
X is the only variable and y is actual collision,
So X cannot be zero or negative, while y cannot be lower than zero, if the model in solved as Poisson gamma mixture using maximum likelihood then it can be created
How can I make this model work
Solving an error in winbugs?
the data is in excel, the model worked fine when I selected the biggest 1000 observations only.

How can I group a continuous column (0-1) into equal sizes? Scala spark

I have a dataframe column that I want to split into equal size buckets. The values in this column are floats between 0-1. Most of the data is skewed, so most values fall in the 0.90's and 1.
Bucket 10: All 1's (the size of this bucket will be different from 2-9 and 1)
Bucket 2-9: Any values > 0 and < 1 (equal sized)
Bucket 1: All 0's (the size of this bucket will be different from 2-9 and 10)
Example:
continous_number_col
Bucket
0.001
2
0.95
9
1
10
0
1
This should be how it looks when I groupBy("Bucket")
Counts of bucket 1 and 10 aren't significant here, they will just be in their own bucket.
And the 75 count will be different, just using as an example.
Bucket
Count
Values
1
1000
0
2
75
0.01 - 0.50
3
75
0.51 - 0.63
4
75
0.64 - 0.71
5
75
0.72 - 0.83
6
75
0.84 - 0.89
7
75
0.90 - 0.92
8
75
0.93 - 0.95
9
75
0.95 - 0.99
10
2000
1
I've tried using the QuantileDiscretizer() Function as this:
val df = {
rawDf
//Taking 1's and 0's out for the moment
.filter(col("continuous_number_col") =!= 1 && col("continuous_number_col") =!= 0)
}
val discretizer = new QuantileDiscretizer()
.setInputCol("continuous_number_col")
.setOutputCol("bucket_result")
.setNumBuckets(8)
val result = discretizer.fit(df).transform(df)
However, this gives me the following, not equal buckets:
bucket_result
count
7.0
20845806
6.0
21096698
5.0
21538813
4.0
21222511
3.0
21193393
2.0
21413413
1.0
21032666
0.0
21681424
Hopefully this gives enough context to what I'm trying to do. Thanks in advance.

Sklearn only predicts one class while dataset is fairly balanced (±80/20 split)

I am trying to come up with a way to check what are the most influential factors of a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label encoded the data to make it look like this, :
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0
After that I have removed NaNs and cleaned it up some more (removing capitalizion, punctuation etc)
After that, I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
y = df['DefaultInd']
X = df.drop(['CustomerID','DefaultInd'],axis=1)
X = X.astype(float)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
precision recall f1-score support
0 0.83 1.00 0.91 1073
1 0.00 0.00 0.00 213
accuracy 0.83 1286
macro avg 0.42 0.50 0.45 1286
weighted avg 0.70 0.83 0.76 1286
As you can see, the "1" class does not get predicted 1 time, I am wondering whether or not this behaviour is to be expected (I think it is not). I tried to use class_weightd = ‘balanced’, but that resulted in an average f1 score of 0.59 (instead of 0.76)
I feel like I am missing something, or is this kind of behaviour expected and should I rebalance the dataset before fitting? I feel like the division is not that skewed (±80/20), there should not be this big of a problem.
Any help would be more than appreciated :)

Finding Similarities Between houses in pandas dataframe for content filtering

I want to apply content filtering for houses. I would like to find similarity score for each houses to recommend. What can I recommend for house one? So I need similarity matrix for houses. How can I find it?
Thank you
data = [['house1',100,1500,'gas','3+1']
,['house2',120,2000,'gas','2+1']
,['house3',40,1600,'electricity','1+1']
,['house4',110,1450,'electricity','2+1']
,['house5',140,1200,'electricity','2+1']
,['house6',90,1000,'gas','3+1']
,['house7',110,1475,'gas','3+1']
]
Create the pandas DataFrame
df = pd.DataFrame(data, columns =
['house','size','price','heating_type','room_count'])
If we define similarity in terms of absolute difference in case of numeric values and similarity ratio calculated by SequenceMatcher in case of strings (or more presicely 1 - ratio to make it comparable to differences), we can apply these operations to the respective columns and then normalize the result to the range of 0 ... 1 where 1 means (almost) equality and 0 means minimum similarity. Summing up the individual columns, we get the most similar house as the house with the maximum total similarity rating.
from difflib import SequenceMatcher
df = df.set_index('house')
res = pd.DataFrame(df[['size','price']].sub(df.loc['house1',['size','price']]).abs())
res['heating_type'] = df.heating_type.apply(lambda x: 1 - SequenceMatcher(None, df.heating_type[0], x).ratio())
res['room_count'] = df.room_count.apply(lambda x: 1 - SequenceMatcher(None, df.room_count[0], x).ratio())
res['total'] = res['size'] + res.price + res.heating_type + res.room_count
res = 1 - res / res.max()
print(res)
print('\nBest match of house1 is ' + res.total[1:].idxmax())
Result:
size price heating_type room_count total
house
house1 1.000000 1.00 1.0 1.0 1.000000
house2 0.666667 0.00 1.0 0.0 0.000000
house3 0.000000 0.80 0.0 0.0 0.689942
house4 0.833333 0.90 0.0 0.0 0.882127
house5 0.333333 0.40 0.0 0.0 0.344010
house6 0.833333 0.00 1.0 1.0 0.019859
house7 0.833333 0.95 1.0 1.0 0.932735
Best match of house1 is house7

Predictive Maintenance - How to use Bayesian Optimization with objective function and Logistic Regression with Gradient Descent together?

I'm trying to reproduce the problem shown in arimo.com
This is an example how to build a preventive maintenance Machine Learning model for an Hard Drive failures. The section I really don't understand is how to use Bayesian Optimization with a custom objective function and Logistic Regression with Gradient Descent together. What are the hyper-parameters to be optimized? What is the flow of the problem?
As described in our previous post, Bayesian Optimization [6] is used
to find the best hyperparameter values. The objective function to be
optimized in the hyperparameter tuning is the following score measured
on the validation set:
S = alpha * fnr + (1 – alpha) * fpr
where fpr and fnr are the False Positive and False Negative rates
obtained on the validation set. Our goal is to keep False Positive
rate low, therefore we use alpha = 0.2. Since the validation set is
highly unbalanced, we found out that standard scores like Precision,
F1-score, etc… do not work well. In fact, using this custom score is
crucial for the model to obtain a good performance generally.
Note that we only use the above score when running Bayesian
Optimization. To train logistic regression models, we use Gradient
Descent with the usual ridge loss function.
My dataframe before features selection:
index date serial_number model capacity_bytes failure Read Error Rate Reallocated Sectors Count Power-On Hours (POH) Temperature Current Pending Sector Count age yesterday_temperature yesterday_age yesterday_reallocated_sectors_count yesterday_read_error_rate yesterday_current_pending_sector_count yesterday_power_on_hours tomorrow_failure
0 77947 2013-04-11 MJ0331YNG69A0A Hitachi HDS5C3030ALA630 3000592982016 0 0 0 4909 29 0 36348284.0 29.0 20799895.0 0.0 0.0 0.0 4885.0 0.0
1 79327 2013-04-11 MJ1311YNG7EWXA Hitachi HDS5C3030ALA630 3000592982016 0 0 0 8831 24 0 36829839.0 24.0 21280074.0 0.0 0.0 0.0 8807.0 0.0
2 79592 2013-04-11 MJ1311YNG2ZD9A Hitachi HDS5C3030ALA630 3000592982016 0 0 0 13732 26 0 36924206.0 26.0 21374176.0 0.0 0.0 0.0 13708.0 0.0
3 80715 2013-04-11 MJ1311YNG2ZDBA Hitachi HDS5C3030ALA630 3000592982016 0 0 0 12745 27 0 37313742.0 27.0 21762591.0 0.0 0.0 0.0 12721.0 0.0
4 79958 2013-04-11 MJ1323YNG1EK0C Hitachi HDS5C3030ALA630 3000592982016 0 524289 0 13922 27 0 37050016.0 27.0 21499620.0 0.0 0.0 0.0 13898.0 0.0