How do I deal with class imbalance when using sparklyr with MLlib? - apache-spark-sql

I have a severe class imbalance: the positive response is about 3%, which in absolute volume is roughly 6,000 rows. I'm currently using sparklyr with MLlib algorithms. Some of the native Databricks MLlib algorithms take class weights as a parameter. Is that available in sparklyr? I'm currently using ml_random_forest_classifier as the algorithm to classify a dichotomous outcome. Thanks.
https://docs.databricks.com/machine-learning/automl/how-automl-works.html#imbalanced-dataset-support-for-classification-problems
Reproducible code is here:
Sparklyr Spark ML Feature Importance after feature transformation
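For reference, the underlying Spark ML estimator does accept a per-row instance weight column (weightCol, available on RandomForestClassifier since Spark 3.0), so one common workaround is to add a column that weights rows inversely to class frequency. The sketch below shows that mechanism in pyspark; whether your sparklyr version exposes a matching weight-column argument on ml_random_forest_classifier depends on the release, so treat this as the idea rather than the exact API.

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Toy stand-in for the imbalanced data: ~3% positive labels.
df = spark.range(0, 10000).select(
    (F.rand(seed=1) * 10).alias("x1"),
    (F.rand(seed=2) * 10).alias("x2"),
    (F.rand(seed=3) < 0.03).cast("double").alias("label"),
)

# Weight each row inversely to its class frequency.
n = df.count()
n_pos = df.filter(F.col("label") == 1.0).count()
df = df.withColumn(
    "weight",
    F.when(F.col("label") == 1.0, n / (2.0 * n_pos))
     .otherwise(n / (2.0 * (n - n_pos))),
)

assembled = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            weightCol="weight", numTrees=50)
model = rf.fit(assembled)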

Related

A huge number of discrete features

I'm developing a regression model, but I ran into a problem when preparing the data: 17 of the 20 features are categorical, and each of them has a lot of categories. With one-hot encoding, my data table is transformed into a 10000x6000 table. How should I prepare this kind of data?
I used PCA to try to reduce the dimensionality, but capturing even 70% of the variance still takes about 2500 components. That's why I'm asking here.
Unfortunately, I can't attach the dataset, as it is confidential.
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace, so it can only give a linear mapping of the data. Autoencoders, on the other hand, can learn a non-linear mapping, and so are able to represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
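If you want to try that, here is a minimal Keras sketch of such an autoencoder; the layer widths, the 64-dimensional bottleneck, and the random stand-in data are illustrative assumptions, not tuned choices:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 6000  # width of the one-hot encoded table from the question
latent_dim = 64    # hypothetical bottleneck size

# Non-linear activations are what lets this beat a purely linear PCA mapping.
encoder = keras.Sequential([
    layers.Dense(512, activation="relu", input_shape=(n_features,)),
    layers.Dense(latent_dim, activation="relu"),
])
decoder = keras.Sequential([
    layers.Dense(512, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(n_features, activation="sigmoid"),  # one-hot inputs live in [0, 1]
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.rand(1000, n_features)  # stand-in for the real one-hot matrix
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)
X_compressed = encoder.predict(X)  # use this lower-dimensional representation downstream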
It really depends on exactly what you are trying to do. Computing a covariance matrix (and a PCA decomposition) will give you great insight into which categories tend to occur together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. Random forests can definitely be used for regression, though they need to be trained specifically for that. scikit-learn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it works well on tabular data (as is the case here) and can be trained directly on numerical codes for categorical features, meaning your data vectors need only be of dimension 20!
Decision-tree models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: a random forest can learn even from numerical codes for categories, so you can avoid creating incredibly large data vectors; a sketch follows below.
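A minimal sketch of that suggestion, with a random stand-in for the confidential data (shapes follow the question: 17 categorical plus 3 numeric features):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
cat = pd.DataFrame(rng.integers(0, 300, size=(1000, 17)).astype(str))  # 17 categorical features
num = rng.normal(size=(1000, 3))                                       # 3 numeric features
y = rng.normal(size=1000)                                              # regression target

# Integer codes keep the design matrix at 20 columns instead of ~6000.
X = np.hstack([OrdinalEncoder().fit_transform(cat), num])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

Note that ordinal codes impose an arbitrary order on the categories; tree ensembles usually tolerate this well, but it is not quite the same as true categorical splits.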

Cluster Analysis in Weka, with the Bank dataset

I am using the bank dataset for cluster analysis in Weka.
The dataset is here: bank-k.arff
However, the dataset appears not to be ideal for cluster analysis: with algorithms like EM, k-means, and Cobweb, the results do not show clear trends, and classes-to-clusters evaluation yields unhelpful results. For instance, when using the region attribute as the class:
[Screenshot: bank dataset clustering results in Weka]
Most of the graphs look like this, which is not helpful.
In contrast, the iris dataset does produce clearly distinguishable clusters:
[Screenshot: iris dataset clustering results in Weka]
Am I doing something incorrectly? I am new to this field of study, so please give some advice if possible.
Thank you in advance!

What is the kappa variable? (BayesianOptimization)

I have read some posts and tutorials about BayesianOptimization, but I never saw an explanation of the kappa variable.
What is the kappa variable?
How can it help us?
How can this value influence the BayesianOptimization process?
The kappa parameter, along with xi, controls how the Bayesian optimization acquisition function balances exploration against exploitation.
Higher kappa values mean more exploration and less exploitation and vice versa for low values. Exploration pushes the search towards unexplored regions and exploitation focuses on results in the vicinity of the current best results by penalizing for higher variance values.
It may be beneficial to begin with the default kappa value at the start of optimization and then lower it if you narrow the search space.
In scikit-optimize, kappa is only used if the acquisition function acq_func is set to “LCB”, and xi is used when acq_func is “EI” or “PI”, where LCB is Lower Confidence Bound, EI is Expected Improvement, and PI is Probability of Improvement.
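For illustration, a minimal scikit-optimize sketch with a made-up one-dimensional objective:

from skopt import gp_minimize

def objective(params):
    x = params[0]
    return (x - 2.0) ** 2  # toy objective with its minimum at x = 2

result = gp_minimize(
    objective,
    dimensions=[(-5.0, 5.0)],  # search space for x
    acq_func="LCB",            # kappa only applies to LCB
    kappa=1.96,                # the default; raise it to explore more
    n_calls=20,
    random_state=0,
)
print(result.x, result.fun)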
Similarly for the BayesianOptimization package:
acq: {'ucb', 'ei', 'poi'}
The acquisition method used.
* 'ucb' stands for the Upper Confidence Bounds method
* 'ei' is the Expected Improvement method
* 'poi' is the Probability Of Improvement criterion.
Mathematical details on acquisition functions
Note, the BayesianOptimization package and scikit-optimize use different default kappa values: 2.576 and 1.96 respectively.
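And a matching sketch against the classic bayes_opt API quoted above (newer releases moved the acquisition settings into the constructor, so treat the exact signature as version-dependent):

from bayes_opt import BayesianOptimization

def black_box(x):
    return -(x - 2.0) ** 2  # the package maximizes, so negate a loss

optimizer = BayesianOptimization(f=black_box, pbounds={"x": (-5.0, 5.0)}, random_state=0)
# 'ucb' uses kappa (default 2.576); 'ei'/'poi' would use xi instead.
optimizer.maximize(init_points=5, n_iter=15, acq="ucb", kappa=2.576)
print(optimizer.max)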
There is a decent exploration vs exploitation example in the scikit-optimize docs.
There is a similar BayesianOptimization exploration vs exploitation example notebook.
FWIW, I've used both packages and gotten OK results. I find the scikit-optimize plotting functions useful when fine-tuning the parameter search space.

Why is the explained variance ratio so low in a binary classification problem?

My input data $X$ has 100 samples and 1724 features. I want to classify my data into two classes, but the explained variance ratio values are very low, around 0.05 or 0.04, no matter how many principal components I keep when performing PCA. I have seen in the literature that the explained variance ratio is usually close to 1. I'm unable to understand my mistake. I tried performing PCA with fewer features and more samples, but it had no significant effect on my result.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA and project the data onto the first 10 principal components.
pca = PCA(n_components=10).fit(X)
Xtrain = pca.transform(X)

# Fraction of the total variance explained by each component.
EX = pca.explained_variance_ratio_

fig, ax1 = plt.subplots(ncols=1, nrows=1)
ax1.plot(np.arange(len(EX)), EX, marker='o', color='blue')
ax1.set_ylabel(r"$\lambda$")
ax1.set_xlabel("l")
plt.show()
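One thing worth checking: the values close to 1 reported in the literature are usually the cumulative explained variance over all retained components, not the per-component ratio this plot shows. Continuing from the snippet above:

# Cumulative fraction of the total variance captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[-1])  # variance retained by all 10 components together

Also note that PCA can extract at most min(n_samples, n_features) components, so 100 samples cap you at 100 components, and with 1724 features on different scales it is usually worth standardizing X (e.g. with sklearn's StandardScaler) before fitting.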

Calculate average and class-wise precision/recall for multiple classes in TensorFlow

I have a multiclass model with 4 classes. I have already implemented a callback able to calculate the precision/recall for each class and their macro average. But for some technical reason, I have to calculate them using the metrics mechanism.
I'm using TensorFlow 2 and Keras 2.3.0. I have already used tensorflow.keras.metrics.Recall/Precision to get the class-wise metrics:
metrics_list = ['accuracy']
metrics_list.extend([Recall(class_id=i, name="recall_{}".format(label_names[i])) for i in range(n_category)])
metrics_list.extend([Precision(class_id=i, name="precision_{}".format(label_names[i])) for i in range(n_category)])
model = Model(...)
model.compile(...metrics=metrics_list)
However, this solution is not satisfying:
Firstly, tensorflow.keras.metrics.Recall/Precision use a threshold to decide class membership, whereas when class_id is defined they should use argmax to pick the most probable class.
Secondly, I have to create 2 new metrics to calculate the average over all classes, which itself requires calculating the class-wise metrics; computing the same thing twice is inelegant and inefficient.
Is there a way to create a class or a function that would calculate the class-wise and the average precision/recall directly using the TensorFlow/Keras metrics logic?
Apparently I can easily obtain the confusion matrix using tf.math.confusion_matrix(). However, I do not see how to return a list of scalars at once instead of a single scalar.
Any comment is welcome!
It turns out that in my very specific case, I can simply use CategoricalAccuracy() as the only metric because I'm using batch_size=1. In this case, accuracy = recall = precision ∈ {0., 1.} for each batch. That only partially solves the problem. The best solution would be to update a confusion matrix using argmax at the end of each batch, then calculate the precision/recall from it. I don't know yet how to do that, but it should be doable; a sketch of the idea follows below.
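For what it's worth, here is a hedged sketch of that idea: accumulate an argmax-based confusion matrix across batches and derive the macro-averaged recall from it. MacroRecall is a made-up class (not an existing Keras metric), it assumes one-hot y_true, and a macro-precision twin would divide by column sums instead of row sums.

import tensorflow as tf

class MacroRecall(tf.keras.metrics.Metric):
    def __init__(self, num_classes, name="macro_recall", **kwargs):
        super().__init__(name=name, **kwargs)
        self.num_classes = num_classes
        self.cm = self.add_weight(
            name="cm", shape=(num_classes, num_classes), initializer="zeros"
        )

    def update_state(self, y_true, y_pred, sample_weight=None):
        # argmax picks the most probable class, avoiding the threshold issue.
        true_idx = tf.argmax(y_true, axis=-1)
        pred_idx = tf.argmax(y_pred, axis=-1)
        self.cm.assign_add(tf.math.confusion_matrix(
            true_idx, pred_idx, num_classes=self.num_classes, dtype=tf.float32
        ))

    def result(self):
        # Per-class recall = diagonal / row sums; macro average over classes.
        per_class = tf.linalg.diag_part(self.cm) / (
            tf.reduce_sum(self.cm, axis=1) + tf.keras.backend.epsilon()
        )
        return tf.reduce_mean(per_class)

    def reset_state(self):  # named reset_states in older Keras releases
        self.cm.assign(tf.zeros_like(self.cm))

It would then be passed like any other metric, e.g. model.compile(..., metrics=[MacroRecall(num_classes=4)]).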