Using SMOTE with NaN values - missing-data

Is there a way one can use SMOTE with NaNs?
Here is a dummy program that tries to use SMOTE in the presence of NaN values:
# Imports
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import Imputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
# Initial number of samples per class
print('Number of samples for both classes: {} and {}.'.format(*Counter(y).values()))
# SMOTEd class distribution
print('Dataset has %s missing values.' % np.isnan(X).sum())
_, y_resampled = SMOTE().fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
# Generate artificial missing values
X[X > 1.0] = np.nan
print('Dataset has %s missing values.' % np.isnan(X).sum())
#_, y_resampled = make_pipeline(Imputer(), SMOTE()).fit_sample(X, y)
sm = SMOTE(ratio='auto', k_neighbors=5, n_jobs=-1)
smote_enn = SMOTEENN(smote = sm)
x_train_res, y_train_res = smote_enn.fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_train_res).values()))
I get the following output/error:
Number of samples for both classes: 212 and 357.
Dataset has 0 missing values.
Number of samples for both classes: 357 and 357.
Dataset has 6051 missing values.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

You already included the answer in your commented-out line. Notice that fit_resample is used instead of fit_sample. You should use make_pipeline as follows:
# Imports
import numpy as np
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
X[X > 1.0] = np.nan
# Over-sampling
smote = SMOTE(ratio='auto', k_neighbors=5, n_jobs=-1)
smote_enn = make_pipeline(SimpleImputer(), SMOTEENN(smote=smote))
_, y_res = smote_enn.fit_resample(X, y)
# Class distribution
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_res).values()))
Also check your imbalanced-learn version; newer releases use fit_resample instead of fit_sample.

Generally no: SMOTE prepares a data set for further model fitting.
Usual models (random forest, etc.) do not work with NA in the label variable, because what would you actually be predicting? The same goes for NA in the predictor variables, where most algorithms either do not work or simply ignore cases with NA.
So the error is pretty much by design: you cannot, and should not, have missing values in the training data you feed to the algorithm, and logically you do not want to "balance" cases with missing values; you only want to SMOTE cases with valid labels.
If you feel that missing labels still represent valid information that should be balanced (e.g. you actually want to oversample the NA class because you think it is underrepresented), then it should not be a missing value but rather a defined value such as "Unknown", indicating a known class with the characteristic "NA". However, I do not really see many research questions where this makes sense.
Update 1:
Another way to go is to impute the missing values first, so that you actually have three steps in fitting your model (see the sketch after this list):
Imputing missing values (using MICE or similar)
SMOTE to balance training set
Fit algorithm/model
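A minimal sketch of these three steps chained as a single imbalanced-learn pipeline; SimpleImputer (mean imputation rather than MICE) and RandomForestClassifier are illustrative choices here, not prescribed by the answer:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
# Data with artificial missing values, as in the question
bc = load_breast_cancer()
X, y = bc.data, bc.target
X[X > 1.0] = np.nan
# 1) impute, 2) SMOTE the now-complete training data, 3) fit the model
pipe = make_pipeline(SimpleImputer(strategy='mean'), SMOTE(), RandomForestClassifier())
pipe.fit(X, y)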

Related

TensorFlow Probability (tfp) equivalent of np.quantile()

I am trying to find a TensorFlow equivalent of np.quantile(). I have found tfp.stats.quantiles() (tfp stands for TensorFlow Probability). However, its behavior is a bit different from that of np.quantile().
Consider the following example:
import tensorflow_probability as tfp
import tensorflow as tf
import numpy as np
inputs = tf.random.normal((1, 4096, 4))
print("NumPy")
print(np.quantile(inputs.numpy(), q=0.9, axis=1, keepdims=False))
I am not sure from the TFP docs how I could write the above using tfp.stats.quantiles(). I tried checking out the source code of both methods, but it didn't help.
Let me try to be more helpful here than I was on GitHub.
There is a difference in behavior between np.quantile and tfp.stats.quantiles. The key difference here is that numpy.quantile will
Compute the q-th quantile of the data along the specified axis.
where q is the
Quantile or sequence of quantiles to compute, which must be between 0 and 1 inclusive.
and tfp.stats.quantiles
Given a vector x of samples, this function estimates the cut points by returning num_quantiles + 1 cut points
So you need to tell tfp.stats.quantiles how many quantiles you want and then select out the qth quantile. If it isn't clear how to do this just from the API docs, looking at the source of tfp.stats.quantiles (for v0.19.0) shows how to get a return structure similar to NumPy's.
For completeness, setting up a virtual environment with
$ cat requirements.txt
numpy==1.24.2
tensorflow==2.11.0
tensorflow-probability==0.19.0
allows us to run
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
inputs = tf.random.normal((1, 4096, 4), dtype=tf.float64)
q = 0.9
numpy_quantiles = np.quantile(inputs.numpy(), q=q, axis=1, keepdims=False)
tfp_quantiles = tfp.stats.quantiles(
    inputs, num_quantiles=100, axis=1, interpolation="linear"
)[int(q * 100)]
assert np.allclose(numpy_quantiles, tfp_quantiles.numpy())
print(f"{numpy_quantiles=}")
# numpy_quantiles=array([[1.31727661, 1.2699167 , 1.28735237, 1.27137588]])
print(f"{tfp_quantiles=}")
# tfp_quantiles=<tf.Tensor: shape=(1, 4), dtype=float64, numpy=array([[1.31727661, 1.2699167 , 1.28735237, 1.27137588]])>
You could also use tfp.stats.percentile(inputs, 90., axis=1, keepdims=False) -- the only difference from quantile is the 90. replacing .90.
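As a quick sanity check of that equivalence, reusing inputs and numpy_quantiles from the snippet above (interpolation is passed explicitly so it matches NumPy's default linear interpolation):
tfp_percentile = tfp.stats.percentile(
    inputs, 90., axis=1, interpolation="linear", keepdims=False
)
assert np.allclose(numpy_quantiles, tfp_percentile.numpy())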

How to use a custom loss function in GPflow 2?

I am new to GPflow and I am trying to figure out how to write a custom loss function to optimize the model. For my purpose, I need to manipulate the predicted output of the GP through different data treatments, and thus it is the output I get after these treatments that I would like to optimise the GP model against. For that purpose I would like to use the root mean square error as the loss function.
Workflow:
Input -> GP model -> GP_output -> Data treatment -> Predicted_output -> RMSE(Predicted_output, Observations)
I hope this makes sense.
Normally models are optimised doing something like this:
import gpflow as gf
import numpy as np
X = np.linspace(0, 100, num=100).reshape(-1, 1)  # GPflow expects 2-D (N, 1) arrays
n = np.random.normal(scale=8, size=X.shape)
y_obs = 10 * np.sin(X) + n
model = gf.models.GPR(
    data=(X, y_obs),
    kernel=gf.kernels.SquaredExponential(),
)
gf.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, options=optimizer_config
)
I have figured out how to do a workaround using the scipy minimize function to optimise using RMSE, but I would like to stay within the GPflow framework, where I can just input model.trainable_variables as argument, and have a general function that also works if I have multiple input/output dimensions.
def objective_func(params):
    model.kernel.lengthscales.assign(params[0])
    model.kernel.variance.assign(params[1])
    model.likelihood.variance.assign(params[2])
    GP_output = model.predict_y(X)[0]
    GP_output = GP_output.numpy()
    Predicted_output = data_treatment_func(GP_output)
    return np.sqrt(np.square(np.subtract(Predicted_output, y_obs)).mean())

from scipy.optimize import minimize
res = minimize(objective_func, x0=(1.0, 1.0, 1.0))
I found the answer myself.
If you write your objective_func using TensorFlow instead of NumPy (e.g. tf.math.sqrt, tf.reduce_mean) you can simply pass that to gf.optimizers.Scipy().minimize(...) instead of model.training_loss:
import tensorflow as tf

def objective_func():
    GP_output = model.predict_y(X)[0]
    Predicted_output = data_treatment_func(GP_output)
    return tf.sqrt(tf.reduce_mean(tf.square(Predicted_output - y_obs)))

gf.optimizers.Scipy().minimize(
    objective_func, model.trainable_variables, options=optimizer_config
)
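For reference, here is a self-contained sketch of that approach; data_treatment_func is a hypothetical placeholder (a simple affine transform) standing in for the real treatment, and X is reshaped to the (N, 1) shape GPflow expects:
import gpflow as gf
import numpy as np
import tensorflow as tf

X = np.linspace(0, 100, num=100).reshape(-1, 1)
y_obs = 10 * np.sin(X) + np.random.normal(scale=8, size=X.shape)
model = gf.models.GPR(data=(X, y_obs), kernel=gf.kernels.SquaredExponential())

def data_treatment_func(gp_output):
    # hypothetical treatment; it must use TensorFlow ops so gradients can flow
    return 2.0 * gp_output + 1.0

def objective_func():
    GP_output = model.predict_y(X)[0]
    Predicted_output = data_treatment_func(GP_output)
    return tf.sqrt(tf.reduce_mean(tf.square(Predicted_output - y_obs)))

gf.optimizers.Scipy().minimize(objective_func, model.trainable_variables)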

print multiple random permutations

I am trying to generate multiple permutations, starting from this code:
# generate random Gaussian values
from numpy.random import seed
from numpy.random import randn
# seed random number generator
seed(1)
# generate some Gaussian values
values = randn(100)
print(values)
But now I would like to generate, for example, 20 permutations (of values).
With code:
import numpy as np
import random
from itertools import permutations
result = np.random.permutation(values)
print(result)
I can only observe one permutation (or get further ones "manually"). I would like to have many permutations (20 or more) and then automatically calculate the Durbin-Watson statistic for each permutation (of values).
from statsmodels.stats.stattools import durbin_watson
durbin_watson(np.random.permutation(values))
How can I do this?
To get 20 permutations out of some collection, initialize the itertools.permutations iterator and then use next() to take the first twenty:
import numpy as np
import itertools as it
x = np.random.random(100) # 100 random floats
p = it.permutations(x) # an iterator which produces permutations (DON'T TRY TO CALCULATE ALL OF THEM)
first_twenty_permutations = [next(p) for _ in range(20)]
Of course, these won't be random permutations (i.e., they are calculated in an organized manner, try with it.permutations("abcdef") and you'll see what I mean). If you need random permutations, you can use np.random.permutation much in the same way:
[np.random.permutation(x) for _ in range(20)]
To then calculate the Durbin Watson statistic:
from statsmodels.stats.stattools import durbin_watson

permutations = np.array([np.random.permutation(x) for _ in range(20)])
np.apply_along_axis(durbin_watson, axis=1, arr=permutations)
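If you prefer to avoid the Python-level loop, NumPy's Generator API can produce the whole batch of independent permutations at once; this is an alternative sketch, not what the snippet above uses:
rng = np.random.default_rng()
permutations = rng.permuted(np.tile(x, (20, 1)), axis=1)  # each row is shuffled independently
np.apply_along_axis(durbin_watson, axis=1, arr=permutations)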

Evaluation of log density for various values of `mean`

I can evaluate the log probability density of a multivariate normal by doing
import numpy as np
import scipy.stats
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.zeros(2), cov = np.eye(2))
Now, I'm interested in evaluating the log density of the point [0,0] over a variety of values of mean. Here is what I have tried
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.stack([x,y], axis = -1), cov = np.eye(2))
This results in an error: ValueError: Array 'mean' must be a vector of length 5202.
How can I evaluate the log density of a multivariate normal over a variety of values of mean?
As your error suggests, logpdf expects a 1-D array for the mean argument.
Since your covariance matrix is 2x2, you should pass a length-2 vector as mean.
If you want to evaluate the density for multiple mean values, you can use a for loop after flattening x and y as follows:
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
x,y = x.flatten(), y.flatten()
res = []
for i in range(len(x)):
    x_i, y_i = x[i], y[i]
    res.append(scipy.stats.multivariate_normal.logpdf([0,0], mean=[x_i, y_i], cov=np.eye(2)))
You can also use a list comprehension in place of the for loop:
res = [scipy.stats.multivariate_normal.logpdf([0,0], mean=[x_i, y_i], cov=np.eye(2)) for x_i, y_i in zip(x, y)]
To visualize the result you can use matplotlib.pyplot:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(x,y,c=res)
plt.show()
But I don't see the point of evaluating the multivariate Gaussian logpdf over several mean values.
In the case of a multivariate normal distribution, the argument x and the mean m play symmetric roles, as you can see in the exponential term: (x - m)^T Sigma^{-1} (x - m).
What you are doing is therefore equivalent to evaluating the logpdf of a multivariate Gaussian with mean [0,0] and covariance eye(2) at the points you used as means.
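If you do want all those values anyway, here is a minimal vectorised sketch that exploits this symmetry (one call over the whole grid instead of a Python loop):
import numpy as np
import scipy.stats

grid = np.linspace(-2, 2, 51)
x, y = np.meshgrid(grid, grid)
points = np.stack([x, y], axis=-1)  # shape (51, 51, 2)

# By symmetry, the density of [0, 0] under mean m equals the density of m under mean [0, 0]
res = scipy.stats.multivariate_normal.logpdf(points, mean=[0, 0], cov=np.eye(2))
print(res.shape)  # (51, 51)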

Locally weighted smoothing for binary valued random variable

I have a random variable as follows:
f(x) = 1 with probability g(x)
f(x) = 0 with probability 1-g(x)
where 0 < g(x) < 1.
Assume g(x) = x. Let's say I am observing this variable without knowing the function g and obtained 200 samples as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic
list = np.ndarray(shape=(200,2))
g = np.random.rand(200)
for i in range(len(g)):
    list[i] = (g[i], np.random.choice([0, 1], p=[1-g[i], g[i]]))
print(list)
plt.plot(list[:,0], list[:,1], 'o')
Plot of 0s and 1s
Now, I would like to retrieve the function g from these points. The best I could think of is to draw a histogram and use the mean statistic:
bin_means, bin_edges, bin_number = binned_statistic(list[:,0], list[:,1], statistic='mean', bins=10)
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], lw=2)
Histogram mean statistics
Instead, I would like to have a continuous estimation of the generating function.
I guess it is about kernel density estimation but I could not find the appropriate pointer.
This is straightforward with seaborn, without explicitly fitting an estimator:
import seaborn as sns
g = sns.lmplot(x= , y= , y_jitter=.02, logistic=True)
Plug in x = your exogenous variable and, analogously, y = the dependent variable. y_jitter adds jitter to the points for better visibility if you have a lot of data points. logistic=True is the main point here: it gives you the logistic regression line of the data.
Seaborn is built on top of matplotlib and works great with pandas, in case you want to put your data into a DataFrame.
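For example, a minimal sketch applying this to data generated as in the question (the column names x and y are arbitrary, and logistic=True requires statsmodels to be installed):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample from the question's model with g(x) = x
x = np.random.rand(200)
y = np.array([np.random.choice([0, 1], p=[1 - xi, xi]) for xi in x])
df = pd.DataFrame({"x": x, "y": y})

# Logistic regression fit with jittered points for visibility
sns.lmplot(data=df, x="x", y="y", y_jitter=.02, logistic=True)
plt.show()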