Multiclass tidymodel - class of outcome variable? - tidyverse

I want to do multiclass classification, and my y-variable is a character vector with three levels ("CD", "UC", "IBS").
How can I transform my y-variable into a factor, or whatever else the model will accept?
My model code:
boost_tree(trees = 50) %>%
  set_engine("xgboost") %>%
  set_mode("classification") %>%
  fit(diagnosis ~ ., data = train)
Error in check_outcome():
! For a classification model, the outcome should be a factor.
Backtrace:
... %>% fit(diagnosis ~ ., data = train)
parsnip::fit.model_spec(., diagnosis ~ ., data = train)
parsnip:::form_xy(...)
parsnip:::check_outcome(env$y, object)
Thanks a lot!

Before you do anything else (like data splitting or resampling), you can make it a factor via
train$diagnosis <- factor(train$diagnosis)
See the help file for factor(); there are other arguments you can set, such as the order of the factor levels.
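For example, a minimal sketch that also fixes the level order explicitly (using the three level names from your question):

train$diagnosis <- factor(train$diagnosis, levels = c("CD", "UC", "IBS"))
levels(train$diagnosis)
#> [1] "CD"  "UC"  "IBS"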

Related

Plotting grouped data in R (both numeric and categorical variables in X axis)

Super new to R here. I'm trying to plot a graph to visualise my aggregated group data (a mix of numeric and categorical variables). Please help, anyone!
DD %>%
  select(Age_start_treatment, Skeletal_AP, Sex, Treatment_time) %>%
  group_by(Age_start_treatment, Skeletal_AP, Sex) %>%
  summarize(avg_total_treatment_time = mean(Treatment_time, na.rm = TRUE)) %>%
I'm unable to figure out the next step for the life of me, but I know it involves ggplot().
I need the best chart to plot the patients' age, skeletal class dimension (I, II or III) and sex against the total treatment time.
Thanks!
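One possible next step, offered only as a sketch (it assumes the summarised data frame produced by the pipeline above; which chart is "best" depends on your data): map age to the x-axis and the mean treatment time to the y-axis, use colour for the skeletal class and facets for sex.

library(dplyr)
library(ggplot2)

DD %>%
  select(Age_start_treatment, Skeletal_AP, Sex, Treatment_time) %>%
  group_by(Age_start_treatment, Skeletal_AP, Sex) %>%
  summarize(avg_total_treatment_time = mean(Treatment_time, na.rm = TRUE)) %>%
  ggplot(aes(x = Age_start_treatment,
             y = avg_total_treatment_time,
             colour = Skeletal_AP)) +
  geom_point() +
  facet_wrap(~ Sex) +
  labs(x = "Age at start of treatment",
       y = "Mean total treatment time")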

How to: TensorFlow-Probability custom loss that ignores NA values (or otherwise masks loss)

I seek to implement a masked loss function in TensorFlow-Probability, one that can ignore NAs in the labels.
This is a well-worn task for regular tensors, but I cannot find an example for distributions.
My distributions are sized (batch, time-steps, outputs), i.e. (512, 251 days, 1 to 8 time series).
The traditional loss function given in examples uses the distribution's log probability:
neg_log_likelihood <- function(x, rv_x) {
  -1 * (rv_x %>% tfd_log_prob(x))
}
When I replace NAs with zeros, the model trains fine and converges. When I leave the NAs in, it produces NaN losses, as expected.
I've experimented with many different permutations of tf$where to replace the loss with 0, the label with 0, etc. In each of those cases the model stops training and the loss stays near some constant, even when there is just a single NA in the labels.
neg_log_likelihood_missing <- function(x, rv_x) {
  loss <- -1 * (rv_x %>% tfd_log_prob(x))
  loss_nonan <- tf$where(tf$math$is_finite(x), loss, 0)
  return(loss_nonan)
}
My use of R here is incidental, and I can translate any examples in Python or otherwise. If there's a correct way to do this so that losses correctly back-propagate, I would greatly appreciate it.
If you are using gradient-based inference, you may need the "double-where" trick.
While this gets you a correct value of y:
y = computation(x)
tf.where(is_nan(y), 0, y)
...the derivative of the tf.where can still have a NaN.
Instead write:
safe_x = tf.where(is_unsafe(x), some_safe_x, x)
y = computation(safe_x)
tf.where(is_unsafe(x), 0, y)
...to get both a safe y out and a safe dy/dx.
For the case you're considering, perhaps write:
class MyMaskedDist(tfd.Distribution):
    ...
    def _log_prob(self, x):
        safe_x = tf.where(tf.math.is_nan(x), self.mode(), x)
        lp = compute_log_prob(safe_x)
        lp = tf.where(tf.math.is_nan(x), tf.zeros([], lp.dtype), lp)
        return lp
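Translating the double-where trick back to the R loss function in the question might look like the sketch below (assuming the tensorflow and tfprobability R packages; if the distribution's event shape reduces the output dimension, the mask may also need a matching tf$reduce_any() over those axes):

neg_log_likelihood_missing <- function(x, rv_x) {
  is_missing <- tf$math$is_nan(x)
  # double-where, step 1: feed the log-prob a "safe" label so its gradient never sees NaN
  safe_x <- tf$where(is_missing, tf$zeros_like(x), x)
  loss <- -1 * (rv_x %>% tfd_log_prob(safe_x))
  # double-where, step 2: zero out the loss wherever the label was missing
  tf$where(is_missing, tf$zeros_like(loss), loss)
}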

Number at risk for cox regression plot

Can I make a "number at risk" table for a Cox plot if I have more than one independent variable?
If it is possible, where can I find the relevant code? (I searched but couldn't find it.)
The code I used on my data:
fit <- coxph(Surv(time, event) ~ chr1q21_status + CCND1 + CRTM1 + IRF4, data = myeloma)
ggsurvplot(fit, data = myeloma,
           risk.table = TRUE, break.time.by = 365, xlim = c(0, 4000),
           risk.table.y.text = FALSE, legend.labs = c("2", "3", "4+"))
I got this message: "object 'ggsurv' not found", although with only one variable and the function survfit it worked.
"number at risk" table for cox plot
It's not a Cox plot, it's a Kaplan-Meier plot. You're trying to plot a Cox model, when what you want is to fit KM curves using survfit and then to plot the resulting fit:
library("survival")
library("survminer")
fit <- survfit(Surv(time, status) ~ ph.ecog + sex, data = lung)
ggsurvplot(fit, data = lung, risk.table = TRUE)
Since you now mention that you have continuous predictors, perhaps you could think about what you expect an at-risk table or KM plot to show.
Here's an example of binning a continuous measure (age):
library("survival")
library("survminer")
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Loading required package: magrittr
lung$age_bin <- cut(lung$age, quantile(lung$age))
fit <- survfit(Surv(time, status) ~ age_bin + sex, data = lung)
ggsurvplot(fit, data = lung, risk.table = TRUE)
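One small caveat with cut(lung$age, quantile(lung$age)): the intervals are left-open by default, so the patient with the minimum age ends up in an NA bin. Depending on how you want that handled, you may prefer include.lowest = TRUE:

lung$age_bin <- cut(lung$age, quantile(lung$age), include.lowest = TRUE)
# check that no ages fell outside the bins
table(lung$age_bin, useNA = "ifany")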

Using pymc3 to fit lomax model

I have a pretty simple example that doesn't seem to work. My goal is to build a Lomax model, and since PyMC3 doesn't have a Lomax distribution I use the fact that an Exponential mixed with a Gamma is a Lomax (see here):
import pymc3 as pm
from scipy.stats import lomax

# Generate artificial data with a shape and scale parameterization
data = lomax.rvs(c=2.5, scale=3, size=1000)

# if t ~ Exponential(lamda) and lamda ~ Gamma(shape, rate), then t ~ Lomax(shape, rate)
with pm.Model() as hierarchical:
    shape = pm.Uniform('shape', 0, 10)
    rate = pm.Uniform('rate', 0, 10)
    lamda = pm.Gamma('lamda', alpha=shape, beta=rate)
    t = pm.Exponential('t', lam=lamda, observed=data)
    trace = pm.sample(1000, tune=1000)
The summary is:
>>> pm.summary(trace)
           mean        sd  mc_error   hpd_2.5  hpd_97.5   n_eff      Rhat
shape  4.259874  2.069418  0.060947  0.560821  8.281654  1121.0  1.001785
rate   6.532874  2.399463  0.068837  2.126299  9.998271  1045.0  1.000764
lamda  0.513459  0.015924  0.000472  0.483754  0.545652  1096.0  0.999662
I would expect the shape and rate estimates to be close to 2.5 and 3 respectively. I tried various non-informative priors for shape and rate, including pm.HalfFlat() and pm.Uniform(0, 100) but both resulted in worse fits. Any ideas?
Figured it out: to derive a Lomax from an exponential-gamma mixture, I need to specify a lamda for each example in the dataset (lamda = pm.Gamma('lamda', alpha=shape, beta=rate, shape=len(data))). This is because the model assumes each subject in the data has its own lamda_i, where lamda_i ~ Gamma(shape, rate) for every i.

Linear Regression overfitting

I'm working through course 2 (linear regression) of this Coursera specialization (https://www.coursera.org/specializations/machine-learning).
I've solved it using GraphLab but wanted to try out sklearn for the experience and learning. I'm using sklearn and pandas for this.
The model overfits the data. How can I fix this? This is the code.
These are the coefficients I'm getting:
[ -3.33628603e-13   1.00000000e+00]
poly1_data = polynomial_dataframe(sales["sqft_living"], 1)
poly1_data["price"] = sales["price"]

model1 = LinearRegression()
model1.fit(poly1_data, sales["price"])
print(model1.coef_)

plt.plot(poly1_data['power_1'], poly1_data['price'], '.',
         poly1_data['power_1'], model1.predict(poly1_data), '-')
plt.show()
The plotted line looks like this; as you can see, it connects every data point. And this is the plot of the input data.
I wouldn't even call this overfitting. I'd say you aren't doing what you think you're doing. In particular, you forgot to add a column of 1's to your design matrix, X. For example:
import numpy as np
import pandas as pd

# generate some univariate data
x = np.arange(100)
y = 2*x + x*np.random.normal(0, 1, 100)
df = pd.DataFrame([x, y]).T
df.columns = ['x', 'y']
You're doing the following:
model1 = LinearRegression()
X = df["x"].values.reshape(1,-1)[0] # reshaping data
y = df["y"].values.reshape(1,-1)[0]
model1.fit(X,y)
Which leads to:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(X[0], model1.predict(X)[0],'-')
plt.show()
Instead, you want to add a column of 1's to your design matrix (X):
X = np.column_stack([np.ones(len(df['x'])), df["x"].values.reshape(1, -1)[0]])
y = df["y"].values.reshape(-1, 1)  # column vector: one row per sample
model1.fit(X, y)
And (after some reshaping) you get:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(df['x'].values, model1.predict(X),'-')
plt.show()