Xgboost cox survival time entry - xgboost

In the new implementation of cox ph survival model in xgboost 0.81 how does one specify start and end time of an event?
Thanks
The R equivalent function would be for example :
cph_mod = coxph(Surv(Start, Stop, Status) ~ Age + Sex + SBP, data=data)

XGBoost does not allow for a start time (i.e. delayed entry). If it makes sense for the application, you can always change the underlying time scale so that all subjects start at time=0. However, XGBoost does allow for right-censored data. I could not find any documentation or example showing how to implement a Cox model, but the source code reads: "Cox regression for censored survival data (negative labels are considered censored)."
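As a minimal sketch of the time-scale change described above: the helper below (to_xgb_cox_labels is a hypothetical name, not part of XGBoost) turns (start, stop, status) triples into the single signed label column that survival:cox reads, assuming it is acceptable for the application to measure time from each subject's own entry. This only resets the clock; it is not equivalent to the counting-process Surv(Start, Stop, Status) formulation in R.

```python
import numpy as np

def to_xgb_cox_labels(start, stop, event):
    """Hypothetical helper: signed durations for objective='survival:cox'.

    Positive label = event observed after `stop - start` time units,
    negative label = right-censored after `stop - start` time units.
    """
    start = np.asarray(start, dtype=float)
    stop = np.asarray(stop, dtype=float)
    event = np.asarray(event, dtype=bool)
    duration = stop - start  # every subject now starts at time 0
    if np.any(duration <= 0):
        raise ValueError("stop must be strictly greater than start")
    return np.where(event, duration, -duration)

labels = to_xgb_cox_labels(start=[0, 10, 5], stop=[30, 25, 50], event=[1, 0, 1])
print(labels.tolist())  # [30.0, -15.0, 45.0]
```

The resulting array can be passed as y in the same way as data_y_xgb in the example that follows.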
Here is a short example for anyone who wants to try XGBoost with objective="survival:cox". We can compare the results to the scikit-survival package (sksurv). To make XGBoost more similar to that framework, we use a linear booster instead of a tree booster.
import pandas as pd
import xgboost as xgb
from sksurv.datasets import load_aids
from sksurv.linear_model import CoxPHSurvivalAnalysis
# load and inspect the data
data_x, data_y = load_aids()
data_y[10:15]
Out[586]:
array([(False, 334.), (False, 285.), (False, 265.), ( True, 206.),
(False, 305.)], dtype=[('censor', '?'), ('time', '<f8')])
# Since XGBoost only allows one column for y, the censoring information
# is coded as negative values:
data_y_xgb = [x[1] if x[0] else -x[1] for x in data_y]
data_y_xgb[10:15]
Out[3]: [-334.0, -285.0, -265.0, 206.0, -305.0]
data_x = data_x[['age', 'cd4']]
data_x.head()
Out[4]:
age cd4
0 34.0 169.0
1 34.0 149.5
2 20.0 23.5
3 48.0 46.0
4 46.0 10.0
# Since sksurv outputs log hazard ratios (here relative to 0 on predictors),
# we must use 'output_margin=True' for comparability.
estimator = CoxPHSurvivalAnalysis().fit(data_x, data_y)
gbm = xgb.XGBRegressor(objective='survival:cox',
booster='gblinear',
base_score=1,
n_estimators=1000).fit(data_x, data_y_xgb)
prediction_sksurv = estimator.predict(data_x)
predictions_xgb = gbm.predict(data_x, output_margin=True)
d = pd.DataFrame({'xgb': predictions_xgb,
'sksurv': prediction_sksurv})
d.head()
Out[13]:
sksurv xgb
0 -1.892490 -1.843828
1 -1.569389 -1.524385
2 0.144572 0.207866
3 0.519293 0.502953
4 1.062392 1.045287
d.plot.scatter('xgb', 'sksurv')
Note that these are predictions on the same data that was used to fit the model. It seems that XGBoost gets the values right, but sometimes up to a linear transformation. I do not know why. Play around with base_score and n_estimators. Perhaps someone can add to this answer.
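One thing that might account for a constant offset between the two sets of predictions (though not for a change in slope) is that the Cox partial likelihood only depends on differences between risk scores, so adding a constant to every score leaves it unchanged. The sketch below checks this numerically on synthetic data; cox_partial_loglik is an illustrative helper written for this note, not XGBoost's or sksurv's implementation, and it assumes no tied event times.

```python
import numpy as np

def cox_partial_loglik(scores, time, event):
    """Cox partial log-likelihood, assuming no tied event times."""
    order = np.argsort(-time)  # longest follow-up first
    s = scores[order]
    e = event[order]
    # Running log-sum-exp gives log(sum of exp(score)) over each risk set
    log_risk_set = np.logaddexp.accumulate(s)
    return float(np.sum((s - log_risk_set)[e]))

rng = np.random.default_rng(0)
time = rng.exponential(size=50)   # synthetic follow-up times
event = rng.random(50) < 0.7      # synthetic event indicators
scores = rng.normal(size=50)      # synthetic log hazard ratios

ll_original = cox_partial_loglik(scores, time, event)
ll_shifted = cox_partial_loglik(scores + 3.0, time, event)
print(ll_original, ll_shifted)
```

Because the two values agree, the data alone cannot pin down the intercept of the margin, which is consistent with base_score affecting the offset.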

Related

How to convert a mimira object (Cox regression model, from multiple imputations and a propensity score matching (MatchThem pkg)) into a Forest plot

Dear StackOverflow community,
as a surgeon, and full of enthusiasm for 6 months for R learning in self-taught mode (StackOverflow, and so many websites), I beg your indulgence in the triviality of my concern.
The background:
Briefly, my objective is to run a survival cox model regression for a dataset of cancer patients. Due to the retrospective aspect, I planned to make a matching 1:3 with propensity score matching (PSM). The missing data were dealt with multiple imputations ("mice" pkg). The PSM was managed with "MatchThem" pkg.
I used "survey" pkg for pooling the survival (svycoxph() pooled through with() function). This leads us to a mimira object, which I can easily print out into a beautiful Table, with tbl_regression ("gtsummary" pkg).
The issue:
As I usually print my Cox regressions as a hazard ratio table and a graphical version (forest plot with ggforest(), from the "survminer" pkg), this time I am really stuck. The function ggforest() doesn't recognize the mimira object as a "coxph" object and throws this error:
Error in ggforest(tbl_regression_object, data = mimira_object) :
inherits(model, "coxph") is not TRUE
I guess that adding a PSM to my multiple imputations is the problem, as I had no problem printing Cox regressions of multiple imputations as a forest plot (ggforest is able to deal with mira objects without problem via the pool_and_tidy_mice() function).
Here is the script:
#Data
library(fabricatr)
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
N = 2000,
age = runif(N, min = 18, max = 85),
is_female = draw_binary(prob = 0.5, N = N),
is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
treatment = draw_binary(prob = 0.5, N = N),
kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
lambdas = 0.1, gammas = 1.8,
x = participant_data,
betas = c(is_female = -0.2, is_smoker = 1.2,
treatment = -0.4, kps = -0.005,
disease_stage = 0.2),
maxt = 5)
# Merging df
library(dplyr)
mydata_complete <- bind_cols(survival_data, participant_data)
# generating missing value
library(missMethods)
mydata_uncomp <- delete_MCAR(mydata_complete, 0.3)
mydata <- mydata_uncomp
#1 imputation with "mice"
library(mice)
mydata$nelsonaalen <- nelsonaalen(mydata, eventtime, status)
mydata_mice_imp_m3 <- mice(mydata, maxit = 2, m = 3, seed = 20200801) # m=3 is for testing
#2 matching (PSM 1:3) with "MatchThem"
library(MatchThem)
mydata_imp_m3_psm <- matchthem(treatment ~ age + is_female + disease_stage, data = mydata_mice_imp_m3, approach = "within" ,ratio= 1, method = "optimal")
#3 Pooling Coxph models in multiple imputed datasets and PSM with "survey"
library(survey)
mimira_object <- with(data = mydata_imp_m3_psm, expr = svycoxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage))
pool_and_tidy_mice(mimira_object, exponentiate = TRUE, conf.int=TRUE) -> pooled_imp_m3_cph
# Estimates with pool_and_tidy_mice() work with mimira_object, but it cannot give me the degrees of freedom. Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_imp_m3_cph
term estimate std.error statistic p.value conf.low conf.high b df dfcom fmi lambda m riv ubar
1 age 0.9995807 0.001961343 -0.2138208 NaN NaN NaN 1.489769e-06 NaN Inf NaN 0.5163574 3 1.067643 1.860509e-06
2 is_smoker 2.8626952 0.093476026 11.2516931 NaN NaN NaN 4.182884e-03 NaN Inf NaN 0.6382842 3 1.764601 3.160589e-03
3 disease_stage 1.2386947 0.044092483 4.8547535 NaN NaN NaN 8.995628e-04 NaN Inf NaN 0.6169374 3 1.610540 7.447299e-04
#4 Table summary of the pooled results
library(gtsummary)
tbl_regression_object <- tbl_regression(mimira_object, exp=TRUE, conf.int = TRUE) # 95% CI and p-value are missing because of another issue in the pooling of the mimira_object: the MatchThem:::get.2dfcom function gives dfcom = 999999 (another issue I still need to solve)
#5 What it should look like as a graphical summary
library(survival)
mydata.cox <- coxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage, mydata_uncomp) # (df mydata_uncomp is without imputation and PSM)
#with gtsummary
forestGT <-
mydata.cox %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot()
(forestGT) # See picture GT_plot1. Almost perfect. Would have been great to know how to add N, 95% CI, HR, p-value and parameters of the model (AIC, events, concordance, etc.)
#with survminer
HRforest <-
survminer::ggforest(mydata.cox, data = mydata_uncomp)
(HRforest) # See picture Ggforest. Everything I need to know about my cox regression is all in there. For me it is just a great regression cox forest plot.
#6 Actually what happens when I do the same thing with imputed and matched df
#with gtsummary
forestGT_imp_psm <-
mimira_object %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot() # WARNING message : In get.dfcom(object, dfcom) : Infinite sample size assumed.
(forestGT_imp_psm) # See picture GT_plot2. The plot is rendered but without the 95% CI
#with survminer
HRforest_imp_psm <-
ggforest(mimira_object, data = mydata_imp_m3_psm) # ERROR:in ggforest(mimira_object, data = mydata_imp_m3_psm) : inherits(model, "coxph") is not TRUE
(HRforest_imp_psm)
#7 The lucky and providential step
# your solution/advice
Would greatly appreciate your help.
cheers.
AK
Pictures (images could not be embedded in this post; share links in the original): GT_plot1, Ggforest_plot, GT_plot2
It seems that there are two distinct problems here:
Problem #1. Getting gtsummary to produce a table with p-values and confidence intervals for the pooled, matched data.
Problem #2. Getting ggforest() to produce a plot of the pooled estimates.
Problem #1:
Let us follow the instructions in the paper "MatchThem:: Matching and Weighting after Multiple Imputation" (https://arxiv.org/ftp/arxiv/papers/2009/2009.11772.pdf) [page 15]
and modify your block #3. Instead of calling pool_and_tidy_mice() we do the following:
matched.results <- pool(mimira_object)
summary(matched.results, conf.int = TRUE)
This produces the following:
           term      estimate   std.error  statistic        df      p.value        2.5 %     97.5 %
1           age -0.0005997864 0.001448251 -0.4141453 55.266353 6.803707e-01 -0.003501832 0.00230226
2     is_smoker  1.1157796620 0.077943244 14.3152839  9.961064 5.713387e-08  0.942019234 1.28954009
3 disease_stage  0.2360965310 0.051799813  4.5578645  3.879879 1.111782e-02  0.090504018 0.38168904
This means that performing the imputation with mice and then matching with MatchThem works, since you do get the p values and the confidence intervals.
Compare to the output from pool_and_tidy_mice():
           term      estimate   std.error  statistic p.value            b  df dfcom fmi    lambda m
1           age -0.0005997864 0.001448251 -0.4141453     NaN 2.992395e-07 NaN   Inf NaN 0.1902260 3
2     is_smoker  1.1157796620 0.077943244 14.3152839     NaN 2.041627e-03 NaN   Inf NaN 0.4480827 3
3 disease_stage  0.2360965310 0.051799813  4.5578645     NaN 1.444843e-03 NaN   Inf NaN 0.7179644 3
        riv         ubar
1 0.2349124 1.698446e-06
2 0.8118657 3.352980e-03
3 2.5456522 7.567636e-04
Where everything is the same except for df and p.value which were not calculated in the latter table.
I therefore think this is an issue with the pool_and_tidy_mice() and you should post this as an issue on GitHub at gtsummary.
For now, you can bypass this problem by changing svycoxph() to survival::coxph() in block #3 when you call the with() function. If you do that, you will eventually get a gtsummary table with p-values and confidence intervals. Ultimately, the problem is probably some interaction between svycoxph() and pool_and_tidy_mice(), which is why I believe you should post this on GitHub.
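For readers wondering how the columns of the pooled table above relate to each other, here is a small sketch of Rubin's rules. It is written in Python purely for illustration (the thread itself uses R), and pool_rubin is a hypothetical helper, not a function from mice or MatchThem. With the ubar, b and m values for "age" from the pool_and_tidy_mice() output, it reproduces the std.error, riv and lambda columns, which suggests the pooling of the estimates works and only the degrees of freedom are lost.

```python
import math

def pool_rubin(ubar, b, m):
    """Hypothetical helper applying Rubin's rules to pooled variances.

    ubar: mean within-imputation variance
    b:    between-imputation variance
    m:    number of imputations
    """
    total_var = ubar + (1 + 1 / m) * b
    riv = (1 + 1 / m) * b / ubar        # relative increase in variance
    lam = (1 + 1 / m) * b / total_var   # share of variance due to missingness
    return math.sqrt(total_var), riv, lam

# Values for "age" copied from the pool_and_tidy_mice() table above
se, riv, lam = pool_rubin(ubar=1.698446e-06, b=2.992395e-07, m=3)
print(se, riv, lam)
```

The printed values can be compared with std.error = 0.001448251, riv = 0.2349124 and lambda = 0.1902260 in the table.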
Problem #2:
The short answer is that there cannot be a ggforest plot with all the data that you are looking for.
https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/pool reads:
A common error is to reverse steps 2 and 3, i.e., to pool the multiply-imputed data instead of the estimates. Doing so may severely bias the estimates of scientific interest and yield incorrect statistical intervals and p-values. The pool() function will detect this case.
This means that there is no "real" dataset for the pooled estimates (i.e. you cannot really combine the datasets for imputations 1-3), which means that ggforest() cannot compute the desired plot (since it needs to have a dataset and that cannot be used because it would lead to erroneous estimates).
What you could do, is present all the ggforest plots for each imputation (so if you did 3 imputations, you will get 3 slightly different ggforest plots) and finally add the pooled estimates plot by using plot() as suggested above.
To create each ggforest plot you need the following line of code:
ggforest(mimira_object$analyses[[1]], complete(mydata_imp_m3_psm, 1))
This will create the ggforest plot for your first imputation. Change the numbers to 2 and 3 to check the remaining imputations.
I hope this helped,
Alex
If you provide a reproducible example (i.e. an example on a data set that we can all run on our machines), we can better help you out.
The gtsummary package exports a plot() method you can use to construct a forest plot. Example below!
library(gtsummary)
library(survival)
ggforest <-
coxph(Surv(ttdeath, death) ~ trt + grade, trial) %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot()
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
ggforest
Created on 2021-08-26 by the reprex package (v2.0.1)

How do I pre-process the dataset if the feature ranges are too wide?

I have a dataset with 5 features and each column being in a different range of numbers. I have tried using MinMaxScaler and StandardScaler but the accuracy for this multi-class problem is too low.
If StandardScaler and MinMaxScaler don't have the desired effect, then another thing to check for is skewed data:
# Check the skew of all numerical features
import pandas as pd
from scipy.stats import skew
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
Values closer to zero are better. If you get large absolute values, you can use a transform (log, Box-Cox, etc.) to make the data distribution more normal in shape.
Correcting for skew:
# Keep only the features whose absolute skew exceeds 0.75
skewness = skewness[skewness['Skew'].abs() > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam_f = 0.15
for feat in skewed_features:
    # boxcox1p applies the Box-Cox transform to 1 + x, so no manual "+= 1" is needed
    all_data[feat] = boxcox1p(all_data[feat], lam_f)
Other things to try:
either remove outliers or try RobustScaler()
PowerTransformer()
Reference: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
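To make the effect of such transforms concrete, here is a minimal sketch on synthetic data (the lognormal feature is an assumption standing in for one skewed column of your dataset). It compares the skew before and after a log1p transform and after scipy.stats.boxcox, which estimates its lambda by maximum likelihood and requires strictly positive input.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed feature (stand-in for real data)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_feature = np.log1p(feature)
# With no lambda given, boxcox returns the transformed data and the fitted lambda
boxcox_feature, fitted_lambda = boxcox(feature)

print("raw skew:    ", skew(feature))
print("log1p skew:  ", skew(log_feature))
print("Box-Cox skew:", skew(boxcox_feature))
```

Whether a smaller skew actually improves the classifier still has to be checked on a validation set; the transform only changes the shape of the feature distribution.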

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some binary feature columns (Y). BernoulliNB predicts the probability of X given Y in the test data, based on the training data. I am trying to look up the predicted probability of the observed class for each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
'Y1': [1,1,0,0],
'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
'Y1': [1,1,0,1,0,1,0,0,0],
'Y2': [1,0,1,0,1,0,1,0,1],
'Y3': [1,1,0,1,1,0,0,0,0],
'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST ['X']
df_Tr_Y = TRAIN .drop('X', axis=1)
df_Te_Y = TEST .drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R .join(TEST)
# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S.T, df_S.X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
OK, there are a couple of issues here. I have a full working example below, but first those issues. The main one is the assertion that "This correctly gives a "NaN" for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue, which is the line df_Te_Y = TEST .iloc[ : , 1 : ]. That line should be df_Te_Y = TEST .iloc[ : , 2 : ], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
'X1': [1,1,0,1,0,1,0,0,0],
'X2': [1,0,1,0,1,0,1,0,1],
'X3': [1,1,0,1,1,0,0,0,0],
'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Percent accuracy of prediction.
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_label is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.
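For the lookup step itself (the probability assigned to each row's observed class), here is a minimal vectorised sketch. The proba table is made up for illustration and stands in for a predict_proba result whose columns were renamed to the classifier's classes_. As discussed above, the NaN for an unseen class comes from this lookup, not from the classifier.

```python
import numpy as np
import pandas as pd

# Hypothetical predict_proba output: one column per class seen in training
proba = pd.DataFrame(
    [[0.7, 0.2, 0.1],
     [0.1, 0.6, 0.3],
     [0.2, 0.3, 0.5]],
    columns=[1, 2, 3],
)
observed = pd.Series([1, 5, 3])  # class 5 never appeared in training

col_pos = proba.columns.get_indexer(observed)  # -1 where the class is unknown
row_pos = np.arange(len(proba))
picked = proba.to_numpy()[row_pos, col_pos]    # -1 picks the last column; masked below
pr = np.where(col_pos >= 0, picked, np.nan)
print(pr.tolist())  # [0.7, nan, 0.5]
```

This avoids the nested get() calls and the transpose, at the cost of relying on positional indexing.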

how to input data for shapiro wilk test using python scipy

I am trying to run a normality test on my data.
# Method 1
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
----Output----
statistics=0.582, p=0.000
not Gaussian
When I run it, the result is "not Gaussian". But when I calculate the mean and standard deviation and generate a sample using np.random.normal(mu, sigma, 146) (shown below), the result is "Gaussian".
# Method 2
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
d_mu = np.mean(data)
d_sig = np.std(data)
data = np.random.normal(d_mu,d_sig, 146)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
------ Output ----
statistics=0.987, p=0.212
its Gaussian
I am very new to data analytics. It would be helpful if someone could help me with the questions below:
Which is the right method for the Shapiro test: Method 1 or Method 2?
I have difficulty understanding the np.random.normal(d_mu, d_sig, 146) function. The definition given in the docs is "Draw random samples from a normal (Gaussian) distribution." But what data sample is it generating? We already have data (my input data), and we have calculated the mean and standard deviation to plot the normal distribution, yet the function returns some other data sample, and my Shapiro test passes for that. (I know I am misunderstanding something, but I am not able to decide which one is right.)
I am trying to fit a normal distribution to time series data. Can anyone suggest helpful docs or links on normality testing and normal distributions? Anything that guides me in the right direction.
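A minimal sketch of the difference between the two methods, using synthetic skewed values as a stand-in for the real 146 measurements (which are not shown in full here). Method 1 tests the observed values themselves. Method 2 throws the observed values away and tests brand-new numbers drawn by the random generator, which share only the mean and standard deviation with the data, so its result says nothing about whether the original data are normal.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for the real measurements
observed = rng.lognormal(mean=14.0, sigma=1.0, size=146)

# Method 1: test the observed values
stat_observed, p_observed = shapiro(observed)

# Method 2: draw new normal values with the same mean and std, then test those
synthetic = rng.normal(observed.mean(), observed.std(), size=146)
stat_synthetic, p_synthetic = shapiro(synthetic)

print("observed :", stat_observed, p_observed)
print("synthetic:", stat_synthetic, p_synthetic)
```

On data like this, the W statistic of the observed sample is far below that of the synthetic sample, mirroring the 0.582 versus 0.987 in the question. Only the first result is a statement about the data.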

how is pandas kurtosis defined?

I am trying to get kurtosis using pandas. By doing some exploration, I have
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
however, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to (maybe normalize over N-1 instead of N, but this does not matter here)
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026
The pandas documentation says the following
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.
Pandas is calculating the UNBIASED estimator of the excess Kurtosis. Kurtosis is the normalized 4th central moment. To find the unbiased estimators of the cumulants you need the k-statistics.
So the unbiased estimator of kurtosis is (k4/k2**2)
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want better agreement to more decimal places, you'll need to be careful with the sums as they get rather large. But they're identical to 12 places.
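An equivalent way to write k4/k2**2, which avoids the large raw power sums, is in terms of central moments. The sketch below is a cross-check on synthetic data rather than a quote of pandas internals: g2 is the plain moment-based excess kurtosis (roughly the 2.99 - 3 from the question), and G2 applies the sample-size correction that makes it algebraically equal to k4/k2**2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
n = len(x)

centered = x - x.mean()
m2 = np.mean(centered**2)  # central moments with divide-by-n normalisation
m4 = np.mean(centered**4)
g2 = m4 / m2**2 - 3        # plain excess kurtosis

# Sample-size-corrected excess kurtosis, algebraically equal to k4 / k2**2
G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(G2, pd.Series(x).kurtosis())
```

For n = 5000 the correction is tiny, which is why the question's hand-computed value minus 3 already lands close to the pandas result.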