random sampling from a data frame in pyspark - apache-spark-sql

My data set has 73 billion rows, and I want to apply a classification algorithm to it. I need a sample of the original data so that I can test my model.
I want to do a train-test split.
Dataframe looks like -
id age gender salary bonus area churn
1 38 m 37654 765 bb 1
2 48 f 3654 365 bb 0
3 33 f 55443 87 uu 0
4 27 m 26354 875 jh 0
5 58 m 87643 354 vb 1
How can I take a random sample using PySpark so that the ratio of my dependent (churn) variable does not change?
Any suggestions?

You will find examples in the linked documentation.
Spark supports Stratified Sampling.
# an RDD of any key value pairs
data = sc.parallelize([(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')])
# specify the exact fraction desired from each key as a dictionary
fractions = {1: 0.1, 2: 0.6, 3: 0.3}
approxSample = data.sampleByKey(False, fractions)
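If your data is already in a DataFrame (as in the question), the same idea can be applied directly with DataFrame.sampleBy, stratifying on the churn column so its ratio is preserved. A minimal sketch; the fractions and seed are illustrative:
# keep roughly 10% of each churn class so the churn ratio stays the same
fractions = {0: 0.1, 1: 0.1}
stratified_sample = df.sampleBy("churn", fractions=fractions, seed=42)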
You can also use the TrainValidationSplit
For example:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare training and test data.
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)

lr = LinearRegression(maxIter=10)

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine the best model
# using the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01])\
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()

# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)

# Make predictions on test data. model is the model with the combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
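Since the question is about classification on the churn column rather than regression, a hedged adaptation of the same pattern could look like the sketch below (it assumes the feature columns have already been assembled into a "features" vector column, e.g. with VectorAssembler):
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# churn as the label, other columns assembled into "features" beforehand
lr = LogisticRegression(maxIter=10, labelCol="churn", featuresCol="features")
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=BinaryClassificationEvaluator(labelCol="churn"),
                           trainRatio=0.8)
model = tvs.fit(train)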

To see a sample of the original data, we can use sample in Spark:
df.sample(fraction).show()
The fraction should be between 0.0 and 1.0.
Example:
# run this command repeatedly, it will show different samples of your original data.
df.sample(0.2).show(10)
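If you need the sample to be reproducible (for example, to reuse the same rows as a test set), you can also pass a seed; the values below are illustrative:
df.sample(fraction=0.2, seed=42).show(10)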

Related

Tensorflow dataset of sliding windows keeping track of index

I have a dataframe which contains time series data: for the sake of simplicity, let's say that the index is my "datetime", or simply the element that establishes the order of the data. Columns a and b are real numbers; I set them equal to the index just to illustrate the problem.
import pandas as pd
import numpy as np
import tensorflow as tf
data = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100)})
print(data)
Which outputs:
a b
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
.. .. ..
95 95 95
96 96 96
97 97 97
98 98 98
99 99 99
Then, I proceed to create a dataset of sliding windows over the time series dataframe:
data = np.array(data, dtype=np.float32)
ds = tf.keras.utils.timeseries_dataset_from_array(data=data, targets=None,
                                                  sequence_length=6,
                                                  sequence_stride=6,
                                                  sampling_rate=1,
                                                  shuffle=True,
                                                  batch_size=None,
                                                  seed=1)
for i in ds.take(3):
    print(i)
Which outputs:
tf.Tensor( [[84. 84.]
[85. 85.]
[86. 86.]
[87. 87.]
[88. 88.]
[89. 89.]], shape=(6, 2), dtype=float32)
tf.Tensor(
[[30. 30.]
[31. 31.]
[32. 32.]
[33. 33.]
[34. 34.]
[35. 35.]], shape=(6, 2), dtype=float32)
tf.Tensor(
[[54. 54.]
[55. 55.]
[56. 56.]
[57. 57.]
[58. 58.]
[59. 59.]], shape=(6, 2), dtype=float32)
As you can see, each matrix is "datetime" ordered (sequence_length=6) and the matrices do not overlap (sequence_stride=6). I would like to keep track of the initial index. In other words, I want to be able to, say, extract the matrix with shape=(6, 2) that corresponds to the index values K:K+6. I know I could do this directly from the initial dataframe, but this is just a simplified version of a bigger problem: I am trying to replicate the "Data windowing" section of this Tensorflow tutorial so that I can plot exactly the dates that I want, rather than random dates.
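A sketch of one way to keep track of the starting index, assuming an extra column is acceptable: prepend the row index as an additional feature so every window carries the position it was cut from, then drop that column before modelling.
import numpy as np
import tensorflow as tf

data = np.stack([np.arange(100), np.arange(100)], axis=1).astype(np.float32)
# prepend the row index so each window remembers where it starts
indexed = np.concatenate([np.arange(len(data), dtype=np.float32)[:, None], data], axis=1)
ds = tf.keras.utils.timeseries_dataset_from_array(data=indexed, targets=None,
                                                  sequence_length=6, sequence_stride=6,
                                                  shuffle=True, batch_size=None, seed=1)
for window in ds.take(3):
    start = int(window[0, 0])   # original index of the first row in this window
    features = window[:, 1:]    # drop the index column before feeding a model
    print(start, features.shape)
Windows whose first value is K then correspond to index values K:K+6, which makes it possible to pick out the exact dates to plot.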

How to convert a mimira object (Cox regression model, from multiple imputations and a propensity score matching (MatchThem pkg)) into a Forest plot

Dear StackOverflow community,
as a surgeon who has been enthusiastically teaching himself R for six months (StackOverflow, and so many websites), I beg your indulgence for the triviality of my concern.
The background:
Briefly, my objective is to run a Cox survival regression model on a dataset of cancer patients. Because of the retrospective design, I planned 1:3 matching with propensity score matching (PSM). Missing data were handled with multiple imputation ("mice" pkg). The PSM was managed with the "MatchThem" pkg.
I used the "survey" pkg for pooling the survival models (svycoxph() pooled through the with() function). This yields a mimira object, which I can easily print as a nice table with tbl_regression ("gtsummary" pkg).
The issue:
As I usually print my Cox regressions as a hazard ratio table plus a graphical version (forest plot with ggforest(), from the "survminer" pkg), this time I am really stuck. The ggforest function doesn't recognize the mimira object as a "coxph" object and throws this error:
Error in ggforest(tbl_regression_object, data = mimira_object) :
inherits(model, "coxph") is not TRUE
I guess that adding PSM on top of my multiple imputations is the problem, as I had no problem producing a forest plot for a Cox regression from multiple imputations alone (ggforest can deal with mira objects without problems via the pool_and_tidy_mice() function).
Here is the script:
#Data
library(fabricatr)
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
  N = 2000,
  age = runif(N, min = 18, max = 85),
  is_female = draw_binary(prob = 0.5, N = N),
  is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
  disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
  treatment = draw_binary(prob = 0.5, N = N),
  kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
  lambdas = 0.1, gammas = 1.8,
  x = participant_data,
  betas = c(is_female = -0.2, is_smoker = 1.2,
            treatment = -0.4, kps = -0.005,
            disease_stage = 0.2),
  maxt = 5)
# Merging df
library(dplyr)
mydata_complete <- bind_cols(survival_data, participant_data)
# generating missing value
library(missMethods)
mydata_uncomp <- delete_MCAR(mydata_complete, 0.3)
mydata <- mydata_uncomp
#1 imputation with "mice"
library(mice)
mydata$nelsonaalen <- nelsonaalen(mydata, eventtime, status)
mydata_mice_imp_m3 <- mice(mydata, maxit = 2, m = 3, seed = 20200801) # m=3 is for testing
#2 matching (PSM 1:3) with "MatchThem"
library(MatchThem)
mydata_imp_m3_psm <- matchthem(treatment ~ age + is_female + disease_stage, data = mydata_mice_imp_m3, approach = "within", ratio = 1, method = "optimal")
#3 Pooling Coxph models in multiple imputed datasets and PSM with "survey"
library(survey)
mimira_object <- with(data = mydata_imp_m3_psm, expr = svycoxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage))
pool_and_tidy_mice(mimira_object, exponentiate = TRUE, conf.int=TRUE) -> pooled_imp_m3_cph
# Estimates with pool_and_tidy_mice() work with the mimira_object but cannot give me the degrees of freedom. Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_imp_m3_cph
term estimate std.error statistic p.value conf.low conf.high b df dfcom fmi lambda m riv ubar
1 age 0.9995807 0.001961343 -0.2138208 NaN NaN NaN 1.489769e-06 NaN Inf NaN 0.5163574 3 1.067643 1.860509e-06
2 is_smoker 2.8626952 0.093476026 11.2516931 NaN NaN NaN 4.182884e-03 NaN Inf NaN 0.6382842 3 1.764601 3.160589e-03
3 disease_stage 1.2386947 0.044092483 4.8547535 NaN NaN NaN 8.995628e-04 NaN Inf NaN 0.6169374 3 1.610540 7.447299e-04
#4 Table summary of the pooled results
library(gtsummary)
tbl_regression_object <- tbl_regression(mimira_object, exp = TRUE, conf.int = TRUE) # 95% CI and p-value are missing due to another issue in the pooling of the mimira_object. The MatchThem:::get.2dfcom function gives dfcom = 999999 (another issue to be solved in my concern)
#5 What it should looks like as graphical summary
library(survival)
mydata.cox <- coxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage, mydata_uncomp) # (df mydata_uncomp is without imputation and PSM)
#with gtsummary
forestGT <-
  mydata.cox %>%
  tbl_regression(exponentiate = TRUE,
                 add_estimate_to_reference_rows = TRUE) %>%
  plot()
(forestGT) # See picture GT_plot1. Almost perfect. Would have been great to know how to add N, 95% CI, HR, p-value and parameters of the model (AIC, events, concordance, etc.)
#with survminer
HRforest <-
survminer::ggforest(mydata.cox, data = mydata_uncomp)
(HRforest) # See picture Ggforest. Everything I need to know about my cox regression is in there. For me it is just a great Cox regression forest plot.
#6 Actually what happens when I do the same thing with imputed and matched df
#with gtsummary
forestGT_imp_psm <-
  mimira_object %>%
  tbl_regression(exponentiate = TRUE,
                 add_estimate_to_reference_rows = TRUE) %>%
  plot() # WARNING message: In get.dfcom(object, dfcom) : Infinite sample size assumed.
(forestGT_imp_psm) # See picture GT_plot2. The plot is rendered but without the 95% CI
#with survminer
HRforest_imp_psm <-
ggforest(mimira_object, data = mydata_imp_m3_psm) # ERROR:in ggforest(mimira_object, data = mydata_imp_m3_psm) : inherits(model, "coxph") is not TRUE
(HRforest_imp_psm)
#7 The lucky and providential step
# your solution/advice
Would greatly appreciate your help.
cheers.
AK
Pictures (not allowed to embed images in this post; share links provided instead): GT_plot1, Ggforest_plot, GT_plot2.
It seems that there are two distinct problems here:
Problem #1: getting gtsummary() to produce a table with p-values and confidence intervals for the pooled, matched data.
Problem #2: producing a ggforest() plot of the pooled estimates.
Problem #1:
Let us follow the instructions in the paper "MatchThem:: Matching and Weighting after Multiple Imputation" (https://arxiv.org/ftp/arxiv/papers/2009/2009.11772.pdf) [page 15]
and modify your block #3. Instead of calling pool_and_tidy_mice() we do the following:
matched.results <- pool(mimira_object)
summary(matched.results, conf.int = TRUE)
This produces the following:
term estimate std.error statistic df p.value 2.5 % 97.5 %
1 age -0.0005997864 0.001448251 -0.4141453 55.266353 6.803707e-01 -0.003501832 0.00230226
2 is_smoker 1.1157796620 0.077943244 14.3152839 9.961064 5.713387e-08 0.942019234 1.28954009
3 disease_stage 0.2360965310 0.051799813 4.5578645 3.879879 1.111782e-02 0.090504018 0.38168904
This means that performing the imputation with mice and then matching with MatchThem works, since you do get the p values and the confidence intervals.
Compare to the output from pool_and_tidy_mice():
term estimate std.error statistic p.value b df dfcom fmi lambda m
1 age -0.0005997864 0.001448251 -0.4141453 NaN 2.992395e-07 NaN Inf NaN 0.1902260 3
2 is_smoker 1.1157796620 0.077943244 14.3152839 NaN 2.041627e-03 NaN Inf NaN 0.4480827 3
3 disease_stage 0.2360965310 0.051799813 4.5578645 NaN 1.444843e-03 NaN Inf NaN 0.7179644 3
riv ubar
1 0.2349124 1.698446e-06
2 0.8118657 3.352980e-03
3 2.5456522 7.567636e-04
Where everything is the same except for df and p.value which were not calculated in the latter table.
I therefore think this is an issue with the pool_and_tidy_mice() and you should post this as an issue on GitHub at gtsummary.
For right now, you can bypass this problem by changing svycoxph() to survival::coxph() in block #3 when you call the with() function. If you do that, then eventually you will get a gtsummary table with p-values and confidence intervals. Ultimately, the problem is probably some interaction between svycoxph() and pool_and_tidy_mice(), which is why I believe you should post this on GitHub.
Problem #2:
The short answer is that there cannot be a ggforest plot with all the data that you are looking for.
https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/pool reads:
A common error is to reverse steps 2 and 3, i.e., to pool the multiply-imputed data instead of the estimates. Doing so may severely bias the estimates of scientific interest and yield incorrect statistical intervals and p-values. The pool() function will detect this case.
This means that there is no "real" dataset for the pooled estimates (i.e. you cannot really combine the datasets for imputations 1-3), which means that ggforest() cannot compute the desired plot (since it needs to have a dataset and that cannot be used because it would lead to erroneous estimates).
What you could do, is present all the ggforest plots for each imputation (so if you did 3 imputations, you will get 3 slightly different ggforest plots) and finally add the pooled estimates plot by using plot() as suggested above.
To create each ggforest plot you need the following line of code:
ggforest(mimira_object$analyses[[1]], complete(mydata_imp_m3_psm, 1))
This will create the ggforest plot for your first imputation. Change the numbers to 2 and 3 to check the remaining imputations.
I hope this helped,
Alex
If you provide a reproducible example (i.e. an example on a data set that we can all run on our machines), we can better help you out.
The gtsummary package exports a plot() method you can use to construct a forest plot. Example below!
library(gtsummary)
library(survival)
ggforest <-
  coxph(Surv(ttdeath, death) ~ trt + grade, trial) %>%
  tbl_regression(exponentiate = TRUE,
                 add_estimate_to_reference_rows = TRUE) %>%
  plot()
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
ggforest
Created on 2021-08-26 by the reprex package (v2.0.1)

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary features (Y). BernoulliNB predicts the probability of X given Y in the test data based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()

# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})

# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})

# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns) - set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0

# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
df_Te_Y = TEST.drop('X', axis=1)

# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)

# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)

# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R.join(TEST)

# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu

df_S['Pr'] = [*map(get_lu(df_R), df_S.T, df_S.X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
OK, there are a couple of issues here. I have a full working example below, but first those issues, mainly the assertion that "this correctly gives a NaN for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue, which is the line df_Te_Y = TEST.iloc[:, 1:]; that line should be df_Te_Y = TEST.iloc[:, 2:] so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd

BNB = BernoulliNB()

# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})

# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                        'X1': [1,1,0,1,0,1,0,0,0],
                        'X2': [1,0,1,0,1,0,1,0,1],
                        'X3': [1,1,0,1,1,0,0,0,0],
                        'X4': [1,1,0,1,1,0,0,0,0]})

X = train_df.drop('Y', axis=1)    # Known training data - all but the 'Y' column.
Y = train_df['Y']                 # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1)  # Test data.
Y_te = test_df['Y']               # Only used to measure accuracy of prediction - if desired.

Ar_R = BNB.fit(X, Y).predict_proba(X_te)  # Can be combined into a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_  # Rename as per class labels.

# Columns are class labels and rows are observations.
# Each entry is the probability of that observation being assigned to that class label.
print(df_R)

predicted_labels = df_R.idxmax(axis=1).values  # For each row, take the column with the highest prob in that row.
print(predicted_labels)  # [1 1 3 1 3 2 3 3 3]

print(accuracy_score(Y_te, predicted_labels))  # Percent accuracy of prediction.

print(BNB.fit(X, Y).predict(X_te))  # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_labels is all we want.

# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
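To come back to the original goal of a Pr column (the probability assigned to each row's observed label, with NaN for labels absent from the training data), one possible follow-up on top of this example is the sketch below (my addition, not part of the answer above):
import numpy as np
# Probability the model assigns to each row's observed label; NaN when that label
# (e.g. 5 or 0 here) never appeared in the training data.
observed_prob = [df_R.loc[i, y] if y in df_R.columns else np.nan
                 for i, y in enumerate(Y_te)]
print(observed_prob)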
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.

train-test split of scikit learn resulting in features having only one unique value in train data

I am trying to train a multivariate linear regression model. I have a data set named 'main'. There are a few categorical variables in this dataset, which I dummified. Let's say the columns obtained after dummification are A, B, C, D and so on. Now, when I run a train-test split on this main dataset, one of these columns in the resulting train dataset contains only the value 0. How can I overcome this problem?
The code which I am using for the train-test split is:
from sklearn.model_selection import train_test_split
np.random.seed(0)
df_train, df_test = train_test_split(main, train_size = 0.7, test_size = 0.3, random_state = 100)
On running the below code :
main.columns[main.nunique() == 1]
The result is : Index([], dtype='object')
And when running the below code for train data :
df_train.columns[df_train.nunique() == 1]
The result is : Index(['A', 'D', 'S'], dtype='object')
I want the resulting train set to contain features with all combinations of values in them. However, this split is giving me only one value in some features.
Edit: I checked the unique values in these columns; they are highly unbalanced, with only one row present for the positive case. I tried stratify, and it needs at least two rows of the positive class, and this is the case for many columns. So I cannot separately include these columns in the train dataset, as that would require writing code for all the columns. I want this to be done automatically.
Have you tried changing the random_state value?
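If trying seeds by hand is too manual, one way to automate that suggestion is sketched below (main and the 70/30 split are taken from the question; the retry loop is my assumption): keep re-splitting with a different random_state until no column in the train set is constant.
from sklearn.model_selection import train_test_split

for state in range(100):
    df_train, df_test = train_test_split(main, train_size=0.7, test_size=0.3,
                                         random_state=state)
    if df_train.nunique().min() > 1:  # no train column has a single unique value
        print(f"usable split found with random_state={state}")
        break
Note that a column whose single positive row ends up in the test fold will always be constant in train; the loop simply keeps a seed where that row lands in the training fold.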

Xgboost cox survival time entry

In the new implementation of cox ph survival model in xgboost 0.81 how does one specify start and end time of an event?
Thanks
The R equivalent function would be for example :
cph_mod = coxph(Surv(Start, Stop, Status) ~ Age + Sex + SBP, data=data)
XGBoost does not allow for a start time (i.e. delayed entry). If it makes sense for the application, you can always change the underlying time scale so that all subjects start at time=0. However, XGBoost does allow for right-censored data. It seems impossible to find any documentation/example of how to implement a Cox model, but from the source code you can read: "Cox regression for censored survival data (negative labels are considered censored)."
Here is a short example for anyone who wants to try XGBoost with obj="survival:cox". We can compare the results to the scikit-learn survival package sksurv. To make XGBoost more similar to that framework, we use a linear booster instead of a tree booster.
import pandas as pd
import xgboost as xgb
from sksurv.datasets import load_aids
from sksurv.linear_model import CoxPHSurvivalAnalysis
# load and inspect the data
data_x, data_y = load_aids()
data_y[10:15]
Out[586]:
array([(False, 334.), (False, 285.), (False, 265.), ( True, 206.),
(False, 305.)], dtype=[('censor', '?'), ('time', '<f8')])
# Since XGBoost only allows one column for y, the censoring information
# is coded as negative values:
data_y_xgb = [x[1] if x[0] else -x[1] for x in data_y]
data_y_xgb[10:15]
Out[3]: [-334.0, -285.0, -265.0, 206.0, -305.0]
data_x = data_x[['age', 'cd4']]
data_x.head()
Out[4]:
age cd4
0 34.0 169.0
1 34.0 149.5
2 20.0 23.5
3 48.0 46.0
4 46.0 10.0
# Since sksurv outputs log hazard ratios (here relative to 0 on the predictors)
# we must use 'output_margin=True' for comparability.
estimator = CoxPHSurvivalAnalysis().fit(data_x, data_y)
gbm = xgb.XGBRegressor(objective='survival:cox',
                       booster='gblinear',
                       base_score=1,
                       n_estimators=1000).fit(data_x, data_y_xgb)
prediction_sksurv = estimator.predict(data_x)
predictions_xgb = gbm.predict(data_x, output_margin=True)
d = pd.DataFrame({'xgb': predictions_xgb,
                  'sksurv': prediction_sksurv})
d.head()
Out[13]:
sksurv xgb
0 -1.892490 -1.843828
1 -1.569389 -1.524385
2 0.144572 0.207866
3 0.519293 0.502953
4 1.062392 1.045287
d.plot.scatter('xgb', 'sksurv')
Note that these are predictions on the same data that was used to fit the model. It seems that XGBoost gets the values right, but sometimes up to a linear transformation. I do not know why. Play around with base_score and n_estimators. Perhaps someone can add to this answer.
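To make the "linear transformation" observation concrete, a small check one could add (my assumption, not part of the original answer) is to fit a line between the two prediction vectors and look at their correlation:
import numpy as np
# slope/intercept of sksurv predictions as a linear function of the XGBoost margins,
# plus their correlation; a correlation near 1 supports the linear-transformation guess
slope, intercept = np.polyfit(predictions_xgb, prediction_sksurv, 1)
corr = np.corrcoef(predictions_xgb, prediction_sksurv)[0, 1]
print(slope, intercept, corr)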