Should time and status in survival analysis be included in the imputation process of missForest? - missing-data

In regression and classification tasks, it seems common to include the response along with the other covariates when using missForest to impute the covariates. But what about the time and status variables in survival analysis? Should they be included, and is there any way to account for censoring?

Related

How to train an LSTM on multiple time series data, for both the univariate and multivariate scenarios?

I have data for hundreds of devices (pardon me for not giving much detail about the devices or the data recorded for them). For each device, data is recorded hourly.
Each record has 25 dimensions.
I have a few prediction tasks, one of which is
time series forecasting,
for which I am using an LSTM. Because I have hundreds of devices, and each device is a (multivariate) time series, my data as a whole is a set of multiple time series with multivariate data.
To deal with multiple time series, my first approach is to concatenate the series one after another, treat the result as a single time series (univariate or multivariate), and train the LSTM on it.
But by concatenating the series I am actually losing the time property of my data, so I need a better approach.
Please suggest some ideas or blog posts.
Kindly don't confuse multiple time series with multivariate time series data.
You may consider a one-fits-all model or Seq2Seq, as e.g. this Google paper suggests. The approach works as follows:
Let us assume that you want to make a 1-day-ahead forecast (24 values) using the last 7 days (7 * 24 = 168 values) as input.
In time series analysis the data is time dependent, so you need a validation strategy that respects this dependence, e.g. a rolling forecast approach. Set aside separate hold-out data for testing your final trained model.
In the first step you generate, out of your many time series, slices of length 168 + 24 = 192 (see the Google paper for an image). The x input has length 168 and the y target has length 24. Use all of the generated slices to train the LSTM/GRU network, and finally predict on your hold-out set.
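The slicing step can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the `step` parameter and the toy device data are made up here, and each series is assumed to be a list of hourly readings (for multivariate data, each element could be a 25-value row and the slicing works the same way). Windows are cut per device, so no window straddles two devices, which is what the plain concatenation approach would allow.

```python
def make_windows(series, n_in=168, n_out=24, step=1):
    """Slice one series into (x, y) pairs: n_in past values, n_out future values."""
    pairs = []
    last_start = len(series) - (n_in + n_out)
    for start in range(0, last_start + 1, step):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        pairs.append((x, y))
    return pairs


def make_training_set(all_series, n_in=168, n_out=24, step=1):
    """Pool the windows of every device; windows never cross device boundaries."""
    xs, ys = [], []
    for series in all_series:
        for x, y in make_windows(series, n_in, n_out, step):
            xs.append(x)
            ys.append(y)
    return xs, ys


# Toy data: three hypothetical devices with 200, 192 and 100 hourly readings.
# The third device is too short for a single 192-value slice and yields none.
devices = [list(range(200)), list(range(192)), list(range(100))]
xs, ys = make_training_set(devices)
```

The pooled `xs` and `ys` would then be converted to arrays and fed to the network; the hold-out data should be cut off from the end of each series before slicing, so that no training window overlaps the test period.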
Good papers on this issue:
Foundations of Sequence-to-Sequence Modeling for Time Series
Deep and Confident Prediction for Time Series at Uber
more
Kaggle Winning Solution
Kaggle Web Traffic Time Series Forecasting
This list is not comprehensive, but you can use it as a starting point.

Forecasting using Multiple Regression in BigQuery

It's a pity that Google BigQuery still doesn't have a function like the forecast() we see in spreadsheets. Don't look down on spreadsheets yet: given the statistical know-how, a surprising amount of smoothing and seasonality can be added to forecasting there.
BigQuery lets you compute standard deviation, correlation and intercept metrics. Using those, one can build a prediction model (refer to this and this). But that is a simple linear regression model, so we are not happy with the seasonality aspect. The question is: how can we construct a multiple regression model for prediction in BigQuery?
If Y = a1x1 + a2x2 + a3x3 + c, do we determine a1, a2 and a3 separately and finally join the queries? And what about the intercept: how do we calculate it for multiple regression in BigQuery?
Any contribution will be greatly appreciated...
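On the intercept question above: in ordinary least squares the slopes are generally not found one at a time (unless the predictors are uncorrelated). They are solved jointly from the normal equations, and the intercept then follows from the means. A minimal sketch, assuming only two predictors for brevity (the question has three) and written in Python rather than BigQuery SQL; the function name and the toy data are made up for illustration:

```python
def fit_two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = a1*x1 + a2*x2 + c from means and sums of products."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (t - my) for a, t in zip(x1, y))
    s2y = sum((b - m2) * (t - my) for b, t in zip(x2, y))

    # Solve the 2x2 normal equations jointly (Cramer's rule).
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    a1 = (s1y * s22 - s2y * s12) / det
    a2 = (s2y * s11 - s1y * s12) / det

    # The intercept makes the fitted plane pass through the point of means.
    c = my - a1 * m1 - a2 * m2
    return a1, a2, c


# Toy data generated exactly from y = 2*x1 - 3*x2 + 5.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [2 * a - 3 * b + 5 for a, b in zip(x1, x2)]
a1, a2, c = fit_two_predictor_ols(x1, x2, y)
```

Everything here is built from means and sums of products, so the same quantities could in principle be produced by aggregate queries and combined afterwards. With three or more predictors the normal equations become a larger linear system, but the intercept formula stays the same.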

Error propagation in a Bayesian analysis of a Markov chain

I'm analysing longitudinal panel data, in which individuals transition between different states in a Markov chain. I'm modelling the transition rates between states using a series of multinomial logistic regressions. This means that I end up with a very large number of regression slopes.
For each regression slope, I obtain a posterior distribution (using WinBUGS). From the posterior distribution, we get the mean, standard deviation, and 95% credible interval associated with the slope in question.
The value I am ultimately interested in is the expected first passage time ('hitting time') through the Markov chain. This is a function of all the different predictor variables, and so is built from the many regression slopes produced by the multinomial logistic regressions.
A simple approach would be to take the mean of each posterior distribution as a point-estimate for each regression slope, and solve for the expected first passage time at a series of different values of the predictor variables. I have now done this, but it is potentially misleading because it doesn't show the uncertainty around the predicted values of expected first passage time.
My question is: how can I calculate a credible interval for the expected first passage time?
My first thought was to approximate the error via simulation, by sampling individual values for the regression slopes from each posterior distribution, obtaining the expected first passage time given those values, and then plotting the standard deviation of all these simulated values. However, I feel like (a) this would make a statistician scream and (b) it doesn't take into account the fact that different posterior distributions will be correlated (it samples from each one independently).
In WinBUGS, you can actually obtain the correlations between the posterior distributions. So if the simulation idea is appropriate, I could in theory simulate the regression slope coefficients incorporating these correlations.
Is there a more direct and less approximate way to find the uncertainty? Could I, for instance, use WinBUGS to find the posterior distribution of the expected first passage time for a given set of values of the predictor variables? Rather like the answer to this question: define a new node and monitor it. I would imagine defining a series of new nodes, where each one is for a different set of actual predictor values, and monitoring each one. Does this make good statistical sense?
Any thoughts about this would be really appreciated!
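The simulation idea above can respect the correlations between slopes if the expected first passage time is computed once per saved joint MCMC draw, rather than by sampling each slope independently; this is essentially what monitoring a derived node would do. A minimal sketch under stated assumptions: a discrete-time chain, a target state that is reachable from every other state, and one row-stochastic transition matrix per draw (in practice it would be built from that draw's slopes and the chosen predictor values). All names and the toy two-state draws are hypothetical.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def expected_hitting_time(p, start, target):
    """Expected number of steps to first reach `target` from `start` (start != target)."""
    others = [s for s in range(len(p)) if s != target]
    # Solve (I - Q) t = 1, where Q is p restricted to the non-target states.
    a = [[(1.0 if i == j else 0.0) - p[i][j] for j in others] for i in others]
    t = solve_linear(a, [1.0] * len(others))
    return t[others.index(start)]


def credible_interval(values, level=0.95):
    """Equal-tailed interval from the empirical quantiles of the draws."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Toy stand-in for saved joint posterior draws: a two-state chain whose
# transition probability q varies from draw to draw.
rng = random.Random(0)
draws = [[[1 - q, q], [0.0, 1.0]]
         for q in (rng.betavariate(20, 60) for _ in range(2000))]
times = [expected_hitting_time(p, start=0, target=1) for p in draws]
low, high = credible_interval(times)
```

Because each matrix comes from a single joint draw, whatever posterior correlation exists between the slopes is carried through to `times` automatically, and no separate correlation matrix is needed.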

How to test a machine learning model?

I want to develop a framework (for QA testing purposes) that validates a machine learning model. I have had a lot of discussions with my peers and read articles found through Google.
Most of the discussions and articles say that a machine learning model will evolve with the test data that we provide. Correct me if I'm wrong.
How feasible is it to develop a framework that validates that a machine learning model gives accurate results?
A few ways to test the model, from the articles I read: split and multi-split techniques, metamorphic testing.
Please also suggest any other approaches.
QA testing of ML-based software requires additional, and rather unconventional, tests because oftentimes its outputs for a given set of inputs are not defined, deterministic, or known a priori, and it produces approximations rather than exact results.
QA may be designed to test against:
naive but predictable benchmark methods: the average method in forecasting, the class-frequency-based classifier in classification, etc.
sanity checks (the outputs being feasible/rational): e.g., is the predicted age positive?
preset objective acceptance levels: e.g., is its AUCROC > 0.5?
extreme/boundary cases: e.g., thunderstorm conditions for a weather forecast model.
bias-variance tradeoff: what is its performance on in-sample and out-of-sample data? K-Fold cross-validation is useful here.
the model itself: is the coefficient of variation of its performance measure (e.g., AUCROC) from n runs on the same data for same/random train and test partitioning within a reasonable bound?
Some of these tests need performance measures. Here is a comprehensive library of them.
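Two of the checks listed above, the naive benchmark and the stability of a performance measure across runs, can be sketched with the standard library alone. This is an illustrative sketch, not any particular library's API: the function names, the `margin` parameter and the toy labels and scores are assumptions, and accuracy stands in for whichever performance measure is actually used.

```python
from collections import Counter
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class_baseline(y_train, n):
    """Class-frequency benchmark: always predict the most common training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


def beats_baseline(y_train, y_test, y_pred, margin=0.0):
    """True if the model is more accurate than the naive benchmark by `margin`."""
    baseline_pred = majority_class_baseline(y_train, len(y_test))
    return accuracy(y_test, y_pred) > accuracy(y_test, baseline_pred) + margin


def coefficient_of_variation(scores):
    """Spread of a performance measure across repeated runs, relative to its mean."""
    return stdev(scores) / mean(scores)


# Toy labels and scores, made up for illustration.
y_train = [0, 0, 0, 1]
y_test = [0, 1, 1, 0]
y_pred = [0, 1, 1, 1]
model_ok = beats_baseline(y_train, y_test, y_pred)

run_scores = [0.80, 0.82, 0.78, 0.81]  # e.g. the same measure from four runs
stable = coefficient_of_variation(run_scores) < 0.05  # the bound is a judgment call
```

In a QA suite each of these would become an assertion with a threshold agreed on in advance.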
I think the data flow is actually what needs to be tested here: raw input, manipulation, test output and predictions. For example, if you have a simple linear model, you want to test the predictions produced by that model rather than its coefficients. So the high-level steps might be summarized as follows:
Raw Input: Does the raw input make sense? Before you start manipulating, you need to be sure the raw data values are within the expected limits. For example, if you normally see 5-10% NA rate in some data, having 95% NA rate in a new batch might be an indicator that something is wrong.
Train/Predict Ready Input: Whether you are training a new model or feeding new data into an already trained model for prediction, you probably want to be sure that the manipulated data makes sense, too. Some ML algorithms are sensitive to data anomalies. You don't want to predict a credit score in the thousands just because of some anomalies in the input.
Model Success: By this time, you should have some idea of how well your model is doing, so you can measure its performance on new test data. You can also check that the train and test scores are not significantly different (a large gap suggests overfitting). If you're retraining, you can compare against the previous training scores. Or you can set aside a test set and compare its score.
Predictions: Finally, you need to be sure your final output makes sense before delivering it to production/clients. For example, if you're forecasting revenue for a very small shop, the daily revenue predictions can't be millions of dollars or negative amounts.
Full disclosure, I wrote a small Python package for this. You can check it out here or install it as below:
pip install mlqa
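Independent of any package, the raw-input and prediction checks described in the steps above can be sketched in plain Python. The function names, thresholds and toy values below are assumptions chosen for illustration and are not the API of the package mentioned above.

```python
def na_rate(values):
    """Share of missing entries; None and NaN both count as missing."""
    missing = sum(1 for v in values if v is None or v != v)
    return missing / len(values)


def na_rate_ok(values, max_rate=0.10):
    """Raw-input check: the NA rate should stay within its usual upper limit."""
    return na_rate(values) <= max_rate


def out_of_range(predictions, low, high):
    """Prediction check: indices of predictions outside the plausible range."""
    return [i for i, p in enumerate(predictions) if not (low <= p <= high)]


# Toy batch: 2 missing values out of 20.
batch = [1.0, None, 3.0, float("nan")] + [2.0] * 16
batch_ok = na_rate_ok(batch)

# Toy daily revenue forecasts for a small shop, with an assumed plausible
# range of 0 to 10,000.
forecasts = [120.0, -5.0, 2_000_000.0, 300.0]
suspicious = out_of_range(forecasts, low=0.0, high=10_000.0)
```

A batch with a 95% NA rate, as in the example above, would fail `na_rate_ok`, and any non-empty `suspicious` list would block delivery until someone has looked at the flagged predictions.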

SSAS - Classification - How To Split Data into: Training Set - Validation Set - Test Set

I have a set of 300,000 records of historical customer purchase data. I have started an SSAS data mining project to identify the best customers.
The split of data:
-90% non-buyers
-10% buyers
I have used various SSAS algorithms to explore my data (decision trees and neural networks showed the best lift).
The goal of the project is to identify/score customers according to who is most likely to buy a product.
Currently, I have used all my records for this purpose. It feels like something is missing in the project. I am reading two books about data mining now. Both of them talk about splitting the data into different sets; however, neither of them explains HOW to actually split it.
I believe I need to split my records into 3 sets and re-run the SSAS algorithms.
Main questions:
How do I split the data into training, validation and test sets?
1.1 What ratio of buyers and non-buyers should be in the training set?
How do I score my customers from most likely to least likely to buy a product?
The split could be done randomly, as your data set is big and the proportion of buyers is not too low (10%). However, if you want to be sure that your sets are representative, you could take 80% of your buyer samples and 80% of your non-buyer samples and mix them to build a training set. It then contains 80% of your total data set and has the same ratio of buyers to non-buyers as the original data set, which makes the subsets representative. You may want to divide your data set into three subsets rather than two: training, cross-validation and test. If you use a neural network, as you said, you should use the cross-validation subset to tune your model (weight decay, learning rate, momentum...).
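The per-class sampling described above is usually called a stratified split. A minimal sketch in plain Python, assuming a 60/20/20 split into training, cross-validation and test sets (the answer only fixes the 80% figure for the two-set case, so these fractions are an assumption); the function name and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return index lists (train, validation, test) that preserve class ratios."""
    rng = random.Random(seed)

    # Group record indices by class (e.g. 1 = buyer, 0 = non-buyer).
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the same fractions from each.
        rng.shuffle(indices)
        n_train = int(round(fractions[0] * len(indices)))
        n_valid = int(round(fractions[1] * len(indices)))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test


# Toy data with the same 10% / 90% balance as in the question.
labels = [1] * 100 + [0] * 900
train, valid, test = stratified_split(labels)
buyer_share_train = sum(labels[i] for i in train) / len(train)
```

Each subset ends up with the same share of buyers as the full data set, so lift and other measures computed on the validation and test sets are comparable with those from training.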
Regarding your second question, you could use a neural network, as you said, and take its output as the probability; it will be in the range [0, 1] if you use a sigmoid as the activation function in the output layer. I would also recommend taking a look at collaborative filtering for this task, because it would help you find out which products a customer may be interested in, using your knowledge of other buyers with similar preferences.