Do you split your data into 2 subsets (train, hold-out/test) when using cross validation or not?

I know there are several resampling methods to avoid overfitting. In many tutorials and books (e.g. Introduction to Statistical Learning, Ch. 5, Resampling Methods) I see either train and validation splits or k-fold cross validation being used. However, there are also many tutorials that combine both methods (e.g. split the data into 3 parts: training, validation, test; or split into train and test and use cross validation only on the train subset).
This confuses me quite a bit. As I understood it, cross validation is a repeated train/validation split. This way you avoid evaluating the performance of your model(s) on only one split, which by chance might be easy (or super hard) to predict and therefore would not give you an accurate estimate of the actual performance of your model(s). By repeating the process of splitting the data into train and validation several times and averaging the performance metrics over all these splits, you get a better view of the real performance of the model.
So why do some people split into train and test subsets, then use cross validation only on the train subset and leave the test data totally hidden from the model? To me this seems "wrong", because as I understand it now, it looks like you are reintroducing the problem cross validation is trying to solve in the first place. Or am I missing something?
Would it not be better to use only k-fold cross validation (which is basically a repeated split into train and test subsets) on all the data, instead of splitting into train and test subsets and cross validating only on the train subset?
Thanks

Related

How to structure multi-output bayesian optimization

I am trying to use bayesian optimization for a multi-output problem, but am not 100% sure the best way to set it up.
I have a small number of inputs (5) and outputs (3-4) in my problem. For each output, I have a target value I would like to achieve. Ultimately I would like to minimize the MSE between the target vector (of 3-4 outputs) and the true outputs.
The simplest way to do this in my mind is to create a single model which models the MSE as a function of the problem inputs. Here, all historical data is first compressed into the single MSE, then this is used to train the GP.
However, I would instead like to create individual models (or a combined, multi-output model) that directly model the outputs of interest, instead of the ultimate cost function (MSE). Primarily, this is because I have noticed more accurate predictive results (of the combined MSE) when first modeling the individual outputs and then computing an MSE, instead of directly modeling the MSE.
My problem arises when creating the acquisition function when I have multiple outputs. Ideally, I'd like to use expected improvement (EI) as my acquisition function. However, I'm not sure how to either 1) combine the multiple output distributions into a single distribution (representing the probability of the combined MSE), which can then be used to determine overall EI or 2) how to combine multiple EI values into a single metric (i.e. combine the E.I. for each output into a unified E.I.).
When reading about multi-output BO, the most common approach seems to be to identify a frontier of solutions, however this is not 100% applicable, as ultimately I can convert the output vector into a single MSE (and the frontier becomes a point).
Is the best approach simply to model the combined MSE directly? Or is there a way that I can model the individual outputs, then combine these modeled outputs into a reasonable acquisition function?

What is the reason for very high variations in val accuracy for multiple model runs?

I have a 2-layer neural network that I'm training on about 10000 features (genomic data) with about 100 samples in my data set. I realized that any time I run my model (i.e. compile & fit) I get varying validation/testing accuracies, even if I leave the train/test/validation split untouched. Sometimes it's around 70%, sometimes around 90%.
Due to the stochastic nature of the NN I anticipate some variation but could these strong fluctuations be a sign of something else?
The reason why you're seeing such a big instability with your validation accuracy is because your neural network is huge in comparison to the data you train it on.
Even with just 12 neurons per layer, you still have 12 * 10000 + 12 = 120012 parameters in your first layer. Now think about what the neural network does under the hood. It takes your 10000 inputs, multiplies each input by some weight, and then sums the results. Now you provide it with only 64 training examples, from which the training algorithm is supposed to decide what the correct input weights are. Just based on intuition, from a purely combinatorial perspective there is going to be a large number of weight assignments that do well on your 64 training samples. And you have no guarantee that the training algorithm will pick a weight assignment that also does well on your out-of-sample data.
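The parameter count above can be checked with a few lines of Python. This is only a minimal sketch: the helper name dense_params is hypothetical, and it assumes a fully connected layer with one weight per input per neuron plus one bias per neuron.

```python
def dense_params(n_inputs: int, n_neurons: int) -> int:
    """Weights plus one bias per neuron in a fully connected layer."""
    return n_inputs * n_neurons + n_neurons


first_layer = dense_params(10_000, 12)
print(first_layer)       # 120012
print(first_layer / 64)  # 1875.1875 parameters per training example
```

With roughly 1875 free parameters per training example in the first layer alone, many different weight assignments can fit the training data equally well.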
A neural network is able to represent a wide variety of functions (it has been proven that, under certain assumptions, it can approximate any function; this is called universal approximation). To select the function you want, you provide the training algorithm with data that constrains the space of all possible functions the network can represent to a subspace of functions that fit your data. However, such a function is in no way guaranteed to represent the true underlying relationship between the input and the output. And especially if the number of parameters is larger than the number of samples (in this case by a few orders of magnitude), you're nearly guaranteed to see your network simply memorize the samples in your training data, because it has the capacity to do so and you haven't constrained it enough.
In other words, what you're seeing is overfitting. In NNs, the general rule of thumb is that you want at least a couple of times more samples than you have parameters (look into the Hoeffding inequality for the theoretical rationale behind this), and in effect the more samples you have, the less you need to fear overfitting.
So here are a couple of possible solutions:
Use an algorithm that's more suitable for the case where you have high input dimension and low sample count, such as Kernel SVM (Support Vector Machine). With such a low sample count, it's quite possible that a Kernel SVM algorithm will achieve better and more consistent validation accuracy. (You can easily test this, they are available in the scikit-learn package, really easy to use)
If you insist on using a NN, use regularization. Given that you already have working code, this will be easy: just add kernel_regularizer to all your layers. I would try both L1 and L2 regularization (probably separately). L1 regularization tends to push weights to zero, so it might help reduce the number of parameters in your problem. L2 just tries to make all the weights small. Use your validation set to decide the best value for each regularization. You can optimize both for the best mean accuracy and for the lowest variance in accuracy on your validation data (do something like 20 training runs for each parameter value of L1 and L2 regularization; usually just trying different orders of magnitude is sufficient, e.g. 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1).
If most of your input features are not really predictive or if they are highly correlated, PCA (Principal Component Analysis) can be used to project your inputs into a much lower dimensional space (e.g. from 10000 to 20), where you'd have a much smaller neural network (I'd still use L1 or L2 regularization, because even then you'd have more weights than training samples).
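As a rough illustration of the first and third suggestions combined, here is a sketch that projects the inputs with PCA and then fits a kernel SVM, scored with cross validation. It assumes scikit-learn and NumPy are installed. The data is synthetic and only mimics the shape of the problem (100 samples, 10000 features), and the values n_components=20 and C=1.0 are untuned placeholders, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the genomic data: 100 samples, 10000 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))
y = rng.integers(0, 2, size=100)
X[:, :20] += 1.5 * y[:, None]  # make the first 20 features informative

# Scaling and PCA live inside the pipeline, so they are refit on the
# training part of every fold and never see that fold's validation rows
model = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="rbf", C=1.0))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a direct view of the run-to-run instability the question describes.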
On a final note, the point of a testing set is to use it very sparingly (ideally only once). It should provide the final reported metric after all your research and model tuning is done. You should not optimize any values on it. You should do all of that on your validation set. To avoid overfitting on your validation set, look into k-fold cross validation.
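The workflow described in this final note (hold out a test set once, then cross validate on the remainder) could be sketched as follows. The function name holdout_then_kfold and its defaults are hypothetical, and the sketch only produces index lists; model fitting is left out.

```python
import random


def holdout_then_kfold(n_samples, test_fraction=0.2, k=5, seed=0):
    """Set aside a test set once, then build k (train, validation) folds from the rest."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    test_idx, dev_idx = indices[:n_test], indices[n_test:]
    folds = [dev_idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, val_idx))
    return test_idx, splits


test_idx, splits = holdout_then_kfold(100)
for train_idx, val_idx in splits:
    pass  # fit on train_idx, tune on val_idx; test_idx is never touched here
print(len(test_idx), [len(val) for _, val in splits])  # 20 [16, 16, 16, 16, 16]
```

Only after all tuning is finished would the chosen model be scored once on test_idx.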

Is it a good idea to mix the validation / testing data with the training data?

I am working with a large dataset (i.e. large for a single machine) with 1,000,000 examples.
I split my dataset as follows: 80% training data, 10% validation data, 10% testing data. Every time I retrain the model, I shuffle the data first, such that some of the data from the validation/testing sets ends up in the training set and vice versa.
My thinking is this:
Ideally I would want all possible available data for the model to learn. The more the better - for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples per piece - (i.e. I may potentially miss out on some crucial data that exists within the validation or testing set that the previous training set may not have accounted for.)
Shuffling prevents the training set from learning order where it is not important (at least in my particular dataset).
Here is my workflow process:
The Test Accuracy is more or less the equivalent to the Validation Accuracy (plus or minus 0.5%)
On each retrain, the results usually end up something like this: the training accuracy keeps improving (until it runs out of epochs), but the validation accuracy gets stuck at a particular percentage. I then save that model and start the retraining process again. The data is shuffled. The training accuracy drops, but the validation accuracy jumps up. The training accuracy improves until the last epoch. The validation accuracy converges downward (but stays greater than in the previous run).
See Example:
I plan on doing this until the training accuracy data reaches 99%. (Note: I used Keras-Tuner to find the best architecture/model for my particular problem)
I can't help but think, that I am doing something wrong by doing this. From my perspective, this is just the model eventually learning all 1,000,000 examples. It feels like "mild overfitting" because of the shuffling per each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?
If you mix your test/validation data with training data, you then can not evaluate your model on that data, since that data has been seen by your model. The model evaluation is done on the basis of how well it is able to make predictions/classification on data which your model has not seen (assuming that the data you are using to evaluate your model is coming from the same distribution as your training data). If you also mix your test set data with training set data, you will eventually end up with really good test set accuracy since that data has been seen by your model, but it might not perform well on new unseen data coming from the same distribution.
If you are worried about the size of your test/validation data, I suggest you reduce the size of your test/validation data further. Use 99.9% of the data for training instead of 99%. Also, the random shuffling will take care of learning almost every feature of your data.
After all, my point is: never ever evaluate your model on data it has seen before. That will always give you better results (assuming you have trained your model until it memorizes the training data). The validation data is used when you have multiple algorithms/models and you need to select one of them. Here, the validation data is used to select the model. The algorithm/model which gives good results on the validation data is selected (again, you do not evaluate your model based on validation set accuracy; it is just used for the selection of the model). Once you have selected your model based on validation set accuracy, you then evaluate it on new unseen data (called test data) and report the prediction/classification accuracy on the test data as your model accuracy.
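The select-on-validation, report-on-test procedure described above might look like this in miniature. Everything here is an illustrative assumption: the data is synthetic 1-D data, and the candidate "models" are k-nearest-neighbour classifiers with different k, chosen only because they fit in a few lines of plain Python.

```python
import random


def make_rows(n, rng):
    """Hypothetical 1-D data: label is 1 when x > 0.5, with 10% label noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        rows.append((x, y))
    return rows


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)


def accuracy(train, rows, k):
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


rng = random.Random(0)
rows = make_rows(1000, rng)
# Fixed 80/10/10 split, made once and never reshuffled between runs
train, val, test = rows[:800], rows[800:900], rows[900:]

candidates = [1, 5, 25]
val_scores = {k: accuracy(train, val, k) for k in candidates}
best_k = max(val_scores, key=val_scores.get)  # model selection uses validation only
test_accuracy = accuracy(train, test, best_k)  # reported once, at the very end
print(val_scores, best_k, test_accuracy)
```

Because the split is made once and kept fixed, no test row ever leaks into training, which is exactly what reshuffling before every retrain breaks.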

Reducing false positive in CNN (Conv1D) text classification model

I created a char-based CNN model for text classification on keras + tensorflow - mainly using Conv1D, mainly based on:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
The model is performing very well, with 80%+ accuracy on the test data set. However, I'm having a problem with false positives. One of the reasons could be that the final layer is a Dense layer with a softmax activation function.
To give an idea of how the model is performing, I train the model with data set with 31 classes with 1021 samples, the performance is ~85% on 25% test data set
However if you include false negative the performance is pretty bad (I didn't run another test data with false negative since it's pretty obvious just testing by hand) - every input has a corresponding prediction. For example a sentence acasklncasdjsandjas can result in a class ask_promotion.
Are there any best practices on how to deal with false positives in this case?
My idea is to:
Implement a noise class where samples are just a set of totally random text. However this doesn't seem to help since the noise doesn't contain any pattern thus it would be difficult to train the model
Replace softmax with something that doesn't require all output probabilities to sum to 1, so small values can stay small regardless of other values. I did some research on this, but there's not much information on changing the activation function for this specific case
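To make the second idea concrete, the sketch below compares softmax with independent per-class sigmoids on the same logits. The logit values and the 0.5 rejection threshold are made up for illustration, and the helper predict_or_reject is hypothetical. A network using sigmoid outputs would also need a matching loss (for example a per-class binary cross-entropy), so this shows only the mechanism, not a drop-in fix.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def predict_or_reject(logits, threshold=0.5):
    """Return the best class index, or None when no class is confident enough."""
    scores = sigmoid(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# All logits are low, as might happen for a nonsense input
logits = [-4.0, -3.0, -5.0]
print(softmax(logits))            # sums to 1, so one class still gets about 0.67
print(sigmoid(logits))            # every score stays below 0.05
print(predict_or_reject(logits))  # None
```

Softmax normalizes across classes, so some class always looks likely; independent sigmoids let every score stay small, which makes a "none of the above" decision possible.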
That sounds like the issue of imbalanced data, where two classes have completely different supports (the number of instances in each class). This issue is particularly crucial in the task of hierarchical classification in which some classes with a deep hierarchy tend to have much more instances than the others.
Anyway, let's simplify the issue to binary classification, and name the class with much more support Class-A and the other one with less support Class-B. Generally speaking, there are two popular ways to circumvent this issue.
Under-sampling: You fix Class-B as is. Then you sample instances from Class-A for the same amount as Class-B. Combine these instances and train your classifier with them.
Over-sampling: You fix Class-A as is. Then you sample instances from Class-B for the same amount as Class-A. The same goes with Choice 1.
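Both options can be sketched in plain Python. The class sizes (900 and 100) are invented for the example, and the over-sampling variant draws from Class-B with replacement, which is one common reading of "sample for the same amount as Class-A".

```python
import random


def undersample(majority, minority, rng):
    """Keep the minority class as is and draw an equally sized subset of the majority."""
    return rng.sample(majority, len(minority)) + list(minority)


def oversample(majority, minority, rng):
    """Keep the majority class as is and draw minority rows with replacement to match it."""
    return list(majority) + rng.choices(minority, k=len(majority))


rng = random.Random(0)
class_a = [("A", i) for i in range(900)]  # Class-A: large support
class_b = [("B", i) for i in range(100)]  # Class-B: small support

under = undersample(class_a, class_b, rng)
over = oversample(class_a, class_b, rng)
print(len(under), len(over))  # 200 1800
```

Resampling should be applied to the training rows only, so that validation and test data keep the original class balance.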
For more information, please refer to this KDNuggets page.
https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
Hope this helps. :P

Train Data & Test Data in Data science [closed]

I am relatively new to data science in Python and was exploring some data science competitions. I am getting confused by "Training Data Set" and "Test Data Set". Some projects have merged both and some have kept them separate. What is the rationale behind having two data sets? Any advice will be helpful, thanks.
"Training data" and "testing data" refer to subsets of the data you wish to analyze. If a supervised machine learning algorithm is being used to do something to your data (e.g. to classify data points into classes), the algorithm needs to be "trained".
Some examples of supervised machine learning algorithms are Support Vector Machines (SVM) and Linear Regression. They can be used to classify data or predict values for data that has many dimensions, allowing us to group data points that are similar together.
These algorithms need to be trained with a subset of the data (the "training set") being analyzed before they are used on the "test set". Essentially, the training provides an algorithm an opportunity to infer a general solution for some new data it gets presented, much in the same way we as humans train so we can handle new situations in the future.
Hope this helps!
A dataset is a list of rows and can be split into training and test segments. The reason this is done is to keep a CLEAR separation between the rows of data that are used during the training process of the code (think of them as flashcards that you use to "train" a baby to recognize objects) and the rows of data that are used for testing (when you are testing whether the baby has learned the objects). You want them to be separate in order to get an accurate score for how well the algorithm performed (e.g. the baby got 9/10 correct when tested). If you mixed the training rows and the testing rows, you wouldn't know whether the baby just memorized the training results or actually knew how to recognize 9/10 new images.
Generally, datasets are given as one set because during code execution it is good to select the training and test sets randomly by picking rows at random. That way you can run the training and the test several times and take the average. For example, the baby might get 9/10 the first time, 6/10 the next, and 7/10 the last. The average accuracy would then be 73.3%. This is a better representation than just trying it once (which, as you can see, is not completely accurate).
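A small sketch of the idea: draw a fresh random split for each run, then average the scores. The helper random_split and the dataset size are hypothetical; the three scores are simply the ones from the baby example above rather than the output of a real model.

```python
import random


def random_split(n_rows, n_test, rng):
    """Shuffle row indices and return (train indices, test indices)."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return idx[n_test:], idx[:n_test]


rng = random.Random(0)
train_idx, test_idx = random_split(100, 10, rng)  # one of several random splits

# Scores from three separate runs, taken from the example in the text
scores = [9 / 10, 6 / 10, 7 / 10]
mean_accuracy = sum(scores) / len(scores)
print(f"{mean_accuracy:.1%}")  # 73.3%
```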
The train data set is for training your model. Once the model is trained, how do you check how accurate it is? For that, we use the test data set. We usually split the available data into two pieces: one for training and one for testing.
Case 1 - when train and test datasets are merged into one - It is advised to split the whole data into train, cross-validation and test sets with ratio 60:20:20 (train:CV:test). The idea is to use train data to build the model and use CV data to test the validity of the model and parameters. Your model should never see the test data until final prediction stage. So basically, you should be using train and CV data to build the model and making it robust.
Case 2 - when train and test datasets are separate - You should split train data into train and CV data sets. Alternatively, you could perform k-fold cross-validation on train set.
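For Case 1, a 60:20:20 split could be sketched like this. The function name train_cv_test_split is hypothetical, and the fixed seed is an assumption added so that the same rows land in the same subset on every run.

```python
import random


def train_cv_test_split(rows, ratios=(60, 20, 20), seed=0):
    """Shuffle once with a fixed seed, then cut into train, CV and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    total = sum(ratios)
    n_train = len(rows) * ratios[0] // total
    n_cv = len(rows) * ratios[1] // total
    return rows[:n_train], rows[n_train:n_train + n_cv], rows[n_train + n_cv:]


train, cv, test = train_cv_test_split(range(1000))
print(len(train), len(cv), len(test))  # 600 200 200
```

For Case 2, the same function could be applied to the provided train data alone, for example with ratios=(80, 20, 0), leaving the separate test data untouched.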
In most cases, the split is done randomly. However, in cases when the data is time-dependent, then the split cannot be random.
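For time-dependent data, one common alternative is to split chronologically, so that the model is always tested on rows that come after everything it was trained on. A minimal sketch, assuming each row is a (timestamp, value) tuple and using a hypothetical helper name:

```python
def chronological_split(rows, train_percent=80):
    """Sort by timestamp and use the earliest rows for training, the latest for testing."""
    ordered = sorted(rows, key=lambda row: row[0])
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]


# Rows arrive out of order; the first element of each tuple is the day
rows = [(day, day * 0.5) for day in (5, 1, 9, 3, 7, 2, 8, 4, 10, 6)]
train, test = chronological_split(rows)
print([day for day, _ in train])  # [1, 2, 3, 4, 5, 6, 7, 8]
print([day for day, _ in test])   # [9, 10]
```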
The training set is used to build the model. It contains data that has both target and predictor variables. This is the data which the model has already seen during training, so (after finding the optimum parameters) the model gives good accuracy (or another performance metric) on it.
The test set is used to evaluate how well the model does with data outside the training set (which the model has not seen). The model developed during training is used for prediction, and the results are compared against the pre-classified data. The model should not be adjusted to minimize error on the test set; that kind of tuning belongs on a validation set.