SSAS Data Mining: Testing and Training Data Sets... please explain

Can someone explain what happens when you split up the data set for testing and training?

Put simply, the accuracy of your data mining model is evaluated by training it on the training set and then making predictions on the test set, where the actual outcome is already known.
More information on the testing and validation of data mining models (MSDN)

To be able to test the predictive analysis model you built, you need to split your dataset into two sets: training and test datasets. These datasets should be selected at random and should be a good representation of the actual population.
Similar data should be used for both the training and test datasets.
Normally the training dataset is significantly larger than the test dataset.
Using the test dataset helps you avoid errors such as overfitting.
The trained model is run against test data to see how well the model will perform.
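For a concrete picture, here is a minimal sketch of such a random split using scikit-learn; the DataFrame df, its "label" column, and the choice of classifier are illustrative placeholders rather than anything prescribed above:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X = df.drop(columns=["label"])   # df and "label" are placeholder names
    y = df["label"]

    # 80% training, 20% test, selected at random; stratify keeps the class
    # proportions similar in both sets so each is representative of the population
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = DecisionTreeClassifier().fit(X_train, y_train)  # train on the larger set
    print(model.score(X_test, y_test))                      # evaluate only on held-out data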

Is it a good idea to mix the validation / testing data with the training data?

I am working with a large dataset (large for a single machine, at least) of 1,000,000 examples.
I split my dataset as follows: 80% training data, 10% validation data, 10% testing data. Every time I retrain the model, I shuffle the data first, so that some of the data from the validation / testing set ends up in the training set and vice versa.
My thinking is this:
Ideally I would want the model to learn from all available data; the more the better, for improved accuracy.
Even though only 20% of the data is dedicated to validation and testing, that is still 100,000 examples per set, so I may be missing out on some crucial data that exists within the validation or testing set and that the previous training set did not account for.
Shuffling prevents the model from learning an ordering that is not meaningful (at least in my particular dataset).
Here is my workflow:
The test accuracy is more or less equivalent to the validation accuracy (plus or minus 0.5%).
On each retrain, the results usually end up like this: the training accuracy keeps improving (until it runs out of epochs), while the validation accuracy gets stuck at a particular percentage. I then save that model and start the retraining process again; the data gets shuffled. The training accuracy drops but the validation accuracy jumps up. The training accuracy then improves again until the last epoch, and the validation accuracy converges downward (though still higher than in the previous run).
I plan on doing this until the training accuracy reaches 99%. (Note: I used Keras-Tuner to find the best architecture/model for my particular problem.)
I can't help but think that I am doing something wrong here. From my perspective, the model is just eventually learning all 1,000,000 examples, and it feels like "mild overfitting" because of the shuffling on each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong to do it this way? If so, why should I not use this method? Is there a better way to approach this?
If you mix your test/validation data with your training data, you can no longer evaluate your model on that data, since the model has already seen it. Model evaluation is done on the basis of how well the model makes predictions/classifications on data it has not seen (assuming the data you use to evaluate the model comes from the same distribution as your training data). If you mix your test set with your training set, you will eventually end up with a really good test set accuracy, since that data has been seen by the model, but it might not perform well on new unseen data from the same distribution.
If you are worried about the size of the test/validation data eating into your training data, I suggest you shrink those sets further, e.g. train on 99.9% of the data instead of 99%. Also, the random shuffling when you create the split will take care of covering almost every feature of your data in the training set.
After all, my point is: never evaluate your model on data it has seen before. It will always give you better results (assuming you have trained your model well, to the point where it memorizes the training data).
The validation data is used when you have multiple algorithms/models and you need to select one of them. The algorithm/model which gives good results on the validation data is selected (again, you do not evaluate your model based on validation set accuracy; it is only used to select the model). Once you have selected your model based on validation set accuracy, you then evaluate it on new unseen data (called test data) and report the prediction/classification accuracy on the test data as your model accuracy.
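To make that setup concrete, here is a minimal sketch (assuming NumPy arrays X and y and an already-compiled Keras model; the names n_retrains, epochs and batch size are illustrative): split once, keep the validation and test sets fixed, and shuffle only within the training set between retrains:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    idx = rng.permutation(len(X))                # shuffle the indices once, up front
    n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))

    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]             # never touched until the very end

    for run in range(n_retrains):                # n_retrains is a placeholder
        rng.shuffle(train_idx)                   # reshuffle the training rows only
        model.fit(X[train_idx], y[train_idx],
                  validation_data=(X[val_idx], y[val_idx]),
                  epochs=10, batch_size=32)

    # report the final, unbiased accuracy on the untouched test set
    model.evaluate(X[test_idx], y[test_idx])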

Weak accuracy with testing data

I'm dealing with a data science problem, and I ran into the following issue.
I have labelled data (training data) and unlabelled data (test data), and both of them have a lot of missing values.
I worked with my data and split it into training data and validation data.
I got very good accuracy and a very small RMSE between Y_validation and the predicted values ( model.predict(X_validate) ). But when I submit my solution, the RMSE gets much bigger on the testing data!
What can I do?!
Firstly, you need labels for your test data. If your test data is not labelled, you will not be able to gauge the accuracy yourself: any error you compute will not be representative.
You need to understand that the training set contains a known output that the model learns from. The test data also has to be labelled so that, when the model returns its predictions on it, we can gauge whether the model has correctly predicted the label given to each test record.
On top of doing a train/test split, you can also do cross-validation to improve your model's performance. You can read more here: (https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)
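For illustration, a minimal cross-validation sketch with scikit-learn; the arrays X_train, y_train and the choice of RandomForestRegressor are hypothetical, not something from the question:

    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(random_state=0)

    # 5-fold CV: each fold is held out once while the model trains on the rest,
    # giving a more stable RMSE estimate than a single train/validation split
    scores = cross_val_score(model, X_train, y_train,
                             cv=5, scoring="neg_root_mean_squared_error")
    print("RMSE per fold:", -scores)
    print("Mean RMSE:", -scores.mean())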
This will happen sometimes when a model doesn't generalize well, which can happen when the model overfits the training data.
Resampling, or better sampling of the test and train data (which, as mentioned, needs to be labelled), can help you get a model that generalizes better.

Can I choose some data from the training data after doing data augmentation?

I am training a UNET for semantic segmentation but I only have 200 labeled images. Given the small size of the dataset, it definitely needs some data augmentation techniques.
I have a question about the test and validation sets.
I have a custom data generator which keeps feeding data from a folder for training the model.
So what I plan to do is:
do data augmentation for the training set and keep all of it in the same folder
"randomly" pick some of training data into test and validation set (of course, before training).
I am not sure if this is fine, since we just do some simple processing (flipping, transposing, adjusting brightness)
Would it be better to separate the data first and do the augmentation for the rest of data in the training folder?
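To make that second option concrete, here is a minimal sketch of splitting first and augmenting only the remaining training data; the path lists, the 20/20 hold-out sizes, and the augment_and_save helper are all hypothetical names for illustration:

    import random

    random.seed(0)
    pairs = list(zip(image_paths, mask_paths))   # the 200 labelled image/mask paths
    random.shuffle(pairs)

    n_test, n_val = 20, 20                       # hold these out before any augmentation
    test_pairs = pairs[:n_test]
    val_pairs = pairs[n_test:n_test + n_val]
    train_pairs = pairs[n_test + n_val:]

    # only the remaining training pairs are augmented (flips, transpose, brightness),
    # so no augmented copy of a test/validation image can leak into training
    for img_path, mask_path in train_pairs:
        augment_and_save(img_path, mask_path, out_dir="train_augmented")  # hypothetical helper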

Using unlabeled dataset in Keras

Usually, when using Keras, the datasets used to train the neural network are labeled.
For example, if I have 100,000 rows of patients with 12 fields per row, the last field will indicate whether the patient is diabetic or not (0 or 1).
Then, after training is finished, I can insert a new record and predict whether this person is diabetic or not.
But in the case of unlabeled datasets, where I cannot label the data for some reason, how can I train the neural network so that it knows these are the normal records, and that any new record which does not match them is malicious or should not be accepted?
This is called one-class learning and is usually done by using autoencoders. You train an autoencoder on the training data to reconstruct the data itself; the labels in this case are the input itself. This gives you a reconstruction error. https://en.wikipedia.org/wiki/Autoencoder
Now you can define a threshold where the data is benign or not, depending on the reconstruction error. The hope is that the reconstruction of the good data is better than the reconstruction of the bad data.
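A minimal sketch of that idea in Keras, assuming 12 numeric features per record as in the question; the layer sizes, the X_normal array of known-good records, and the 95th-percentile threshold are illustrative choices, not a prescription:

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    autoencoder = keras.Sequential([
        layers.Input(shape=(12,)),
        layers.Dense(8, activation="relu"),
        layers.Dense(4, activation="relu"),      # bottleneck
        layers.Dense(8, activation="relu"),
        layers.Dense(12, activation="linear"),
    ])
    autoencoder.compile(optimizer="adam", loss="mse")

    # train the network to reconstruct the normal records: the input is also the target
    autoencoder.fit(X_normal, X_normal, epochs=50, batch_size=64, validation_split=0.1)

    # reconstruction error on known-good data defines a threshold for new records
    recon = autoencoder.predict(X_normal)
    errors = np.mean((X_normal - recon) ** 2, axis=1)
    threshold = np.percentile(errors, 95)        # e.g. flag the worst 5% as suspicious

    def is_suspicious(record):
        err = np.mean((record - autoencoder.predict(record[None, :])[0]) ** 2)
        return err > threshold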
Edit to answer the question about the difference in performance between supervised and unsupervised learning.
This cannot be said with any certainty, because I have not tried it and I do not know what the final accuracy is going to be. But as a rough estimate, supervised learning will perform better on the kind of data it was trained on, because more information is supplied to the algorithm. However, if the actual data is quite different from the training data, the network will underperform in practice, while the autoencoder tends to deal better with different data. Additionally, as a rule of thumb you should have 5000 examples per class to train a neural network reliably, so labeling could take some time. But you will need some data to test on anyway.
It sounds like you need to fit two different models:
a model for bad record detection
a model for prediction of a patient's likelihood to be diabetic
For both of these models, you will need to have labels. For the first model your labels would indicate whether the record is good or bad (malicious) and the second would be whether the patient is diabetic or not.
In order to detect bad records, you may find that simple logistic regression or SVM performs adequately.
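As a rough illustration of that first model, a minimal scikit-learn sketch; the records array and the is_bad labels are hypothetical, and you would still need to obtain those labels somehow:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        records, is_bad, test_size=0.2, random_state=0, stratify=is_bad
    )

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print("Bad-record detection accuracy:", clf.score(X_test, y_test))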

Train Data & Test Data in Data science [closed]

I am relatively new to data science in Python and was exploring some data science competitions. I am getting confused by the "training data set" and the "test data set": some projects have merged both and some have kept them separate. What is the rationale behind having two data sets? Any advice will be helpful, thanks.
"Training data" and "testing data" refer to subsets of the data you wish to analyze. If a supervised machine learning algorithm is being used to do something to your data (ex. to classify data points into clusters), the algorithm needs to be "trained".
Some examples of supervised machine learning algorithms are Support Vector Machines (SVM) and Linear Regression. They can be used to classify or cluster data that has many dimensions, allowing us to clump data points that are similar together.
These algorithms need to be trained with a subset of the data (the "training set") being analyzed before they are used on the "test set". Essentially, the training provides an algorithm an opportunity to infer a general solution for some new data it gets presented, much in the same way we as humans train so we can handle new situations in the future.
Hope this helps!
A dataset is a list of rows and can be split into training and test segments. The reason this is done is to keep a CLEAR separation between the rows of data that are used during the training process (think of them like flashcards that you use to "train" a baby to recognize objects) and the rows of data that are used when you test the baby on those objects. You want them to be separate in order to get an accurate score for how well the algorithm performed (e.g. the baby got 9/10 correct when tested). If you mixed the training rows and the testing rows, you wouldn't know whether the baby just memorized the training results or actually knew how to recognize 9/10 new images.
Generally, datasets are given as one set because, at run time, it is good to select the training and test rows at random. That way you can run the training and testing several times and take the average. For example, the baby might get 9/10 the first time, 6/10 the next, and 7/10 the last; the average accuracy would then be 73.3%. This is a better representation than just trying it once (which, as you can see, is not completely accurate).
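A minimal sketch of that repeat-and-average idea, assuming scikit-learn, hypothetical arrays X and y, and an SVM classifier chosen just for illustration:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    accuracies = []
    for seed in range(3):                        # three different random splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = SVC().fit(X_tr, y_tr)
        accuracies.append(model.score(X_te, y_te))

    print("Mean accuracy over the splits:", np.mean(accuracies))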
The training data set is used to train your model; once it is trained, how do you check how accurate the trained model is? For that we use the test data set, which is why we usually split the available data into two pieces: one for training and one for testing.
Case 1 - when train and test datasets are merged into one - It is advised to split the whole data into train, cross-validation and test sets with a ratio of 60:20:20 (train:CV:test). The idea is to use the train data to build the model and the CV data to test the validity of the model and its parameters. Your model should never see the test data until the final prediction stage. So basically, you should be using the train and CV data to build the model and make it robust.
Case 2 - when train and test datasets are separate - You should split train data into train and CV data sets. Alternatively, you could perform k-fold cross-validation on train set.
In most cases, the split is done randomly. However, in cases when the data is time-dependent, then the split cannot be random.
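For Case 1, a minimal sketch of the 60:20:20 split with scikit-learn; X and y are hypothetical arrays holding the merged data:

    from sklearn.model_selection import train_test_split

    # first carve off 20% as the final test set (untouched until the final prediction stage)
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

    # then split the remaining 80% into 60%/20% of the original data for train/CV
    X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)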
The training set is used to build the model. It contains data that has both the target and the predictor variables. This is data the model has already seen during training, so (after the optimum parameters are found) it gives good accuracy (or a good value of whatever other model performance metric you use).
The test set is used to evaluate how well the model does on data outside the training set (which the model has not seen). The already-trained model is used for prediction and the results are compared against the pre-classified data. The model is then adjusted to minimize error on the test set.