I'm dealing with a data science problem and I've run into an issue.
I have labelled data (training data) and unlabelled data (test data), and both of them have a lot of missing values.
I worked with my data and split it into training data and validation data.
I got very good accuracy and a very small RMSE between Y_validation and the predictions (model.predict(X_validate)). But when I submit my solution, the RMSE gets much larger on the test data!
What can I do?
Firstly, you need labels for your test data. If your test data is not labelled, you will not be able to gauge accuracy; any error you compute will not be an accurate representation.
You need to understand that the training set contains a known output that the model learns from. The test data has to be labelled so that, when the model returns its predictions on the test data, we are able to gauge whether the model has correctly predicted the label assigned to each test example.
On top of a train/test split, you can also do cross-validation to improve your model's performance. You can read more here: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
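For instance, here is a minimal sketch of k-fold cross-validation with scikit-learn; the random forest, the fold count, and the random data are just placeholders for your own setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder training data -- substitute your own X_train / y_train
X_train = np.random.rand(100, 5)
y_train = np.random.rand(100)

model = RandomForestRegressor(random_state=0)

# 5-fold cross-validation; the scorer returns negative RMSE, so flip the sign
scores = cross_val_score(model, X_train, y_train,
                         cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
print("Mean RMSE:", -scores.mean())
```

A large gap between fold scores is itself a hint that the model will not generalize to the test set.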
This will sometimes happen when a model doesn't generalize well, which in turn can happen when a model overfits the training data.
Resampling, or better sampling of the test and train data (which, as mentioned, needs to be labeled), can help you get a model that generalizes better.
I am working on research into population by country, based on this data set:
https://www.kaggle.com/tanuprabhu/population-by-country-2020
I learned that it's best practice to normalize the dataset before training, so I normalized the data using sklearn.preprocessing.MinMaxScaler. I then trained the model on the normalized dataset before saving it.
Next, I wanted to perform predictions on new data, so I created an input file with a format similar to the training dataset. The new input data has only 2 rows (versus the training dataset's 200 rows).
The problem I encounter is that, due to the small number of rows in the new dataset, MinMaxScaler returns only 1 and 0: 1 for the larger value and 0 for the smaller one. When I feed this input into the model, the prediction is far off from the expected value.
I have also tried applying MinMaxScaler to the new data, feeding it into the model, and then inverse-transforming the result. Still, the value is far from the expected one.
I have also tried training the model without MinMaxScaler. I got a better result with this model, but the prediction only responds well when I change certain columns with larger values. The columns with smaller values barely affect it, while in the real world I know these factors are quite significant to the predicted result.
Where did I go wrong?
Any sample code on handling the input for the trained model is much appreciated.
To test what is going on, I suggest you take a row of your training data prior to scaling. Apply the scaler and use the result as the input for a prediction; you should get the same predicted result as that training row produced during training. When you apply the scaler, check whether it generates the same values as are present in the scaled training data for that row. Make sure you are using the scaler that was fit to the training set. Do not fit the scaler to the new data; just use it to transform the data.
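Here is a minimal sketch of that pattern, assuming a simple regression model; the numbers and columns are made up, and only the fit-once/transform-everywhere logic matters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Made-up training data standing in for the population dataset
X_train = np.array([[1_400_000_000, 0.4], [330_000_000, 0.6],
                    [10_000_000, 1.2], [500_000, 2.0]])
y_train = np.array([1.0, 0.8, 0.3, 0.1])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit ONCE, on training data only

model = LinearRegression().fit(X_train_scaled, y_train)

# New data: do NOT fit the scaler again -- only transform
X_new = np.array([[25_000_000, 0.9]])
print(model.predict(scaler.transform(X_new)))

# Sanity check from the answer: a rescaled training row should reproduce
# its training-time prediction exactly
row = scaler.transform(X_train[:1])
assert np.allclose(model.predict(row), model.predict(X_train_scaled[:1]))
```

Fitting the scaler on the 2-row input is exactly what collapsed your features to 0 and 1: min-max scaling a column with only two values always maps them to the endpoints.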
I am working with a large dataset (large for a single machine, that is), with 1,000,000 examples.
I split my dataset as follows: 80% training data, 10% validation data, 10% testing data. Every time I retrain the model, I shuffle the data first, such that some of the data from the validation/testing sets ends up in the training set and vice versa.
My thinking is this:
Ideally I would want all available data for the model to learn from. The more the better, for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples apiece; I may potentially miss out on some crucial data that exists within the validation or testing set and that the previous training set did not account for.
Shuffling prevents the model from learning an ordering where it is not important (at least in my particular dataset).
Here is my workflow:
The test accuracy is more or less equivalent to the validation accuracy (plus or minus 0.5%).
On each retrain, the results usually end up like this: the training accuracy keeps improving (until it runs out of epochs), but the validation accuracy gets stuck at a particular percentage. I then save that model and start the retraining process again. The data gets shuffled. The training accuracy drops, but the validation accuracy jumps up. The training accuracy then improves until the final epoch, while the validation accuracy converges downward (though still higher than in the previous run).
See Example:
I plan on doing this until the training accuracy reaches 99%. (Note: I used Keras Tuner to find the best architecture/model for my particular problem.)
I can't help but think that I am doing something wrong here. From my perspective, this is just the model eventually learning all 1,000,000 examples. It feels like "mild overfitting" because of the shuffling on each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?
If you mix your test/validation data with your training data, you can no longer evaluate your model on that data, since the model has seen it. Model evaluation is based on how well the model makes predictions/classifications on data it has not seen (assuming the evaluation data comes from the same distribution as your training data). If you mix your test set into your training set, you will eventually end up with really good test set accuracy, since that data has been seen by your model, but it may not perform well on new unseen data from the same distribution.
If you are worried about the size of your test/validation data, I suggest you reduce it further and use, say, 99.9% of the data for training instead of 99%. Also, random shuffling will take care of exposing the model to almost every feature of your data.
In short, my point is: never evaluate your model on data it has seen before. It will always give you better results (assuming you have trained your model well, to the point that it memorizes the training data). The validation data is used when you have multiple algorithms/models and need to select one from among them. The algorithm/model that gives good results on the validation data is selected (again, you do not evaluate your model based on validation set accuracy; it is just used to select the model). Once you have selected your model based on validation set accuracy, you evaluate it on new unseen data (called test data) and report the prediction/classification accuracy on the test data as your model accuracy.
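A minimal sketch of that workflow, assuming a classification task; the candidate models and toy data are placeholders for your own:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data standing in for your 1,000,000 examples
X, y = make_classification(n_samples=10_000, random_state=0)

# 80/10/10 split, as in the question -- fixed once, never reshuffled across runs
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# The validation set is used ONLY to choose between candidate models
candidates = {"logreg": LogisticRegression(max_iter=1000), "svm": SVC()}
val_scores = {name: m.fit(X_train, y_train).score(X_val, y_val)
              for name, m in candidates.items()}
best = max(val_scores, key=val_scores.get)

# The untouched test set gives the final, reportable accuracy
print(best, candidates[best].score(X_test, y_test))
```

The key point is that the same rows stay in the same split across every retrain; the shuffling described in the question breaks exactly that guarantee.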
I am working on a project for price movement forecasting and I am stuck with poor quality predictions.
At every time-step I am using an LSTM to predict the next 10 time-steps. The input is the sequence of the last 45-60 observations. I have tested several different ideas, but they all seem to give similar results. The model is trained to minimize MSE.
For each idea I tried a model that predicts 1 step at a time, where each prediction is fed back as input for the next prediction, and a model that directly predicts the next 10 steps (multiple outputs). For each idea I also tried using just the moving average of the previous prices as input, and extending the input with the order book at those time-steps.
Each time-step corresponds to a second.
These are the results so far:
1) The first attempt was to use the moving average of the last N steps as input, and predict the moving average of the next 10.
At time t, I use the ground-truth value of the price and use the model to predict t+1, ..., t+10.
This is the result (plot: predicting the moving average).
On closer inspection we can see what's going wrong: the prediction is practically a flat line that does not care much about the input data.
2) The second attempt was to predict differences instead of the price itself. The input this time, instead of simply being X[t] (where X is my input matrix), would be X[t] - X[t-1].
This did not really help.
The plot this time looks like this (plot: predicting differences).
On closer inspection, though, when plotting the differences, the predictions are basically always 0 (plot of the differences).
At this point I am stuck and running out of ideas to try. I was hoping someone with more experience with this type of data could point me in the right direction.
Am I using the right objective to train the model? Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent your model from always predicting similar values to what it last saw? (They do incur in low error, but they become meaningless at that point).
At least just a hint on where to dig for further info would be highly appreciated.
Thanks!
Am I using the right objective to train the model?
Yes, but LSTMs are always very tricky for forecasting time series, and they are very prone to overfitting compared to other time-series models.
Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent your model from always predicting similar values to what it last saw?
I haven't seen your code or the details of the LSTM you are using. Make sure you are using a very small network and that you are avoiding overfitting. Also make sure that after you difference the data, you reintegrate it before evaluating the final forecast.
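A minimal sketch of the reintegration step with NumPy; the predicted differences here are made-up values standing in for the model's output:

```python
import numpy as np

prices = np.array([100.0, 100.5, 101.2, 100.9, 101.4])

# Train on differences: X[t] - X[t-1]
diffs = np.diff(prices)

# Pretend the model predicted the next 3 differences (made-up values)
predicted_diffs = np.array([0.2, -0.1, 0.3])

# Reintegrate: cumulatively sum the predicted differences onto the
# last observed price to get forecasts back in price space
forecast = prices[-1] + np.cumsum(predicted_diffs)
print(forecast)  # [101.6 101.5 101.8]
```

Evaluating the raw differences directly is how you end up concluding "the predictions are basically 0" even when the price-space forecast looks reasonable, and vice versa.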
One trick to try is to build a model that forecasts the 10 steps ahead directly, instead of building a one-step-ahead model and then forecasting recursively.
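A minimal Keras sketch of such a direct multi-step model, assuming one feature per time-step; the window length and layer size are arbitrary choices you would tune:

```python
from tensorflow import keras

window = 45   # input sequence length (the question uses 45-60)
horizon = 10  # predict all 10 future steps in one shot

model = keras.Sequential([
    keras.layers.Input(shape=(window, 1)),  # one feature per time-step
    keras.layers.LSTM(32),                  # keep the network small
    keras.layers.Dense(horizon),            # 10 linear outputs = direct forecast
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```

The direct model avoids the error accumulation of recursive forecasting, where each fed-back prediction compounds the previous step's mistake.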
I am relatively new to data science in Python and was exploring some data science competitions. I am getting confused by "training data set" and "test data set". Some projects have merged both and some have kept them separate. What is the rationale behind having two data sets? Any advice will be helpful, thanks.
"Training data" and "testing data" refer to subsets of the data you wish to analyze. If a supervised machine learning algorithm is being used to do something to your data (ex. to classify data points into clusters), the algorithm needs to be "trained".
Some examples of supervised machine learning algorithms are Support Vector Machines (SVM) and Linear Regression. They can be used to classify or cluster data that has many dimensions, allowing us to clump data points that are similar together.
These algorithms need to be trained with a subset of the data (the "training set") being analyzed before they are used on the "test set". Essentially, the training provides an algorithm an opportunity to infer a general solution for some new data it gets presented, much in the same way we as humans train so we can handle new situations in the future.
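For instance, a minimal scikit-learn sketch of this train-then-test idea, using an SVM on a toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Train on one random subset, test on the held-out subset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC().fit(X_train, y_train)   # the "training" step
print(clf.score(X_test, y_test))    # accuracy on unseen test data
```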
Hope this helps!
A dataset is a list of rows and can be split into training and test segments. The reason this is done is to keep a CLEAR separation between the rows of data that are used during the training process (think of them like flashcards you use to "train" a baby to recognize objects) and the rows of data that are used when you test the baby on recognizing objects. You want them to be separate in order to get an accurate score for how well the algorithm performed (e.g. the baby got 9/10 correct when tested). If you mixed the training rows and the testing rows, you wouldn't know whether the baby just memorized the training results or actually knew how to recognize 9/10 new images.
Generally, datasets are given as one set because, during code execution, it is good to select the training and test sets by picking rows at random. That way you can run the training and testing several times and take the average. For example, the baby might get 9/10 the first time, 6/10 the next, and 7/10 the last; the average accuracy would then be 73.3%. This is a better representation than trying it only once (which, as you can see, is not completely reliable).
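A minimal sketch of that repeat-and-average procedure; the classifier, the dataset, and the number of repeats are placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Re-split randomly several times and average the test accuracy
scores = []
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print("average accuracy:", np.mean(scores))
```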
The train data set is for training your model. After it has been trained, how do we check how accurate the trained model is? For that we use the test data set, which is why we usually split the available data into two pieces: one for training and one for testing.
Case 1 - when train and test datasets are merged into one: it is advised to split the whole data into train, cross-validation and test sets with ratio 60:20:20 (train:CV:test). The idea is to use the train data to build the model and the CV data to test the validity of the model and its parameters. Your model should never see the test data until the final prediction stage. So basically, you should be using the train and CV data to build the model and make it robust.
Case 2 - when train and test datasets are separate: you should split the train data into train and CV data sets. Alternatively, you could perform k-fold cross-validation on the train set.
In most cases, the split is done randomly. However, when the data is time-dependent, the split cannot be random.
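A minimal sketch of both cases with scikit-learn; the data is a toy stand-in, and shuffle=False illustrates the time-dependent case:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # toy stand-in for your features
y = np.arange(1000)

# Random 60:20:20 split: carve off 40%, then halve it into CV and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_cv), len(X_test))  # 600 200 200

# Time-dependent data: keep chronological order instead of shuffling
X_tr_t, X_te_t, y_tr_t, y_te_t = train_test_split(X, y, test_size=0.2, shuffle=False)
```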
The training set is used to build the model. It contains data with both target and predictor variables. This is data the model has already seen during training, and so (after finding optimal parameters) the model gives good accuracy (or whatever performance metric you use) on it.
The test set is used to evaluate how well the model does on data outside the training set, i.e. data the model has not seen. The already-trained model is used for prediction, and the results are compared against the pre-classified data. The model is adjusted to minimize error on the test set.
Can someone explain what happens when you split up the data set for testing and training?
Put simply, the accuracy of your data mining model is evaluated by making predictions on data whose outcome is already known: the model is built on the training set, and its predictions are checked against the known results in the test set.
More information on the testing and validation of data mining models (MSDN)
To be able to test the predictive analysis model you built, you need to split your dataset into two sets: training and test datasets. These datasets should be selected at random and should be a good representation of the actual population.
Similar data should be used for both the training and test datasets.
Normally the training dataset is significantly larger than the test dataset.
Using the test dataset helps you avoid errors such as overfitting.
The trained model is run against test data to see how well the model will perform.