How to feed normalized new data to a saved trained neural network model and then invert the result? - tensorflow

I am researching population by country, based on this dataset:
https://www.kaggle.com/tanuprabhu/population-by-country-2020
I learned that it is best practice to normalize the dataset before training, so I normalized the data using sklearn.preprocessing.MinMaxScaler. I then trained the model on the normalized dataset and saved it.
Next, I wanted to make predictions on new data, so I created an input file in the same format as the training dataset. The new input data has only 2 rows (versus 200 rows in the training dataset).
The problem is that, because the new dataset is so small, MinMaxScaler returned only 1 and 0: 1 for the larger number and 0 for the smaller one. When I fed this input into the model, the prediction was far from the expected value.
I have also tried applying MinMaxScaler to the new data, feeding it into the model, and then inverting the result. The value was still far from the expected value.
I have also tried training the model without MinMaxScaler. That model gave a better result, but the prediction only responds well when I change the columns with larger values. The columns with smaller values have little effect, even though I know that in the real world these factors are quite significant to the predicted result.
Where did I go wrong?
Any sample code for handling the input to the trained model would be much appreciated.

To test what is going on, I suggest that you take a row of your training data from before it was scaled. Apply the scaler to it and use the result as the input for a prediction. You should get the same prediction the model gives for that row in the training data. When you apply the scaler, check that it generates the same values that are present in the scaled training data for that row. Make sure you are using the scaler that was fit to the training set. Do not fit the scaler to the new data; just use it to transform the new data.
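The advice above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the random arrays, the column count, and the stand-in for model.predict are all assumptions made up for the example. It also assumes the target was scaled with its own MinMaxScaler, which is what makes the inverse step possible.

```python
import pickle

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data: 200 rows, 3 feature columns, 1 target column.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1000, size=(200, 3))
y_train = rng.uniform(1e3, 1e9, size=(200, 1))

# Fit the scalers once, on the training data only.
x_scaler = MinMaxScaler().fit(X_train)
y_scaler = MinMaxScaler().fit(y_train)

# Persist the fitted scaler alongside the model (in-memory round trip here;
# in practice this would be written to a file next to the saved model).
x_scaler = pickle.loads(pickle.dumps(x_scaler))

# New data with only 2 rows: transform with the fitted scaler, do not refit.
X_new = np.array([[10.0, 500.0, 250.0], [900.0, 20.0, 30.0]])
X_new_scaled = x_scaler.transform(X_new)

# Refitting on the 2 rows reproduces the problem from the question:
# every column collapses to exactly 0 and 1.
X_new_wrong = MinMaxScaler().fit_transform(X_new)

# Stand-in for model.predict(X_new_scaled); a real trained model goes here.
y_pred_scaled = X_new_scaled.mean(axis=1, keepdims=True)

# Map the scaled prediction back to original units with the target scaler.
y_pred = y_scaler.inverse_transform(y_pred_scaled)
```

The key point is that fit is called only on the training set, while the new rows only ever see transform and inverse_transform.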

Related

Injecting input data in the output layer

I'm building a model using Tensorflow where the input is a slice of the output. Think of the output layer as a 2D array. One row of that array is the input data. The neural network currently tries to connect the input to the output using a mean-squared error loss function. It's doing a fairly good job, but the accuracy needs to be improved a little.
To do that, I'm trying to add another physics-based loss function. If I can have the network place the input slice in its correct location in the output, that would greatly simplify the problem as each row in the output 2D array depends on the two rows above it.
I hope this makes sense.

Best way to evaluate performance with tf.data.Dataset

I trained a model and now want to evaluate its performance on a test set. The test set is loaded as a tf.data.TFRecordDataset object (from multiple TFRecords with multiple examples in each of them) which consists of about a million examples in the form of tuples (image, label); the data are batched. The raw labels are then mapped to the target integers (one-hot encoded) that the model needs to predict.
I understand that I can pass the Dataset object as an input to model.predict(), which will output predictions for each example in the dataset. However, to compute some metric I need to compare the true target values to the predicted ones, and to obtain the former I need to iterate through the Dataset, because all true labels are stored in there.
This seems like a common task, but I couldn't find a straightforward solution that works for a huge dataset in TFRecord format. What would be the best way to compute, for instance, AUC per class in this case? Should I use Callbacks with model.predict(test_dataset)? Or should I process each example one by one in a loop, save true and predicted values into arrays and then use, for example, sklearn.metrics.roc_auc_score() to compute AUC scores for the two arrays? Or maybe I'm missing some obvious way to do it?
Thanks in advance!
If you need all labels, why not just:
model.evaluate(test_dataset.take(-1))
or, if your dataset is too large for that, iterate over it in batches, calculate your metric on each batch, and take the mean at the end.
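One way to get AUC per class, along the lines the question itself proposes, is to collect labels and predictions batch by batch and score them once at the end. The sketch below is only an illustration: per_class_auc is a hypothetical helper, and the toy batches and identity predict function stand in for a real dataset and model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def per_class_auc(batches, predict_fn):
    """Collect one-hot labels and predicted scores per batch, then score each class."""
    y_true, y_score = [], []
    for images, labels in batches:
        y_true.append(np.asarray(labels))
        y_score.append(np.asarray(predict_fn(images)))
    y_true = np.concatenate(y_true)
    y_score = np.concatenate(y_score)
    # With one-hot (indicator) targets, average=None yields one AUC per class.
    return roc_auc_score(y_true, y_score, average=None)


# Toy stand-in: 12 examples, 3 classes, in batches of 4. The "images" are just
# the one-hot labels and the "model" returns them unchanged, so it is perfect.
one_hot = np.eye(3)[np.arange(12) % 3]
batches = [(one_hot[i:i + 4], one_hot[i:i + 4]) for i in range(0, 12, 4)]
auc = per_class_auc(batches, predict_fn=lambda x: x)
```

With a real tf.data.Dataset, batches could plausibly be test_dataset.as_numpy_iterator() and predict_fn could be model.predict_on_batch. Only the labels and scores are kept in memory, not the images.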

Weak accuracy with testing data

I'm dealing with a data science problem, and I ran into the following issue.
I have labelled data (training data) and unlabelled data (test data), and both of them have a lot of missing values.
I worked with my data and split it into training data and validation data.
I got a very good accuracy and a very small RMSE between Y_validation and the predicted values ( model.predict(X_validate) ). But when I submit my solution, the RMSE gets bigger on the testing data!
What can I do?
Firstly, you need to label your test data. If your test data is not labelled, you will not be able to gauge the accuracy, and any error you compute will not be an accurate representation.
You need to understand that the training set contains a known output that the model learns from. The test data has to be labelled so that, when the model returns its predictions on the test data, we are able to gauge whether the model has correctly predicted the label given to the test data.
On top of doing a train/test split, you can also do cross-validation to improve your model's performance. You can read more here: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
This will happen sometimes when a model doesn't generalize well, for example when the model overfits to the training data.
Resampling, or better sampling of the test and train data (which, as mentioned, needs to be labelled), can help you get a model that generalizes better.
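As a rough illustration of the cross-validation suggestion, the sketch below scores a model on five folds instead of a single validation split. The synthetic data, the linear model, and the mean imputation step are all assumptions chosen for the example; the imputer is placed inside the pipeline so that it is fit only on each fold's training part.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical labelled data with roughly 10% of the feature values missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan

# Imputation and model are refit from scratch inside every fold.
pipeline = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    pipeline, X, y, cv=cv, scoring="neg_root_mean_squared_error"
)

# scikit-learn reports negated RMSE so that larger is better; flip the sign.
rmse_per_fold = -scores
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

A large spread between folds, or a fold RMSE much worse than the single validation score, would be a hint that the original split was optimistic.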

time-series prediction for price forecasting (problems with predictions)

I am working on a project for price movement forecasting and I am stuck with poor quality predictions.
At every time-step I am using an LSTM to predict the next 10 time-steps. The input is the sequence of the last 45-60 observations. I tested several different ideas, but they all seem to give similar results. The model is trained to minimize MSE.
For each idea I tried a model predicting 1 step at a time, where each prediction is fed back as an input for the next prediction, and a model directly predicting the next 10 steps (multiple outputs). For each idea I also tried using as input just the moving average of the previous prices, and extending the input to include the order book at those time-steps.
Each time-step corresponds to a second.
These are the results so far:
1) The first attempt used the moving average of the last N steps as input and predicted the moving average of the next 10.
At time t, I use the ground-truth value of the price and use the model to predict t+1, ..., t+10.
This is the result:
[Figure: Predicting moving average]
On closer inspection we can see what is going wrong:
The prediction seems to be a flat line; it does not care much about the input data.
2) The second attempt was to predict differences instead of the price itself. This time the input, instead of simply being X[t] (where X is my input matrix), is X[t]-X[t-1].
This did not really help.
The plot this time looks like this:
[Figure: Predicting differences]
But on close inspection, when plotting the differences, the predictions are always basically 0.
[Figure: Plot of differences]
At this point I am stuck and running out of ideas to try. I was hoping someone with more experience in this type of data could point me in the right direction.
Am I using the right objective to train the model? Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent your model from always predicting values similar to what it last saw? (Such predictions do incur a low error, but they become meaningless at that point.)
At least just a hint on where to dig for further info would be highly appreciated.
Thanks!
Am I using the right objective to train the model?
Yes, but LSTMs are always very tricky for forecasting time series, and they are very prone to overfitting compared to other time series models.
Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent your model from always predicting similar values to what it last saw?
I haven't seen your code or the details of the LSTM you are using. Make sure you are using a very small network and that you are avoiding overfitting. Make sure that, after differencing the data, you reintegrate it before evaluating the final forecast.
One trick to try is to build a model that forecasts 10 steps ahead directly, instead of building a one-step-ahead model and then forecasting recursively.
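The reintegration step mentioned above can be sketched like this. The price series and the predicted differences are made-up numbers, not output from any model. The sketch also shows that a model which always predicts a difference of 0 is the same as repeating the last observed price, which makes that "persistence" forecast a useful baseline to compare against.

```python
import numpy as np

# Hypothetical observed prices, one per second.
prices = np.array([100.0, 101.0, 103.0, 102.0, 105.0, 107.0])

# Differencing: diffs[i] = prices[i + 1] - prices[i]
diffs = np.diff(prices)

# Sanity check: the cumulative sum of the differences rebuilds the series.
rebuilt = prices[0] + np.cumsum(diffs)

# Hypothetical model output: predicted differences for the next 3 steps.
predicted_diffs = np.array([0.5, -0.2, 1.0])

# Reintegrate: start from the last known price and accumulate the differences.
forecast = prices[-1] + np.cumsum(predicted_diffs)

# Predicting all-zero differences is the persistence baseline:
# every future step equals the last observed price.
persistence = prices[-1] + np.cumsum(np.zeros_like(predicted_diffs))
```

Evaluating the reintegrated forecast against the persistence baseline shows whether the model adds anything beyond "the next price equals the last price".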

Keras model returns different values

To play with data, I have trained a linear regression with Keras+TensorFlow, and compared the first prediction computed in 3 different ways:
1) I got the weights from the model, and just used the linear regression formula p = w*X0 + b
2) I got predictions using the model.predict(X) method of Keras for the whole data array X and then took only the first element of it
3) I got a prediction using the same method only for the first row of features X0 (the first sample)
In theory, all those methods should produce the very same value. However, in practice I get values that are slightly different.
The difference is not that big, but I still wonder why this is the case. Is it only due to float precision in Python?
This is most likely because matrix multiplications and convolutions are implemented in a way that does not guarantee a fixed order of operations: if you change the batch size, you change the order in which the multiply-adds happen, and since floating-point addition is not associative, you get slightly different results.
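The non-associativity mentioned above can be seen with plain Python floats, without any Keras code. This small sketch only illustrates the floating-point effect; it does not reproduce what TensorFlow does internally.

```python
import math

# The same three numbers, added in two different orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)  # False: the two groupings round differently
print(a - b)   # a tiny, nonzero difference

# Results like these should be compared with a tolerance, not with ==.
same_within_tolerance = math.isclose(a, b, rel_tol=1e-9)
print(same_within_tolerance)  # True
```

For arrays of predictions, numpy.allclose plays the same role as math.isclose does here.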