I'm using Stable Baselines to train an A2C model.
My data length is 9000. What value should I set for total_timesteps in model.learn?
model.learn(total_timesteps=9000)  # ?
I did some research: some people suggest around 10,000 and others suggest 1 million. I'm really confused.
Any suggestions?
The total_timesteps in Stable Baselines is the total number of samples (environment steps) to train on, so it is not necessarily the same as the length of your episode.
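To get a feel for the numbers in the question, here is a rough sketch of how many passes over the data each suggested value corresponds to. It assumes that one episode walks through the 9000 data points exactly once, which the question does not state; passes_over_data is a hypothetical helper, not part of Stable Baselines.

```python
def passes_over_data(total_timesteps: int, episode_length: int) -> float:
    """Approximate number of full episodes covered by total_timesteps."""
    return total_timesteps / episode_length


episode_length = 9000  # assumption: one episode = one pass over the data

for total_timesteps in (9_000, 10_000, 1_000_000):
    passes = passes_over_data(total_timesteps, episode_length)
    print(total_timesteps, "steps is about", round(passes, 1), "passes")
```

Under that assumption, total_timesteps=9000 shows the agent the data only once, which is why much larger values are commonly suggested.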
I'm using Optuna 2.5 to optimize a couple of hyperparameters on a tf.keras CNN model. I want to use pruning so that the optimization skips the less promising corners of the hyperparameter space. I'm using something like this:
study0 = optuna.create_study(
    study_name=study_name,
    storage=storage_name,
    direction='minimize',
    sampler=TPESampler(n_startup_trials=25, multivariate=True, seed=123),
    pruner=optuna.pruners.SuccessiveHalvingPruner(
        min_resource='auto', reduction_factor=4, min_early_stopping_rate=0),
    load_if_exists=True)
Sometimes the model stops after 2 epochs, other times after 12 epochs, 48, and so forth. What I want is to ensure that the model always trains for at least 30 epochs before being pruned. I guess that the parameter min_early_stopping_rate might have some control over this, but when I changed it from 0 to 30 the models never got pruned. Can someone explain, a bit better than the Optuna documentation does, what these parameters of SuccessiveHalvingPruner() really do (especially min_early_stopping_rate)?
Thanks
The documentation's explanation of min_resource says:
A trial is never pruned until it executes min_resource * reduction_factor ** min_early_stopping_rate steps.
So I suppose we need to replace min_resource='auto' with a specific number chosen according to reduction_factor and min_early_stopping_rate.
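The quoted formula can be evaluated directly to see why min_early_stopping_rate=30 stops pruning altogether. This is only arithmetic on the documented expression; first_prunable_step is a hypothetical helper, and min_resource=1 in the second call is an assumed value for illustration (with 'auto', Optuna picks the value itself).

```python
def first_prunable_step(min_resource: int, reduction_factor: int,
                        min_early_stopping_rate: int) -> int:
    """Steps a trial must execute before it can be pruned, per the quoted formula."""
    return min_resource * reduction_factor ** min_early_stopping_rate


# min_resource=30 with rate 0: pruning can start after 30 steps (epochs).
print(first_prunable_step(30, 4, 0))

# Rate 30 with reduction_factor=4: 4**30 steps, far more than any training run.
print(first_prunable_step(1, 4, 30))
```

If the goal is at least 30 epochs before pruning, this suggests setting min_resource=30 and leaving min_early_stopping_rate=0, assuming one reported step per epoch.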
I started developing some LSTM models and now have some questions about normalization.
Let's pretend I have some time series data that ranges roughly between -500 and +500. Would it make more sense to scale the data to the range -1 to 1, or is 0 to 1 a better choice? I tested it and 0 to 1 seemed to be faster. Is there a wrong way to do it, or would one of them just be slower to learn?
Second question: when do I normalize the data? I split the data into training and test data; do I have to scale/normalize these separately? Maybe the training data only ranges from -200 to +300 and the test data ranges from -100 to +600. That's not very good, I guess.
But on the other hand, if I scale/normalize the entire dataframe and split it afterwards, the data is fine for training and testing, but how do I handle genuinely new incoming data? The model is trained on scaled data, so I have to scale the new data as well, right? But what if a new value is 1000? The normalization would turn it into something greater than 1, because it is larger than anything seen before.
To make a long story short: when do I normalize the data, and what happens with completely new data?
I hope I've made my problem clear :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from inputs that follow a standard Gaussian distribution (mean 0 and variance 1).
Techniques like Batch Normalization help (simplifying a bit) the network keep this property throughout all of its layers, so they are usually beneficial.
There are other approaches, such as the ones you mentioned; to tell reliably which one helps for a given problem and architecture, you just have to try them and measure.
2. What about test data?
The mean to subtract and the standard deviation to divide each instance by (or any other statistic you gather for any of the normalization schemes mentioned previously) should be computed from your training dataset. If you take them from the test set, you introduce data leakage (information about the test distribution is incorporated into training) and you may get the false impression that your algorithm performs better than it does in reality.
So just compute the statistics over the training dataset and use them on incoming/validation/test data as well.
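A minimal sketch of that workflow, using only the standard library: the statistics are fitted on the training values and then reused unchanged for new data. The numbers and the helper names fit_standardizer and standardize are made up for illustration.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Compute the mean and standard deviation from the training data only."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any data (train, test, or new)."""
    return [(v - mu) / sigma for v in values]


train = [-200.0, -50.0, 0.0, 120.0, 300.0]  # hypothetical training values
new = [600.0, 1000.0]                       # hypothetical unseen values

mu, sigma = fit_standardizer(train)
train_scaled = standardize(train, mu, sigma)
new_scaled = standardize(new, mu, sigma)  # same mu and sigma, no refitting
print(new_scaled)
```

A new value such as 1000 simply maps to a standardized value larger than anything in the training set; the scaler itself is not refitted.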
I am playing around with DeepExplainer to get SHAP values for deep learning models. By following some tutorials I can get some results, i.e. which variables are pushing the model prediction away from the base value, which is the average model output on the training set.
I have around 5,000 observations along with 70 features. The performance of DeepExplainer is quite satisfactory. And my code is:
model0 = load_model(model_p+'health0.h5')
background = healthScaler.transform(train[healthFeatures])
e = shap.DeepExplainer(model0, background)
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))
test2 = test[healthFeatures].copy()
test2[healthFeatures] = healthScaler.transform(test[healthFeatures])
shap.force_plot(e.expected_value[0], shap_values[0][947,:], test2.iloc[947,:])
And the plot is the following:
Here the base value is 0.012 (can also be seen through e.expected_value[0]) and very close to the output value which is 0.01.
At this point I have some questions:
1) The output value is not identical to the prediction obtained through model0.predict(test[healthFeatures])[947], which is -0.103. How should I interpret the output value?
2) As can be seen, I am using the whole training set as the background to approximate the conditional expectations of the SHAP values. What is the difference between using random samples from the training set and using the entire set? Is it only a performance issue?
Many thanks in advance!
Probably too late, but this is still a very common question, so the answer may benefit other beginners. To answer (1): the expected value and the output value will be different. The expected value is, as the name suggests, the average of the scores predicted by your model; e.g., if the model outputs probabilities, it is the average of the probabilities your model produces. For (2): as long as the background set has fewer than 5k samples, it won't change much, but with more than 5k samples your calculations will take days to finish.
See this (lines 21-25) for more comprehensive answers.
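To illustrate the difference between the expected (base) value and the output value, here is a toy sketch with a hypothetical linear model rather than the Keras model from the question. It relies on two properties of SHAP: the base value is the average prediction over the background set, and the output value is the base value plus the sum of the sample's SHAP values. For a linear model with independent features, each SHAP value is the weight times the feature's deviation from its background mean.

```python
# Toy linear model f(x) = w . x + b; weights and data are made up for illustration.
weights = [0.5, -0.2]
bias = 0.1


def predict(x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


background = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

# Base (expected) value: average prediction over the background set.
base_value = sum(predict(row) for row in background) / len(background)

# Linear-model SHAP values: w_i * (x_i - background mean of feature i).
feature_means = [sum(col) / len(col) for col in zip(*background)]
x = [3.0, 0.0]
shap_values = [w * (xi - m) for w, xi, m in zip(weights, x, feature_means)]

# Output value: base value plus the sum of this sample's SHAP values.
output_value = base_value + sum(shap_values)
print(base_value, output_value, predict(x))
```

In this toy case the output value matches predict(x) for the same input x, while the base value stays the same for every sample.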
I'm trying to run a hyperparameter optimization script for a convolutional neural network using TensorFlow.
As you may know, TF's handling of GPU memory isn't that fancy (and I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batchSize = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.nn.conv2d function will need the following amount of memory:
image_width * image_height * number_of_channels * batchSize * filter_height * filter_width * filter_depth * 32bit
since we're using the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128x128x3x20x4x4x32x32 = 16106127360 (bits), which is about 16 Gbit (roughly 2GB) of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
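The same sum can be written as a small function so other shapes are easy to try. This is a sketch of the arithmetic above only: it counts the input, kernel, and output tensors of a single convolution and assumes the output keeps the input's spatial size, as in the answer. The function name and parameter names are made up, and gradients or other layers are not counted.

```python
def conv_layer_memory_mb(batch, height, width, in_channels,
                         kernel_h, kernel_w, out_channels, bytes_per_value=4):
    """Memory in MB for the input, kernel, and output of one convolution."""
    input_values = batch * height * width * in_channels
    kernel_values = kernel_h * kernel_w * in_channels * out_channels
    # Assumes the output has the same height and width as the input.
    output_values = batch * height * width * out_channels
    total_values = input_values + kernel_values + output_values
    return total_values * bytes_per_value / 1024 ** 2


# Shapes from the question: batch 20, 128x128x3 images, 4x4 kernel, 32 filters.
print(round(conv_layer_memory_mb(20, 128, 128, 3, 4, 4, 32), 1))
```

Doubling the batch size roughly doubles the result, since the kernel term is tiny compared with the input and output tensors.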
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a TensorBoard screenshot of the net. It reports 40MB rather than 44MB, probably because it excludes the input, and it also shows all the tensor sizes I mentioned earlier.
I am implementing a bidirectional dynamic RNN. Now I face the question of whether I need to bucket my training samples.
My thought (and fear) is that if I don't bucket, I might face the following situation: in a batch of 32 samples where all but one are below 500 characters long and one is, say, 10,000 characters long, backprop will behave basically as if I had a batch size of 1, which might quickly result in NaNs or throw off the learned weights pretty badly every time that situation occurs.
Any experience to share before I write code and spend days on training and debugging? Thx