How often do dropped out weights get updated


I am working on a problem with little data. I am augmenting the training set, i.e. I am rotating my images by up to 12 degrees in both directions:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.rotate.html
Since I only have my work PC (with just an i5 CPU), my batch_size is small. So small, in fact, that I only process one image, with its rotations, per batch (and, of course, I use only a tiny learning_rate).
What I need to know is whether dropout is updated per picture or per batch. If it is per batch, I would need to change tactics.

Dropout is updated at every training step, that is, once per batch of the batch_size you define in model.fit(). If unspecified, the default batch_size is 32.
Dropout: The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time.
Step: A training step is one gradient update. In one step, batch_size examples are processed.
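To make "at each step" concrete, here is a minimal pure-Python sketch of inverted dropout. It is not Keras's implementation: the function name `dropout`, the seed, and the toy batch are assumptions for illustration. It reflects two documented behaviors of the Keras Dropout layer: kept units are scaled by 1/(1 - rate), and with the default noise_shape the random mask has the same shape as the input, so every example in the batch gets its own random zeros on every call.

```python
import random

def dropout(batch, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, scale the rest."""
    keep_scale = 1.0 / (1.0 - rate)
    return [[0.0 if rng.random() < rate else x * keep_scale for x in example]
            for example in batch]

rng = random.Random(0)          # hypothetical seed, only for repeatability
batch = [[1.0, 2.0, 3.0, 4.0],  # stand-in for one image and two rotated copies
         [1.0, 2.0, 3.0, 4.0],
         [1.0, 2.0, 3.0, 4.0]]

for step in range(2):           # one call per training step (= one batch)
    print(step, dropout(batch, rate=0.5, rng=rng))
```

Each printed step uses a freshly drawn mask, and the rows within a step generally differ from one another even though the inputs are identical.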

Related

Unstable loss in binary classification for time-series data - extremely imbalanced dataset

I am working on a deep learning model to detect regions of timesteps with anomalies. This model should classify each timestep as possessing the anomaly or not.
My labels are something like this:
labels = [0 0 0 1 0 0 0 0 1 0 0 0 ...]
The 0s represent 'normal' timesteps and the 1s represent the existence of an anomaly. In reality, my dataset is very imbalanced:
My training set consists of over 7000 samples, of which only 1400 (20%) contain at least 1 anomaly (a timestep labeled 1).
I am feeding samples with 4096 timesteps each. The average number of anomalies, in the samples that contain them, is around 2. So, assuming there is an anomaly, the share of anomalous timesteps is roughly 0.02% to 0.05% of each sample.
With that said, I do need to shift from the standard binary cross entropy to something that highlights the anomalous timesteps from the anomaly free timesteps.
So, I experimented adding weights to the anomalous class in such a way that the model is forced to learn from the anomalies and not just reduce its loss from the anomaly-free timesteps. It actually worked well and the model seems to learn to detect anomalous timesteps. One problem however is that training can become quite unstable (and unpredictable), with sudden loss spikes appearing and affecting the learning process. Below, you can see the effects on the loss and metrics charts for two of my trainings:
After debugging the training runs, I am confident that the problem comes from occasional predictions for the anomalous timesteps. That is, in some samples of a certain epoch, and at some anomalous timesteps, the model gives a very low prediction, e.g. 0.01, for a label of 1 (it should be close to 1, of course). Given the very high (but supposedly necessary) weights on the anomalous timesteps, the penalty is extreme and the loss skyrockets.
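The size of that penalty can be checked with a small sketch of class-weighted binary cross-entropy for a single timestep. The function name `weighted_bce` and the weight of 2000 (roughly 4096 / 2) are assumptions for illustration; the actual weights used in the question are not stated.

```python
import math

def weighted_bce(y_true, y_pred, pos_weight, eps=1e-7):
    """Binary cross-entropy for one timestep, with the positive class up-weighted."""
    p = min(max(y_pred, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(pos_weight * y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

pos_weight = 2000.0   # hypothetical weight for the anomalous class
print(weighted_bce(0, 0.01, pos_weight))   # normal timestep, correct: about 0.01
print(weighted_bce(1, 0.99, pos_weight))   # anomaly, confident and right: about 20
print(weighted_bce(1, 0.01, pos_weight))   # anomaly, confident and wrong: about 9210
```

Under these assumed numbers, a single confidently wrong anomalous timestep, averaged over the 4096 timesteps of its sample, adds a little over 2 to that sample's loss, which is orders of magnitude above a per-sample loss of 0.005.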
Going deeper, if I inspect the losses of the sample where the jump happened and look at the batch right before the loss jumped, I see that the losses are all below 10^-2 (0.0053, 0.004, 0.0041, ...), with not a single sample above those values; overall, the average loss is 0.005. However, if I inspect the loss of the following batch, in that same sample, the average loss of the batch is already 3.6, with part of the samples having a low loss and another part a very high loss, e.g. 9.2, 7.7, 8.9, ... I can confirm that all the high losses come from the penalties for the predictions at the timesteps labeled 1. The following batches of the same sample and some of the batches of the next epoch are affected, and it takes some time for the loss to start decreasing again and return to a stable learning process.
With this said, I have had this problem for some weeks already and really need some guidance on what I could try to deal with the spikes, which I assume arise from the gradient updates associated with anomalous timesteps that are harder to learn.
I am currently using a simple Keras model with two LSTM layers of 64 units each, followed by a final dense layer with 1 unit and sigmoid activation. The optimizer is Adam, and I am training with batch size 128. Some things to consider also:
I have tried changes in the weights and other loss functions. Ultimately, if I reduce the weights given to the anomalous timesteps, the model doesn't give much importance to them and reduces the loss by considering only the anomaly-free timesteps. I have also considered focal binary cross-entropy loss, but it doesn't seem to do anything that could avoid those jumps as, in the end, it is all about adding or reducing weights for certain timesteps.
My current learning rate is Adam's default, 10^-3. I have tried reducing the learning rate, which leads to less impactful spikes (they're still there, though), but the model also takes much more time or gets stuck. I'm not sure this is the way to go, as the training seems to go well except for these cases. A decaying learning rate might also not make much sense, as the spikes can happen early in training and not only in later epochs.
I am still investigating gradient clipping as a solution. I am not yet sure what values to use or whether it is actually an effective solution for my case, but from what I understand, it should help counter the jumps resulting from those 'almost' exploding gradients.
The spikes could originate from sample noise / bad samples. However, since I am already using batch size 128 and I have already tested training with simple synthetic samples I have created and the spikes were still there, I guess it is not a problem with specific samples.
The imbalance obviously plays the bigger role here. I'm not sure that undersampling the majority class of 4096-timestep samples (e.g. increasing the share of samples with at least one anomalous timestep from 20% to 50%) would make a big difference, since each sample is by itself very imbalanced: it contains only around 2 anomalous timesteps. It is a problem of imbalance within each sample.
I know this is quite a lot of context, but honestly I am at my limit after trying things for weeks.
The solutions I am inclined to go for next are either gradient clipping or changing my samples to be more centered around the anomalous timesteps, so that they contain fewer anomaly-free timesteps and hopefully allow convergence without having to apply such drastic weights to the anomalous timesteps. This last option is more difficult for me due to some restrictions, but I might look at it if I have nothing else available.
What do you think? I am able to provide more information if needed.
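For reference, the gradient clipping mentioned in the question can be sketched in a few lines. This is a generic illustration of clipping by L2 norm on a flat list of gradient values, not Keras's internal code; the function name `clip_by_global_norm` and the threshold of 1.0 are assumptions, and a suitable threshold has to be found experimentally.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_global_norm([0.3, -0.4], max_norm=1.0))    # norm 0.5: unchanged
print(clip_by_global_norm([30.0, -40.0], max_norm=1.0))  # norm 50: rescaled to norm 1
```

Clipping keeps the direction of the update and only limits its length, so ordinary batches are untouched while a spike batch moves the weights no further than the threshold allows. In Keras, optimizers such as Adam accept clipnorm and clipvalue arguments for this purpose.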

Does SGD in Tensorflow make a move with each data point?

I assumed the "stochastic" in Stochastic Gradient Descent came from the random selection of samples within each batch. But the articles I have read on the topic seem to indicate that SGD makes a small move (weight change) with every data point. How does Tensorflow implement it?

Yes, SGD is indeed randomly sampled, but the point here is a little different.
SGD itself doesn't do the sampling. You do the sampling by batching and hopefully shuffling between each epoch.
GD means you generate gradients for each weight after forward-propagating the entire dataset (batch size = cardinality of the dataset, and steps per epoch = 1). If your batch size is less than the cardinality of the dataset, then you are the one doing the sampling, and you are running SGD, not GD.
The implementation is pretty simple, something like:
1. Forward prop a batch (one step).
2. Find the gradients.
3. Update the weights with those gradients.
4. Go back to step 1 with the next batch.
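The loop above can be sketched in plain Python for a one-parameter model y = w * x. This is an illustrative toy, not TensorFlow's implementation; the function name `sgd_fit`, the toy data, and the hyperparameters are all assumptions.

```python
import random

def sgd_fit(xs, ys, batch_size, epochs, lr, seed=0):
    """Fit y = w * x by mini-batch SGD; returns (w, number_of_updates)."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)                       # the "stochastic" part
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over the batch with respect to w.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                      # one update per batch
            updates += 1
    return w, updates

xs = [0.1 * i for i in range(1, 21)]            # 20 toy points on the line y = 3x
ys = [3.0 * x for x in xs]
print(sgd_fit(xs, ys, batch_size=5, epochs=50, lr=0.1))
```

With batch_size=5 the weight moves four times per epoch; setting batch_size=20 (the whole dataset) gives one update per epoch, which is plain GD, and batch_size=1 gives one update per data point.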

MLP output of first layer is zero after one epoch

I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 5M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP on all 3 datasets of 5k values, 500k values and 5M values.
Tested with the number of epochs ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of the sigmoid, relu and tanh activation functions for the hidden layers (the last layer has a linear activation because it's a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
EDIT:
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

# Six joint angles (the regression targets) followed by six pose values (the inputs).
state_cols = ['state0', 'state1', 'state2', 'state3', 'state4', 'state5']
pose_cols = ['pose0', 'pose1', 'pose2', 'pose3', 'pose4', 'pose5']
df = pd.read_csv('kinova_jaco_data_5k.csv', names=state_cols + pose_cols)
states = df[state_cols].to_numpy()
poses = df[pose_cols].to_numpy()

# Split into 64% train, 16% validation and 20% test.
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)

# Standardize the inputs using statistics from the training split only.
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std

n_epochs = 100
n_hidden_layers = 2
n_units = [48, 48]
activation = 'relu'  # 'sigmoid' and 'tanh' were tried as well

# Functional-API MLP: 6 inputs -> hidden layers -> 6 linear outputs.
inputs = Input(shape=(6,), dtype='float32', name='input')
x = Dense(units=n_units[0], activation=activation, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
    x = Dense(units=n_units[i], activation=activation, name='dense' + str(i + 1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)

optimizer = SGD(learning_rate=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train,
                    y_train,
                    epochs=n_epochs,
                    verbose=1,
                    # Monitor the validation split; keep the test split for the final evaluation.
                    validation_data=(x_val, y_val),
                    batch_size=32)
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]² and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.

"the output of my very first layer is always zero."
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -𝜋 to 𝜋 probably with an expected value of 0, so it checks out.
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads to the aforementioned collapse to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the target pose (the network's input) requires a large change in the joint angles (the network's output). Let's say I'm trying to scratch my back with my left hand: the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small changes in the target pose cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good, a very small batch size will make the gradient too noisy
The loss function is not critical, both MSE and MAE should work
The activation functions aren't critical, ReLU is a good default choice.
The default initializers are good enough.
Normalizing is important for Dense layers, so keep it
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
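The early-stopping rule from the list above (stop once the validation loss has not dropped for 5-10 epochs) can be sketched as a small helper. The function name `early_stopping_epoch` and the loss values are made up for illustration; Keras offers the same idea through its EarlyStopping callback with monitor and patience arguments.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has not improved on its best
    value for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74]   # made-up values
print(early_stopping_epoch(history, patience=5))           # 7
```

Here the best loss (0.7) is reached at epoch 2, and the five following epochs fail to beat it, so training stops at epoch 7.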
Further reading: 37 Reasons why your Neural Network is not working

I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.

Large trainable embedding layer slows down training

I am training a network to classify text with an LSTM. I use a randomly initialized and trainable embedding layer for the word inputs. The network is trained with the Adam optimizer, and the words are fed into the network with a one-hot encoding.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time for a training epoch.
Shouldn't the training only update the weights which were used during the prediction of the current data point? If my input sequences always have the same length, the same number of updates should always happen, regardless of the size of the embedding layer.

The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall clock time, then it's an indication that simply performing the embedding lookup (and writing the update of embedding table) now takes a significant part of your training time.
Which could easily be the case. 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You're also not saying how you're training this (CPU, GPU, what GPU) which has a meaningful impact on how this should turn out because of e.g. cache effects, as for CPU doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
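The back-of-the-envelope figure above works out as follows. The embedding length of 200 is the assumption made in the answer, not a value given in the question.

```python
vocab_size = 2_000_000      # words in the embedding layer (from the question)
embedding_dim = 200         # assumed above; the question does not state it
bytes_per_float32 = 4

table_bytes = vocab_size * embedding_dim * bytes_per_float32
print(table_bytes / 1e9)    # 1.6 (gigabytes)
```

For the smaller vocabulary of 200'000 words, the same calculation gives 0.16 gigabytes, a tenth of the size.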
Also, your premise is a bit unusual. How much labeled data do you have that would have enough examples of the #2000000th rarest word to calculate a meaningful embedding directly? It's probably possible, but would be unusual, in pretty much all datasets, including very large ones, the #2000000th word would be a nonce and thus it'd be harmful to include it in trainable embeddings. The usual scenario would be to calculate large embeddings separately from large unlabeled data and use that as a fixed untrainable layer, and possibly concatenate them with small trainable embeddings from labeled data to capture things like domain-specific terminology.

If I understand correctly, your network takes one-hot vectors representing words to embeddings of some size embedding_size. Then the embeddings are fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell depends on the size of the embedding. If you look, for example, at the equation for the forget gate of the t-th cell (written here in one common formulation),
f_t = σ(W_f · x_t + U_f · h_{t-1} + b_f)
you can see that the matrix of weights W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, and it takes longer to train.
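As a rough check of this dependence, the parameter count of a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias, no peepholes) can be computed directly. The helper names and the width of 64 units are assumptions for illustration.

```python
def lstm_param_count(input_dim, units):
    """Weights in one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (input_dim * units + units * units + units)

def embedding_param_count(vocab_size, embedding_size):
    """Weights in an embedding table: one vector of embedding_size per word."""
    return vocab_size * embedding_size

units = 64   # hypothetical LSTM width
for embedding_size in (100, 200):
    print(embedding_size, lstm_param_count(embedding_size, units))
for vocab_size in (200_000, 2_000_000):
    print(vocab_size, embedding_param_count(vocab_size, 200))
```

Under this formula the LSTM grows with embedding_size but not with the number of words; the vocabulary size only changes the size of the embedding table.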

In tensorflow estimator class, what does it mean to train one step?

Specifically, within one step, how does it train the model? What is the stopping condition for gradient descent and backpropagation?
Docs here: https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
e.g.
mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": X_train},
    y=y_train,
    batch_size=50,
    num_epochs=None,
    shuffle=True)
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
I understand that training one step means that we feed the neural network model with batch_size many data points once. My question is: within this one step, how many times does it perform gradient descent? Does it do backpropagation and gradient descent just once, or does it keep performing gradient descent until the model weights reach an optimum for this batch of data?

In addition to @David Parks' answer: using batches to perform gradient descent is referred to as (mini-batch) stochastic gradient descent. Instead of updating the weights after each training sample, you average the gradients over the batch and use this averaged gradient to update your weights.
For example, if you have 1000 training samples and use batches of 200, you calculate the average gradient for 200 samples and update your weights with it. That means you perform only 5 updates per epoch instead of 1000. On sufficiently big data sets, you will experience a much faster training process.
Michael Nielsen has a really nice way to explain this concept in his book.
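The arithmetic in the example above, together with the averaging of per-sample gradients, can be written out as a short sketch. The helper names and the four gradient values are hypothetical.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

def batch_gradient(per_sample_grads):
    """Average the per-sample gradients of one batch into a single gradient."""
    return sum(per_sample_grads) / len(per_sample_grads)

print(updates_per_epoch(1000, 200))              # 5
print(updates_per_epoch(1000, 1))                # 1000
print(batch_gradient([0.5, -0.25, 1.0, 0.75]))   # 0.5
```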

1 step = 1 gradient update. And each gradient update step requires one forward pass and one backward pass.
The stopping condition is generally left up to you and is arguably more art than science. Commonly you will plot (TensorBoard is handy here) your cost, your training accuracy, and periodically your validation-set accuracy. The high point of validation accuracy (or the low point of validation error) is generally a good point to stop. Depending on your dataset, validation accuracy may rise and at some point fall again, or it may simply flatten out, at which point the stopping condition often correlates with the developer's degree of impatience.
Here's a nice article on stopping conditions, a google search will turn up plenty more.
https://stats.stackexchange.com/questions/231061/how-to-use-early-stopping-properly-for-training-deep-neural-network
Another common approach to stopping is to drop the learning rate every time you compute that no change has occurred to validation accuracy for some "reasonable" number of steps. When you've effectively hit 0 learning rate, you call it quits.
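A minimal sketch of that learning-rate approach, assuming hypothetical values for the drop factor, the patience and the smallest useful learning rate (the function name `lr_schedule` is also made up). Keras provides a comparable ReduceLROnPlateau callback.

```python
def lr_schedule(val_accuracies, initial_lr, factor, patience, min_lr):
    """Return the learning rate used at each check, and the check where training stops."""
    lr, best, wait, lrs = initial_lr, float("-inf"), 0, []
    for check, acc in enumerate(val_accuracies):
        lrs.append(lr)
        if acc > best:
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                lr, wait = lr * factor, 0      # no progress: drop the learning rate
                if lr < min_lr:                # effectively zero: call it quits
                    return lrs, check
    return lrs, None

accs = [0.60, 0.70, 0.70, 0.70, 0.70, 0.70]    # made-up validation accuracies
print(lr_schedule(accs, initial_lr=0.1, factor=0.5, patience=2, min_lr=0.03))
```

With these made-up numbers the rate is halved after two checks without improvement, and training stops at check 5 when the next halving would take it below min_lr.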

The input function emits batches (when num_epochs=None, num_batches is infinite):
num_batches = num_epochs * (num_samples / batch_size)
One step processes one batch. If steps > num_batches, training stops after num_batches.
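The rule above can be expressed as a small helper. This only mirrors the formula given here and is not the Estimator's own code; the function name `steps_run`, the dataset sizes, and the rounding up of a final partial batch are assumptions.

```python
import math

def steps_run(steps, num_samples, batch_size, num_epochs=None):
    """How many training steps run before the input is exhausted or `steps` is hit."""
    if num_epochs is None:                 # the input repeats forever
        return steps
    num_batches = math.ceil(num_epochs * num_samples / batch_size)
    return min(steps, num_batches)

print(steps_run(steps=100, num_samples=60000, batch_size=50))                 # 100
print(steps_run(steps=100, num_samples=60000, batch_size=50, num_epochs=1))   # 100
print(steps_run(steps=100, num_samples=1000, batch_size=50, num_epochs=2))    # 40
```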
