I am training a GAN with text data. When I train the discriminator, I randomly sample m positive data from the dataset and generate m negative data with the generator. I found many papers mention about details of implementation such as training epochs. About the training epochs, I have a question about sampling positive data:
Sample from the dataset (maybe shuffled) in order, when the whole dataset is covered, we call 1 epoch
Just as I did, randomly sample positive data, when the total amount of sampled data is the same size as the dataset, we call 1 epoch
Which one is right? or which one is commonly used? or which one is better?
In my opinion, an epoch is when you passed through the whole training data once. and I think in the paper also they mean a pass through the whole training set when they mention an epoch.
However, the epoch can be also defined as after processing k elements, where k can be smaller than n (the size of the training set). Such a definition might make sense when you want to capture get some evaluation about your model on the dev set, and you normally do it after every single epoch.
After all, that is my opinion and my understandings of GAN papers.
Good luck!
Related
I am working on deep learning model to detect regions of timesteps with anomalies. This model should classify each timestep as possessing the anomaly or not.
My labels are something like this:
labels = [0 0 0 1 0 0 0 0 1 0 0 0 ...]
The 0s represent 'normal' timesteps and the 1s represent the existence of an anomaly. In reality, my dataset is very very imbalanced:
My training set consists of over 7000 samples, where only 1400 samples = 20% of those contain at least 1 anomaly (timestep = 1)
I am feeding samples with 4096 timesteps each. The average number of anomalies, in the samples that contain them, is around 2. So, assuming there is an anomaly, the % of anomalous timesteps ranges from 0.02% to 0.04% for each sample.
With that said, I do need to shift from the standard binary cross entropy to something that highlights the anomalous timesteps from the anomaly free timesteps.
So, I experimented adding weights to the anomalous class in such a way that the model is forced to learn from the anomalies and not just reduce its loss from the anomaly-free timesteps. It actually worked well and the model seems to learn to detect anomalous timesteps. One problem however is that training can become quite unstable (and unpredictable), with sudden loss spikes appearing and affecting the learning process. Below, you can see the effects on the loss and metrics charts for two of my trainings:
After going through a debugging process for the trainings, I am confident that the problem comes from ocasional predictions given for the anomalous timesteps. That is, in some samples of a certain epoch, and in some anomalous timesteps, the model is giving a very low prediction, e.g. 0.01, for the 1s label (should be close to 1 ofc). Considering the very high (but supposedly necessary) weights given to the anomalous timesteps, the penaly is really extreme and the loss just skyrockets.
Going deeper, if I inspect the losses of the sample where the jump happened and look for the batch right before the loss jumped, I see that the losses are all around 10^-2 - 0.0053, 0.004, 0.0041... - not a single sample with a loss over those values. Overall, an average loss of 0.005. However, if I inspect the loss of the following batch, in that same sample, the avg. loss of the batch is already 3.6, with a part of the samples with a low loss but another part with a very high loss - e.g. 9.2, 7.7, 8.9... I can confirm that all the high losses come from the penalties given at predicting the 1s timesteps. The following batches of the same sample and some of the batches of the next epoch get affected and take some time to start decreasing again and going back to a stable learning process.
With this said, I am having this problem for some weeks already and really need some guidance in what I could try to deal with the spikes, which I assume that arise on the gradient updates associated with anomalous timesteps that are harder to learn.
I am currently using a simple 2-layer keras LSTM model with 64 units each and a dense as the last layer with a 1 unit dense layer with sigmoid activation. As for the optimizer I am using Adam. I am training with batch size 128. Some things to consider also:
I have tried changes in weights and other loss functions. Ultimately, if I reduce the weights given to the anomalous timesteps the model doesn't give so much importance to them and the loss reduces by considering only the anomalous free timesteps. I have also considered focal binary cross entropy loss but it doesn't seem to do anything that could avoid those jumps as, in the end, it is all about adding or reducing weights for certain timesteps.
My current learning rate is the Adam's default, 10⁻3. I have tried reducing the learning rate which leads to less impactful spikes (they're still there though) but the model also takes much more time or gets stuck. Not sure if it would be the way to go in this case, as the training seems to go well except for these cases. Decaying learning rate might also not make too much sense as the spikes can happen earlier in the training and not only on later epochs. Not sure if this is the way to go.
I am still investigating gradient clipping as a solution. I am still not sure on what values to use and if it is actually an effective solution for my case, but from what I understood of it, it should allow to counter those jumps resulting from those 'almost' exploding gradients.
The spikes could originate from sample noise / bad samples. However, since I am already using batch size 128 and I have already tested training with simple synthetic samples I have created and the spikes were still there, I guess it is not a problem with specific samples.
The imbalance obviously plays the bigger role here. Not sure if undersampling the majority class of samples of 4096 timesteps (like increasing from 20% to 50% the amount of samples with at least an anomalous timestep) would make a big difference here since each sample of timesteps is by itself very imbalanced as it contains around 2 timesteps with anomalies. It is a problem with the imbalance within each sample.
I know it might be quite some context but honestly I am already into my limit of trying stuff for weeks.
The solutions I am inclined to go for next are either gradient clipping or just changing my samples to be more centered around the anomalous timesteps, in such a way that it contains less anomaly free timesteps and hopefully allows for convergence without having to apply such drastic weights to anomalous timesteps. This last option is more difficult for me to opt for due to some restrictions, but I might look at it if I have nothing else available.
What do you think? I am able to provide more information if needed.
I'm currently working on a multi-class classification problem which is highly imbalanced. I want to save my model weights for best epoch but I'm confused on which metric I should choose?
Here's my training progress bar :
I am using ModelCheckpoint callback in tf.keras and monitoring val_loss as a metric to save best model weights.
As you can see in the image,
At 8th epoch I got an val_acc = 0.9845 but val_loss = 0.629 and precision and recall is also high here.
But at 3rd epoch I got val_acc = 0.9840 but val_loss = 0.590
I understand the difference is not huge but in such cases what's the ideal metric to believe on imbalanced dataset?
The most important factors are the the validation and training error.
If the validation loss (error) is going to increase so means overfitting. You must set the number of epochs as high as possible and avoid the overfitting and terminate training based on the error rates. . As long as it keeps dropping training should continue. Till model start to converge at nth epochs. Indeed it should converge quite well to a low val_loss.
Just bear in mind an epoch is one learning cycle where the learner can see the whole training data set. If you have two batches, the learner needs to go through two iterations for one epoch.
This link can be helpful.
You can divide data in 3 data sets, training, validation and evaluation. Train each network along enough number of epochs to track the training Mean Squared Error to be stuck in a minimum.
The training process uses training data-set and should be executed epoch by epoch, then calculate the Mean Squared Error of the network in each epoch for the validation set. The network for the epoch with the minimum validation MSE is selected for the evaluation process.
This can happen for several reasons. Assuming you have used proper separation of train, test and validation set and preprocessing of datasets like min-max scaler, adjusting missing values, you can do the following.
First run the model for several epoch and plot the validation loss graph.
If the loss is first reducing and after reaching a certain point it is now increasing, if the graph is in U shape, then you can do early stopping.
In other scenario, when loss is steadily increasing, early stopping won't work. In this case, add dropout layer of 0.2-0.3 in between the major layers. This will introduce randomness in the layers and will stop the model from memorising.
Now once you add dropouts, your model may suddenly start to behave strange. Tweak with activation functions and number of output nodes or Dense layer and it will eventually get right.
I was writing a neural net to train Resnet on CIFAR-10 dataset.
The paper Deep Residual Learning For Image Recognition mentions training for around 60,000 epochs.
I was wondering - what exactly does an epoch refer to in this case? Is it a single pass through a minibatch of size 128 (which would mean around 150 passes through the entire 50000 image training set?
Also how long is this expected to take to train(assume CPU only, 20-layer or 32-layer ResNet)? With the above definition of an epoch, it seems it would take a very long time...
I was expecting something around 2-3 hours only, which is equivalent to about 10 passes through the 50000 image training set.
The paper never mentions 60000 epochs. An epoch is generally taken to mean one pass over the full dataset. 60000 epochs would be insane. They use 64000 iterations on CIFAR-10. An iteration involves processing one minibatch, computing and then applying gradients.
You are correct in that this means >150 passes over the dataset (these are the epochs). Modern neural network models often take days or weeks to train. ResNets in particular are troublesome due to their massive size/depth. Note that in the paper they mention training the model on two GPUs which will be much faster than on the CPU.
If you are just training some models "for fun" I would recommend scaling them down significantly. Try 8 layers or so; even this might be too much. If you are doing this for research/production use, get some GPUs.
My training data consists of about ~700 unique samples (this is for a regression problem). The data is not shuffled, so the first N samples have the same label (say, the value 1.25), then the next M samples have a the same label (say, 2.99), etc. In total there's around 15 unique labels.
I'm using a simple CNN, as the input is an image (64x64x3). Even with no dropout or any other form of regularization, I can't get the training loss to stabilize close to zero.
What is this pattern of the learning loss an indication of? (gray line is the training loss, orange line is the validation loss).
The only indication you can get from such pattern is that the learning rate is too large, you should decrease it until the loss starts to decrease.
It seems that your learning rate is
too large, making your parameters oscillate wildly.
Things that I recommend at that point would be to:
Decrease your initial learning rate
Try another optimizer with some sort of learning rate decay (e.g. ADAM which worked good for me in such cases)
In training keras/tensorflow networks I observe losses in validation datasets much lower than in training datasets - by orders of magnitude. I am using the adam optimizer and mean_squared_error loss. The behavior is consistent over different network types and dataset sizes. I would expect mean loss in validation to greater than in train. Perhaps mean_squared_error is erroneously absolute squared error, which could account for lower values since validation is typically smaller than train? If not then what explains this behavior?
This is in part due to the fact that training error for an epoch is computed as an average over batches of training data, while validation error is computed at the end of the epoch for the entire validation set. Since training makes progress within an epoch, training error over the first batches is typically higher than over the last batches. It should not make a difference of orders of magnitude though.
It could be that your training and validation data are fundamentally different, or that you haven't preprocessed the data correctly. Make sure you standardize your data. Moreover, if you are not working with timeseries, randomly shuffle your dataset before you split into train/val set.