I implemented this paper's neural net, with some differences (img below), for EEG classification. train_on_batch performance is excellent, with very low loss, but test_on_batch performance, though on the same data, is poor: the net predicts '1' most of the time:
Class    TRAIN (loss, acc)    VAL (loss, acc)
'0'      (0.06269842, 1)      (3.7652588, 0)
'1'      (0.04473557, 1)      (0.3251827, 1)
Data is fed as 30-sec segments (12000 timesteps) (10 mins per dataset) from 32 (= batch_size) datasets at once (img below)
Any remedy?
Troubleshooting attempted:
Disabling dropout
Disabling all regularizers (except batch-norm)
Randomly, val_acc('0','1') = (~.90, ~.12) - then back to (0,1)
Additional details:
Keras 2.2.4 (TensorFlow backend), Python 3.6, Spyder 3.3.4 via Anaconda
CuDNN LSTM stateful
CNNs pretrained, LSTMs added on afterwards (and both trained)
BatchNormalization after every CNN & LSTM layer
reset_states() applied between different datasets
squeeze_excite_block inserted after every but last CNN block
UPDATE:
Progress was made; batch_normalization and dropout are the major culprits. Major changes:
Removed LSTMs, GaussianNoise, SqueezeExcite blocks (img below)
Implemented batch_norm patch
Added sample_weights to reflect class imbalance - varied between 0.75 and 2.
Trained with various warmup schemes for both MaxPool and Input dropouts
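The post does not say how the sample weights were chosen, so the following is only a sketch of one common heuristic: weight each sample inversely to its class frequency. The function name and the 24/8 class split are hypothetical.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Return one weight per sample, inversely proportional to class frequency.

    Uses the common "balanced" heuristic: n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[y] for y in labels]

# Hypothetical imbalanced batch: 24 samples of class 1, 8 of class 0.
labels = [1] * 24 + [0] * 8
weights = balanced_sample_weights(labels)
print(weights[0], weights[-1])  # weight of a class-1 sample, then a class-0 sample
```

A list like this can be passed as the sample_weight argument of train_on_batch. With this heuristic the weights sum to the batch size, so the overall loss scale stays comparable to the unweighted case.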
Considerable improvements were observed - but not nearly total. Train vs. validation loss behavior is truly bizarre - flipping class predictions, and bombing the exact same datasets it had just trained on:
Also, BatchNormalization outputs during train vs. test time differ considerably (img below)
UPDATE 2: All other suspicions were ruled out: BatchNormalization is the culprit. Using Self-Normalizing Networks (SNNs) with SELU & AlphaDropout in place of BatchNormalization yields stable and consistent results.
I've inadvertently left in one non-standardized sample (in a batch of 32), with sigma=52 - which severely disrupted the BN layers; post-standardizing, I no longer observe a strong discrepancy between train & inference modes - if anything, any differences are difficult to spot.
Furthermore, the entire preprocessing was very faulty; after redoing it correctly, the problem no longer recurs. As a debug tip, try identifying whether any particular train dataset sharply alters layer activations during inference.
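A quick way to catch the kind of slip described above is to check per-sample statistics before a batch reaches the network. This is a minimal sketch on synthetic data, not the original pipeline: the function name, tolerances, and batch contents are assumptions, and only the batch size of 32 and sigma=52 come from the post.

```python
import random
import statistics

def find_unstandardized(batch, mean_tol=0.5, std_tol=0.5):
    """Return indices of samples whose mean is far from 0 or whose std is far from 1."""
    flagged = []
    for i, sample in enumerate(batch):
        mu = statistics.mean(sample)
        sigma = statistics.pstdev(sample)
        if abs(mu) > mean_tol or abs(sigma - 1.0) > std_tol:
            flagged.append(i)
    return flagged

rng = random.Random(0)
# Hypothetical batch: 32 standardized samples of 1000 timesteps each...
batch = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(32)]
# ...except one sample left at its raw scale (sigma = 52, as in the post).
batch[7] = [rng.gauss(0.0, 52.0) for _ in range(1000)]

print(find_unstandardized(batch))

# Variance over the pooled batch, roughly what a normalization layer
# placed directly on the input would compute in training mode.
pooled = [v for sample in batch for v in sample]
print(statistics.pvariance(pooled))
```

The pooled variance comes out far above 1 (roughly 85 here), so a single raw-scale sample is enough to distort the batch statistics that BatchNormalization relies on.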
I am using TensorFlow 2.5.0 and have implemented a semantic segmentation network: DeepLab_v3_plus with a ResNet101 backbone, the Adam optimizer, and categorical cross-entropy loss. I first built the code for a single GPU and achieved a test accuracy (mean_iou) of 54% after training for 96 epochs. I then added tf MirroredStrategy (one machine) to support multi-GPU training. Surprisingly, with 2 GPUs and 48 epochs of training, test mean_iou is just 27%, and with 4 GPUs and 24 epochs it is around 12% on the same dataset.
Changes I made to go from single-GPU to multi-GPU training:
Following the TensorFlow blog on distributed training, I created a mirrored strategy and put model creation, model compilation, and the dataset_generator inside the strategy scope. As I understand it, model.fit() will then take care of synchronizing gradients and distributing data to each GPU. Although the code ran without errors and training time dropped compared to a single GPU for the same number of training images, test mean_iou kept getting worse as the number of GPUs increased.
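One thing worth checking in a setup like this is how many weight updates each run actually performs. The sketch below assumes the per-GPU batch size was kept fixed, so that the global batch grows with the number of GPUs; the post does not state this. The dataset size and batch size are hypothetical, and only the GPU counts and epoch counts come from the post.

```python
import math

def optimizer_steps(num_examples, per_replica_batch, num_gpus, epochs):
    """Total number of weight updates under synchronous data parallelism."""
    global_batch = per_replica_batch * num_gpus
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return steps_per_epoch * epochs

# Hypothetical dataset size and per-GPU batch size; (gpus, epochs) pairs are from the post.
NUM_EXAMPLES = 10_000
PER_REPLICA_BATCH = 8

for gpus, epochs in [(1, 96), (2, 48), (4, 24)]:
    steps = optimizer_steps(NUM_EXAMPLES, PER_REPLICA_BATCH, gpus, epochs)
    print(gpus, "GPU(s):", steps, "steps")
```

Under that assumption, halving the epochs while doubling the GPUs cuts the number of updates to roughly a quarter each time, which could by itself explain part of the drop in mean_iou.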
Replaced BatchNormalization with SyncBatchNormalization, but no improvement.
Used a warmup learning rate with linear scaling of the learning rate by the number of GPUs, but saw no improvement.
In the cross-entropy loss, tried both losses_utils.ReductionV2.AUTO and losses_utils.ReductionV2.NONE:
loss = ce(y_true, y_pred)
# reshape loss for each sample (BxHxWxC -> BxN)
# Normalize loss by number of non zero elements and sum for each sample and mean across all samples.
With the .AUTO/.NONE options I am not scaling the loss by global_batch_size, on the understanding that TF will take care of it and that I am already normalizing on each GPU. Neither option helped.
Changed data_generator to a tf.data.Dataset object. This reduced training time, but test mean_iou became even worse.
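On the loss-scaling point above: when per-replica losses are summed across replicas, each replica has to divide by the global batch size, not its local one, to reproduce the single-GPU mean. The sketch below shows only the arithmetic, with hypothetical loss values. Whether this is what went wrong in the post is not known, since built-in Keras losses used with model.fit() normally handle the scaling themselves.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical per-example losses on two replicas (global batch of 4).
replica_losses = [[0.2, 0.4], [0.6, 0.8]]
global_batch_size = sum(len(r) for r in replica_losses)

# Reference: plain mean over the whole global batch, as on a single GPU.
single_device = mean([loss for r in replica_losses for loss in r])

# Each replica divides its summed loss by the GLOBAL batch size; replicas are then summed.
scaled = sum(sum(r) / global_batch_size for r in replica_losses)

# Each replica takes its own local mean; summing those overcounts by the replica count.
unscaled = sum(mean(r) for r in replica_losses)

print(single_device, scaled, unscaled)
```

In a custom training loop, tf.nn.compute_average_loss(per_example_loss, global_batch_size=...) performs the first kind of scaling. A hand-written normalization inside a custom loss (such as dividing by the number of non-zero elements) is worth checking against this reference.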
I would appreciate any lead or suggestion for improving test_iou in distributed training.
Let me know if you need any additional details.
Thank you
I am dealing with an object detection problem and using a model that is known to work (its results have been published in a paper and I have the original code). Originally, the code was written with Keras 2.2.4 without importing TensorFlow, and it was trained and tested on the same dataset that I am using at the moment. However, when I try to run the same model with TensorFlow 2.x it just won't learn a thing.
I have tried importing everything from TensorFlow 2.4, but I have the same problem if I import everything (layers, models, optimizers...) from Keras 2.4. I have tried this on two different devices, both using a GPU. What happens is that the loss decreases ridiculously fast, but the accuracy won't increase at all (or, if it does, it gets stuck around 10% or so). Also, every now and then this happens from one epoch to the next:
Loss undergoes HUGE jumps between consecutive epochs, and all this without any changes in accuracy
I have tried to train the network on another dataset (had to change the last layers in order to match the required dimensions) and the model seemed to be learning in a normal way, i.e. the accuracy actually increases and the loss doesn't reach 0.0x in one epoch.
I can't post the script, but the model is an Encoder-Decoder network: consecutive Convolutions with increasing number of filters reduce the dimensions of the image, and a specular path of Transposed Convolutions restores the original dimensions. So basically the network only contains:
1. Conv2D
2. Conv2DTranspose
3. BatchNormalization
4. Activation("relu")
5. Activation("sigmoid")
6. concatenate
6 is used to put together outputs from parallel paths or distant layers; 3 and 4 are used after every Conv or ConvTranspose; 5 is only used as final activation function, i.e. as output layer.
I think the problem is pretty generic and I am honestly surprised that I couldn't find a single question about it. What could be happening here? The problem must have something to do with TF/Keras versions, but I can't find any documentation about it, and I have tried changing many things without any effect. It's crazy because, if I didn't know that the model works, I would try to rewrite it from scratch. I am afraid that this problem may occur with a new network and that I won't be able to tell whether it's the libraries or the model itself.
Thank you in advance! :)
EDIT
Code snippets:
Convolutional block:
encoder1 = Conv2D(filters=first_layer_channels, kernel_size=2, strides=2)(input)
encoder1 = BatchNormalization()(encoder1)
encoder1 = Activation('relu')(encoder1)
Decoder
decoder1 = Conv2DTranspose(filters=first_layer_channels, kernel_size=2, strides=2)(encoder4)
decoder1 = BatchNormalization()(decoder1)
decoder1 = Activation('relu')(decoder1)
Final layers:
final = Conv2D(filters=total, kernel_size=1)(decoder4)
final = BatchNormalization()(final)
Last_Conv = Activation('sigmoid')(final)
The task is human pose estimation: the network (which, I recall, works on this specific task with Keras 2.2.4) has to predict twenty binary maps containing the positions of specific keypoints.
I am starting to learn Convolutional Neural Networks and have designed the famous MNIST and fashion-MNIST models and obtained good accuracy.
But then I moved to another trivial dataset, the cats vs. dogs dataset from Kaggle. After applying all the concepts I learned from the Stanford lectures and Andrew Ng's lectures, I was only able to get 80% accuracy. So I decided to try GoogleNet and AlexNet, but these models were not able to give me an accuracy above 50% after 6 epochs.
I wanted to know whether GoogleNet and AlexNet, being designed for the 1000-category ImageNet output, won't work on a 2-category output.
With my own model I obtained an accuracy of 80%. I expected the famous GoogleNet model to give me higher accuracy, but that's not the case.
Below is the GoogleNet model that I am using:
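import cv2
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, MaxPooling2D

data=[]
labels=[]
# Load each cat/dog pair, resize to 224x224; label cats 0 and dogs 1
for i in range(0,12499):
    img=cv2.imread("train/cat."+str(i)+".jpg")
    res = cv2.resize(img, dsize=(224, 224), interpolation=cv2.INTER_CUBIC)
    data.append(res)
    labels.append(0)
    img2=cv2.imread("train/dog."+str(i)+".jpg")
    res2 = cv2.resize(img2, dsize=(224,224),interpolation=cv2.INTER_CUBIC)
    data.append(res2)
    labels.append(1)
# Hold out 20% of the images for validation
train_data, test_data,train_labels, test_labels = train_test_split(data,
                                                                  labels,
                                                                  test_size=0.2,
                                                                  random_state=42)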
model=tf.keras.Sequential()
model.add(layers.Conv2D(64,kernel_size=3,activation='relu', input_shape=
(224,224,3)))
model.add(layers.Conv2D(64,kernel_size=3,activation='relu'))
model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))
model.add(layers.Conv2D(128,kernel_size=3,activation='relu'))
model.add(layers.Conv2D(128,kernel_size=3,activation='relu'))
model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))
model.add(layers.Conv2D(256,kernel_size=3,activation='relu'))
model.add(layers.Conv2D(256,kernel_size=3,activation='relu'))
model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))
model.add(layers.Conv2D(512,kernel_size=3,activation='relu'))
model.add(layers.Conv2D(512,kernel_size=3,activation='relu'))
model.add(layers.Conv2D(512,kernel_size=3,activation='relu'))
model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))
model.add(layers.Conv2D(512,kernel_size=3,activation='relu'))
model.add(layers.Conv2D(512,kernel_size=3,activation='relu'))
model.add(layers.Conv2D(512,kernel_size=3,activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dense(4096,activation='relu'))
model.add(Dense(4096,activation='relu'))
model.add(Dense(2,activation='softmax'))
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='sparse_categorical_crossentropy',metrics=['accuracy'])
model.fit(x=train_data,y=train_labels,batch_size=32,epochs=10,
          validation_data=(test_data,test_labels))
The expected accuracy of the above google model should be more than 50%, but it's ranging between 50% and 51% after 6 epochs.
p.s I changed the last dense layer to 2 instead of 1000, and I am using Keras API for tensor flow.
Any help would be appreciated.
I struggled a bit with this earlier as well. I haven't tried it on GoogleNet yet, but I tried it on AlexNet. On AlexNet I managed to get relatively OK results (83%) for cats vs. dogs after following the paper closely. A few things you may want to do:
If you refer to the CS231n slides from Fei-Fei Li (http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf), you will notice on slide 10 that the input layer should be 227 by 227 instead. They also provide the mathematical justification for why this is so.
I started to follow other items closely from the original paper (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf). These included:
As in section 3.3 of the paper, adding a normalization layer after each of the first two max pooling layers. Keras has stopped supporting LRN, but I added batch normalization and it works. (I ran an experiment with one model with batch normalization and one without. The accuracy difference is 82% versus 62%.)
As in the paper section 4.2, I added two dropout layers (0.5) at the end of the two fully connected layers.
As in the paper section 5, I changed my batches to 128, SGD momentum of 0.9 and weight decay of 0.0005
As pointed out above in one of the comments on your original question, my final layer was also a single unit with a sigmoid function.
Training for 20 epochs gave me an 83% accuracy. The original paper included data augmentation, but I did not include it in my implementation.
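The 227-by-227 point above follows from the usual convolution output-size formula, (W - F + 2P) / S + 1. A small sketch, using the first AlexNet layer's 11x11 filters with stride 4 and no padding (the function name is my own):

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (input_size - kernel_size + 2 * padding) / stride + 1

# First AlexNet layer: 11x11 filters, stride 4, no padding.
for width in (224, 227):
    print(width, "->", conv_output_size(width, 11, 4))
# 224 -> 54.25  (not a whole number of positions)
# 227 -> 55.0
```

Only the 227-pixel input gives a whole number of filter positions, which matches the 55x55 feature map quoted for the first layer.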
Keras has a modified GoogleNet example. It is adapted from the Xception architecture, which I believe is one of the derivatives of the Inception architecture.
https://keras.io/examples/vision/image_classification_from_scratch/
I have tried it and after running for 15 epochs, accuracy is about 90%
Hope this helps.
In my current project I'm using a TF Hub image module along with an estimator for a classification problem. As per the TF Hub guidelines, I set the tags to "train" in training mode and to None in Eval/Predict modes. Test loss/accuracy was very bad even though training loss kept decreasing. After debugging for days I learnt that somehow the hub's trained model weights were not being used (it seemed that only the last dense layer outside the hub module was being reused).
To confirm where the problem is I did not pass "train" tags even for training (with no other changes) - and the problem was immediately resolved.
Grateful for all the help - many thanks!
#inside model_fn
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
# Use the "train" graph variant only while training
tags_val = None
if is_training:
    tags_val = {"train"}
tf_hub_model_spec = "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1"
img_module = hub.Module(tf_hub_model_spec, trainable=is_training, tags=tags_val)
#Add final dense layer, etc
For https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1, the difference between default tags (meaning the empty set) and tags={"train"} is that the latter operates batch norm in training mode (i.e., using batch statistics for normalization). If that leads to catastrophic quality loss, my first suspicion would be: are UPDATE_OPS being run with the train_op?
https://github.com/tensorflow/hub/issues/24 discusses that on the side of other issues, with code pointers.
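To see why skipped update ops matter, here is a toy pure-Python batch norm, not TF code. The run_update_ops flag stands in for running UPDATE_OPS together with the train_op, which in TF1 graph code is usually done by creating the train op under tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)). The class, the data, and the initial moving statistics of 0 and 1 are all assumptions for illustration; a hub module would start from its pretrained statistics instead.

```python
import random
import statistics

class TinyBatchNorm:
    """Scalar batch norm with moving averages, without learned scale/offset."""

    def __init__(self, momentum=0.9, eps=1e-3):
        self.momentum, self.eps = momentum, eps
        self.moving_mean, self.moving_var = 0.0, 1.0

    def train_step(self, batch, run_update_ops=True):
        # Training mode normalizes with the statistics of the current batch.
        mean, var = statistics.mean(batch), statistics.pvariance(batch)
        if run_update_ops:  # stands in for running UPDATE_OPS with the train_op
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]

    def infer(self, batch):
        # Inference mode normalizes with the stored moving averages.
        scale = (self.moving_var + self.eps) ** 0.5
        return [(x - self.moving_mean) / scale for x in batch]

rng = random.Random(0)
# Hypothetical activations with mean 5 and standard deviation 2.
data = [[rng.gauss(5.0, 2.0) for _ in range(64)] for _ in range(200)]

updated, stale = TinyBatchNorm(), TinyBatchNorm()
for batch in data:
    updated.train_step(batch, run_update_ops=True)
    stale.train_step(batch, run_update_ops=False)

print(statistics.mean(updated.infer(data[0])))  # close to 0
print(statistics.mean(stale.infer(data[0])))    # far from 0: stale statistics
```

Both layers behave identically in training mode, so training loss looks fine either way. Only the one whose moving averages were updated normalizes correctly at inference.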
I am using Keras with TensorFlow backend to train CNN models.
What is the difference between model.fit() and model.evaluate()? Which one should I ideally use? (I am using model.fit() as of now.)
I know the utility of model.fit() and model.predict(). But I am unable to understand the utility of model.evaluate(). Keras documentation just says:
It is used to evaluate the model.
I feel this is a very vague definition.
fit() is for training the model with the given inputs (and corresponding training labels).
evaluate() is for evaluating the already trained model using the validation (or test) data and the corresponding labels. Returns the loss value and metrics values for the model.
predict() is for the actual prediction. It generates output predictions for the input samples.
Let us consider a simple regression example:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# input and output: y is a noisy linear function of x
x = np.random.uniform(0.0, 1.0, (200))
y = 0.3 + 0.6*x + np.random.normal(0.0, 0.05, len(x))
Now let's apply a regression model in Keras:
# A simple regression model
model = Sequential()
model.add(Dense(1, input_shape=(1,)))
model.compile(loss='mse', optimizer='rmsprop')
# The fit() method - trains the model
model.fit(x, y, epochs=1000, batch_size=100)
Epoch 1000/1000
200/200 [==============================] - 0s - loss: 0.0023
# The evaluate() method - gets the loss statistics
model.evaluate(x, y, batch_size=200)
# returns: loss: 0.0022612824104726315
# The predict() method - predict the outputs for the given inputs
model.predict(np.expand_dims(x[:3],1))
# returns: [ 0.65680361],[ 0.70067143],[ 0.70482892]
In Deep learning you first want to train your model. You take your data and split it into two sets: the training set, and the test set. It seems pretty common that 80% of your data goes into your training set and 20% goes into your test set.
Your training set gets passed into your call to fit() and your test set gets passed into your call to evaluate(). During the fit operation a number of rows of your training data are fed into your neural net (based on your batch size). After every batch is sent the fit algorithm does back propagation to adjust the weights in your neural net.
After this is done your neural net is trained. The problem is sometimes your neural net gets overfit which is a condition where it performs well for the training set but poorly for other data. To guard against this situation you run the evaluate() function to send new data (your test set) through your neural net to see how it performs with data it has never seen. There is no training occurring, this is purely a test. If all goes well then the score from training is similar to the score from testing.
fit(): Trains the model for a given number of epochs (this is for training time, with the training dataset).
predict(): Generates output predictions for the input samples (this is for somewhere between training and testing time).
evaluate(): Returns the loss value & metrics values for the model in test mode (this is for testing time, with the testing dataset).
While the answers above explain what fit(), evaluate() and predict() do, the more important point to keep in mind, in my opinion, is which data you should use for fit() and evaluate().
The clearest guideline I came across is on Machine Learning Mastery, in particular this quote:
Training set: A set of examples used for learning, that is to fit the parameters of the classifier.
Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.
Test set: A set of examples used only to assess the performance of a fully-specified classifier.
: By Brian Ripley, page 354, Pattern Recognition and Neural Networks, 1996
You should not use the data that you used to train or tune the model (training and validation data) to evaluate the performance (generalization) of your fully trained model with evaluate().
The test data used for evaluate() should be unseen, that is, not used for training with fit(), in order to be a reliable indicator of generalization.
For predict() you can use just one or a few examples that you choose (from anywhere) to get a quick check or answer from your model. I don't believe it can be used as the sole measure of generalization.
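The three roles quoted above can be kept apart with a single shuffled split. This is a minimal sketch with a hypothetical helper; the 10%/20% fractions are only an example.

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.2, seed=42):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 10 20
```

The train and val parts would go to fit() (the latter via validation_data), and the test part would be reserved for a final evaluate() call.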
One thing which was not mentioned here, I believe needs to be specified. model.evaluate() returns a list which contains a loss figure and an accuracy figure. What has not been said in the answers above, is that the "loss" figure is the sum of ALL the losses calculated for each item in the x_test array. x_test would contain your test data and y_test would contain your labels. It should be clear that the loss figure is the sum of ALL the losses, not just one loss from one item in the x_test array.
I would say the mean of losses incurred from all iterations, not the sum. But sure, that's the most important information here, otherwise the modeler would be slightly confused.
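The difference between the two readings is easy to check by hand on a few made-up values. This is a plain-Python sketch of mean squared error, not a call into Keras:

```python
def squared_errors(y_true, y_pred):
    """Per-sample squared error."""
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

losses = squared_errors(y_true, y_pred)
print("sum :", sum(losses))                # 1.5
print("mean:", sum(losses) / len(losses))  # 0.375
```

The figure reported by evaluate() for 'mse' corresponds to the second line, an average over samples, so it does not grow with the size of x_test.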