Using squared difference of two images as loss function in tensorflow - tensorflow

I'm trying to use the SSD between two images as loss function for my network.
# h_fc2 is my output layer, y_ is my label image.
ssd = tf.reduce_sum(tf.square(y_ - h_fc2))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(ssd)
Problem is, that the weights then diverge and I get the error
ReluGrad input is not finite. : Tensor had Inf values
Why's that? I did try some other stuff like normalizing the ssd by the image size (did not work) or cropping the output values to 1 (does not crash anymore, but I still need to evaluate this):
ssd_min_1 = tf.reduce_sum(tf.square(y_ - tf.minimum(h_fc2, 1)))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(ssd_min_1)
Are my observations to be expected?
Edit:
#mdaoust suggestions proved to be correct. The main point was normalizing by batch size. This can be done independent of batch size by using this code
squared_diff_image = tf.square(label_image - output_img)
# Sum over all dimensions except the first (the batch-dimension).
ssd_images = tf.reduce_sum(squared_diff_image, [1, 2, 3])
# Take mean ssd over batch.
error_images = tf.reduce_mean(ssd_images)
With this change, only a slight decrease of the learning rate (to 0.0001) was necessary.

There are a lot of ways you can end up with non-finite results.
But optimizers, especially simple ones like gradient descent, can diverge if the learning rate is 'too high'.
Have you tried simply dividing your learning rate by 10/100/1000? Or normalizing by pixels*batch_size to get the average error per pixel?
Or one of the more advanced optimizers? For example tf.train.AdamOptimizer() with default options.

Related

My loss function starts to diverge in Tensorflow1. What can I do?

First of all I am new with DNN network. I am training a network using tensorflow 1. I used TF1 since I used a paper as benchmark of my code and it is written in TF1, although I tried to implement the same algorithm in TF2 I could not achieve the same performance that I achieved with the original version.
My problem is normally my loss function, which is the MSE, goes down however after a random number of epochs the loss function diverges to NaN. It is really strange since although the result diverges, If I train the network again with the last set of weights previous to divergence. The loss MSE continues to reduce until new divergence happens.
I read this type of error is related with numerical inestability. In the algorithm I am implementing, the network is feed several times before compute the loss function, with the network output is computed a new input which is used to feed again the network. The process is repeated several times and after that the loss function is computed. It is similar to the following code (I have simplified the code since it is quite long)
input = ...
for t in range(max_iter):
x1 = tf.nn.relu(input#A1+b1)
x1 = BatchNormalization()(x1)
x2 = tf.nn.relu(x1#A2+b2)
x2 = BatchNormalization()(x2)
x3 = tf.nn.relu(x2#A3+b3)
x3 = BatchNormalization()(x3)
output = x3#A4+b4
input = recompute_input(output)
compute_loss(input, true_values)
I suppose during this process some inputs of the network break the gradients, I have read and tryed the following solutions:
Standarize the network input
Changing the learning rate
Changing the number of batches
I have implemented all the previous solution but I have not solved the problem. Currently I am using Adam Optimizer with learning rage equal to 0.0001 and I keep having the same problem, randomly the algorithm diverges. I tried changing learning rate but it ends in divergence again so I do not how I should do.
How do you recommend me to proceed? In any case, there is some way to check the values of the tensors inside the "run" in order to analyze when the result diverges?
Thank you in advance!

Validation Loss doesnt Change

i am using a custom loss trying to decrease peak average power ratio of ofdm symbols. to break it down the input is of length N length that can take only 4 values. the output can take any floating value from [-1,1] (because i cant go over the power threshold). i generate the training and validation set randomly since it is.. the data can take any random combination of the 4 values.
The problem is changing and tweaking the model and parameters only improves the training loss, validation loss is constant from the first epoch.
I am using a custom loss function that only concatenates the output of the model and spreads it in the middle of the input and using ifft operation then calculating the max / mean of all elements.
in short its reserving some of the array elements (tones) to pick so that it removes the peaks of the input sacrificing those element but getting less peaks in the final signal.
i am sending input data as one hot encoded for each of the 4 values, and sending them once more as labels in their complex form so i can do operations on them in the custom loss function below.
def PAPR_Loss(y_true, y_pred):
Reserved_phases = [0, 32, 62, 93, 124, 155, 186, 217, 248]
data = tf.concat([tf.concat([y_true[:, Reserved_phases[i]:Reserved_phases[i+1]], tf.complex(y_pred[:, 4*(i+1)-4] - y_pred[:, 4*(i+1)-2], y_pred[:, 4*(i+1)-3] - y_pred[:, 4*(i+1)-1])[:, tf.newaxis]], 1) for i in range(L)], 1)
x = tf.signal.ifft(data)
temp = tf.square(tf.abs(x))
loss = tf.reduce_max(temp, axis=-1) / tf.reduce_mean(temp, axis=-1)
return 10*tf.experimental.numpy.log10(loss)
Loss and Validation Loss vs Epochs
i am using 80k unique data combinations as training and 20k different combinations as validation
also i am using dropout after each layer so i dont think its an overfitting problem.
when i remove the tanh activation at the output (meaning the output can take any values) i start getting improvements on the validation and better loss on training as well but i suspect this occurs because we just let the model add the mean power term which is inversly proportional to the loss but it doesnt learn where the peaks and how to cancel those peaks. it just increase the mean as much as possible so that the max isnt that big in relation to it anymore.
also could the model not train because of the concatenation and using input in a different form as a label? i thought i could get away with this since the input isnt trainable so it doesnt matter.
Note: The model doesnt even beat the classical method without using deep learning which just search in a candidate limited set for the best combinations that decrease this peaks. the problem with the classical model that it is computationally expensive if i can even match this performance this approach will be very rewarding.
what could be going wrong here? what can i try changing next?
Thanks in advance.

model.evaluate() varies wildly with number of steps when using generators

Running tensorflow 2.x in Colab with its internal keras version (tf.keras). My model is a 3D convolutional UNET for multiclass segmentation (not sure if it's relevant).
I've successfully trained (high enough accuracy on validation) this model the traditional way but I'd like to do augmentation to improve it, therefore I'm switching to (hand-written) generators. When I use generators I see my loss increasing and my accuracy decreasing a lot (e.g.: loss increasing 4-fold, not some %) in the fit.
To try to localize the issue I've tried loading my trained weights and computing the metrics on the data returned by the generators. And what's happening makes no sense. I can see that the results visually are ok.
model.evaluate(validationGenerator,steps=1)
2s 2s/step - loss: 0.4037 - categorical_accuracy: 0.8716
model.evaluate(validationGenerator,steps=2)
2s/step - loss: 1.7825 - categorical_accuracy: 0.7158
model.evaluate(validationGenerator,steps=4)
7s 2s/step - loss: 1.7478 - categorical_accuracy: 0.7038
Why would the loss vary with the number of steps? I could guess some % due to statistical variations... not 4 fold increase!
If I try
x,y = next(validationGenerator)
nSamples = x.shape[0]
meanLoss = np.zeros(nSamples)
meanAcc = np.zeros(nSamples)
for pIdx in range(nSamples):
y_pred = model.predict(np.expand_dims(x[pIdx,:,:,:,:],axis=0))
meanAcc[pIdx]=np.mean(tf.keras.metrics.categorical_accuracy(np.expand_dims(y[pIdx,:,:,:,:],axis=0),y_pred))
meanLoss[pIdx]=np.mean(tf.keras.metrics.categorical_crossentropy(np.expand_dims(y[pIdx,:,:,:,:],axis=0),y_pred))
print(np.mean(meanAcc))
print(np.mean(meanLoss))
I get accuracy~85% and loss ~0.44. Which is what I expect from the previous fit, and it varies by vary little from one batch to the other. And these are the same exact numbers that I get if I do model.evaluate() with 1 step (using the same generator function).
However I need about 30 steps to run trough my whole training dataset. What should I do?
If I fit my already good model to this generator it indeed worsen the performances a lot (it goes from a nice segmentation of the image to uniform predictions of 25% for each of the 4 classes!!!!)
Any idea on where to debud the issue? I've also visually looked at the images produced by the generator and at the model predictions and everything looks correct (as testified by the numbers I found when evaluating using a single step). I've tried writing a minimal working example with a 2 layers model but... in it the issue does not happen.
UPDATE: Generators code
So, as I've been asked, these are the generators code. They're handwritten
def dataGen (X,Y_train):
patchS = 64 #set the size of the patch I extract
batchS = 16 #number of samples per batch
nSamples = X.shape[0] #get total number of samples
immSize = X.shape[1:] #get the shape of the iamge to crop
#Get 4 patches from each image
#extract them randomly, and in random patient order
patList = np.array(range(0,nSamples),dtype='int16')
patList = patList.reshape(nSamples,1)
patList = np.tile(patList,(4,2))
patList[:nSamples,0]=0 #Use this index to tell the code where to get the patch from
patList[nSamples:2*nSamples,0]=1
patList[2*nSamples:3*nSamples,0]=2
patList[3*nSamples:4*nSamples,0]=3
np.random.shuffle(patList)
patStart=0
Xout = np.zeros((batchS,patchS,patchS,patchS,immSize[3])) #allocate output vector
while True:
Yout = np.zeros((batchS,patchS,patchS,patchS)) #allocate vector of labels
for patIdx in range(batchS):
XSR = 32* (patList[patStart+patIdx,0]//2) #get the index of where to extract the patch
YSR = 32* (patList[patStart+patIdx,0]%2)
xStart = random.randrange(XSR,XSR+32) #get a patch randomly somewhere between a range
yStart = random.randrange(YSR,YSR+32)
zStart = random.randrange(0,26)
patInd = patList[patStart+patIdx,1]
Xout[patIdx,:,:,:,:] = X[patInd,xStart:(xStart+patchS),yStart:(yStart+patchS),zStart:(zStart+patchS),:]
Yout[patIdx,:,:,:] = Y_train[patInd,xStart:(xStart+patchS),yStart:(yStart+patchS),zStart:(zStart+patchS)]
if((patStart+patIdx)>(patList.shape[0]-2)):
np.random.shuffle(patList) #after going through the whole list restart
patStart=0
patStart = patStart+batchS
Yout = tf.keras.utils.to_categorical (Yout, num_classes=4, dtype='float32') #convert to one hot encoding
yield Xout, Yout
Posting the workaround I've found for the future person coming here from google.
Apparently the issue lies in how keras calls a handwritten generator. When it was called multiple times in a row by using evaluate(gen, steps=N) apparently it returned wrong outputs. There's no documentation around about how to address this or how a generator should be written.
I ended up writing my code using a tf.keras.utils.sequence class and the same previous code now works perfectly. No way to know why.
Here are different factors that affect loss & accuracy:
For Accuracy, we know that it measures the accuracy of the prediction: i.e. correct-classes /total-classes.
While loss tracks the inverse-confidence of the prediction.
A high Loss indicates that although the model is performing well with the prediction, It is becoming uncertain of the prediction it is making.
For example, For an image classification scenario, The image of a cat is passed into two models. Model A predicts {cat: 0.8, dog: 0.2} and model B predicts {cat: 0.6, dog: 0.4}.
Both models will score the same accuracy, but model B will have a higher loss.
On your evaluation part, Based on the documentation
Steps: Integer or None. Total number of steps (batches of samples) before declaring the evaluation round finished. Ignored with the default value of None. If x is a tf.data dataset and steps is None, 'evaluate' will run until the dataset is exhausted. This argument is not supported by array inputs.
So for simplify, it's getting the Nth batch of your validation samples.
It could be that the model prediction is becoming uncertain since the majority of the unknown data falls on those specific steps. which in your case, steps 2 & 3.
So, As the evaluation steps progress, The prediction becomes more uncertain leading to a higher loss.
You might need to retrain your model with more training samples but of course, you need to be careful since you might encounter overfitting.
In terms of data augmentation, you might wanna check this link
In Training Perspective, proper data augmentation is one of the factors that leads to good model performance.

MLP output of first layer is zero after one epoch

I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 500M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP of all 3 datasets of 5k values, 500k values and 5M values.
Tested with number of epoches ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer is a linear activation because its a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
EDIT:
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
df = pd.read_csv('kinova_jaco_data_5k.csv', names = ['state0',
'state1',
'state2',
'state3',
'state4',
'state5',
'pose0',
'pose1',
'pose2',
'pose3',
'pose4',
'pose5'])
states = np.asarray(
[df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(), df.state3.to_numpy(), df.state4.to_numpy(),
df.state5.to_numpy()]).transpose()
poses = np.asarray(
[df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(), df.pose3.to_numpy(), df.pose4.to_numpy(),
df.pose5.to_numpy()]).transpose()
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_hidden_layers=2
n_units=[48, 48]
inputs = Input(shape=(6,), dtype= 'float32', name = 'input')
x = Dense(units=n_units[0], activation=relu, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
x = Dense(units=n_units[i], activation=activation, name='dense'+str(i+1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train,
y_train,
epochs=n_epochs,
verbose=1,
validation_data=(x_test, y_test),
batch_size=32)
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]² and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.
the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -𝜋 to 𝜋 probably with an expected value of 0, so it checks out.
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads the the aforementioned derping to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the output parameters causes a large change of the input. Let's say I'm trying to scratch my back with my left hand - the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good, a very small batch size will make the gradient too noisy
The loss function is not critical, both MSE and MAE should work
The activation functions aren't critical, ReLU is a good default choice.
The default initializers a good enough.
Normalizing is important for Dense layers, so keep it
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
Further reading: 37 Reasons why your Neural Network is not working
I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.

LSTM Sequence Prediction in Keras just outputs last step in the input

I am currently working with Keras using Tensorflow as the backend. I have a LSTM Sequence Prediction model shown below that I am using to predict one step ahead in a data series (input 30 steps [each with 4 features], output predicted step 31).
model = Sequential()
model.add(LSTM(
input_dim=4,
output_dim=75,
return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(
150,
return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(
output_dim=4))
model.add(Activation("linear"))
model.compile(loss="mse", optimizer="rmsprop")
return model
The issue I'm having is that after training the model and testing it - even with the same data it trained on - what it outputs is essentially the 30th step in the input. My first thought is the patterns of my data must be too complex to accurately predict, at least with this relatively simple model, so the best answer it can return is essentially the last element of the input. To limit the possibility of over-fitting I've tried turning training epochs down to 1 but the same behavior appears. I've never observed this behavior before though and I have worked with this type of data before with successful results (for context, I'm using vibration data taken from 4 points on a complex physical system that has active stabilizers; the prediction is used in a pid loop for stabilization hence why, at least for now, I'm using a simpler model to keep things fast).
Does that sound like the most likely cause, or does anyone have another idea? Has anyone seen this behavior before? In case it helps with visualization here is what the prediction looks like for one vibration point compared to the desired output (note, these screenshots are zoomed in smaller selections of a very large dataset - as #MarcinMożejko noticed I did not zoom quite the same both times so any offset between the images is due to that, the intent is to show the horizontal offset between the prediction and true data within each image):
...and compared to the 30th step of the input:
Note: Each data point seen by the Keras model is an average over many actual measurements with the window of the average processed along in time. This is done because the vibration data is extremely chaotic at the smallest resolution I can measure so instead I use this moving average technique to predict the larger movements (which are the more important ones to counteract anyway). That is why the offset in the first image appears as many points off instead of just one, it is 'one average' or 100 individual points of offset.
.
-----Edit 1, code used to get from the input datasets 'X_test, y_test' to the plots shown above-----
model_1 = lstm.build_model() # The function above, pulled from another file 'lstm'
model_1.fit(
X_test,
Y_test,
nb_epoch=1)
prediction = model_1.predict(X_test)
temp_predicted_sensor_b = (prediction[:, 0] + 1) * X_b_orig[:, 0]
sensor_b_y = (Y_test[:, 0] + 1) * X_b_orig[:, 0]
plot_results(temp_predicted_sensor_b, sensor_b_y)
plot_results(temp_predicted_sensor_b, X_b_orig[:, 29])
For context:
X_test.shape = (41541, 30, 4)
Y_test.shape = (41541, 4)
X_b_orig is the raw (averaged as described above) data from the b sensor. This is multiplied by the prediction and input data when plotting to undo normalization I do to improve the prediction. It has shape (41541, 30).
----Edit 2----
Here is a link to a complete project setup to demonstrate this behavior:
https://github.com/ebirck/lstm_sequence_prediction
That is because for your data(stock data?), the best prediction for 31st value is the 30th value itself.The model is correct and fits the data.
I also have similar experience predicting the stock data.
I feel I should post a follow-up, since it seems this post has been getting more attention than my other questions.
Ferret Zhang's answer is correct (and has been accepted), and I find this discovery is actually quite funny when you understand it in relation to stock / cryptocurrency data which some have commented about. What sequence prediction is ultimately doing is assigning statistical weights to different moves, to pick the highest probability move and 'predict' it will happen. In the case of stock data, in a vacuum it is (at least at this scale) completely random, there is equal probability of moving up or down, and hence the model predicts that it will stay the exact same.
The model, in a sense, learned that the best way to play is to not play at all :)