I am using a custom loss to reduce the peak-to-average power ratio (PAPR) of OFDM symbols. To break it down: the input has length N, and each element can take only 4 values. The output can take any floating-point value in [-1, 1] (because I can't exceed the power threshold). I generate the training and validation sets randomly, since the data can be any random combination of the 4 values.
The problem is that changing and tweaking the model and its parameters only improves the training loss; the validation loss is constant from the first epoch.
The custom loss function takes the output of the model, spreads it across the input at reserved positions, applies an IFFT, and then computes the max / mean over all elements.
In short, it reserves some of the array elements (tones) for the model to choose so that they cancel the peaks of the input, sacrificing those elements in exchange for fewer peaks in the final signal.
I am sending the input data one-hot encoded for each of the 4 values, and sending it once more as labels in complex form so that I can operate on it in the custom loss function below.
def PAPR_Loss(y_true, y_pred):
    Reserved_phases = [0, 32, 62, 93, 124, 155, 186, 217, 248]
    data = tf.concat([tf.concat([y_true[:, Reserved_phases[i]:Reserved_phases[i+1]], tf.complex(y_pred[:, 4*(i+1)-4] - y_pred[:, 4*(i+1)-2], y_pred[:, 4*(i+1)-3] - y_pred[:, 4*(i+1)-1])[:, tf.newaxis]], 1) for i in range(L)], 1)
    x = tf.signal.ifft(data)
    temp = tf.square(tf.abs(x))
    loss = tf.reduce_max(temp, axis=-1) / tf.reduce_mean(temp, axis=-1)
    return 10*tf.experimental.numpy.log10(loss)
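As a sanity check on a loss like the one above, a minimal NumPy sketch of the same max / mean computation can be run on inputs whose PAPR is known in advance. This is not the TensorFlow loss itself; `papr_db` and the test vectors are illustrative names, and the reserved-tone insertion is left out. It also shows that scaling the whole frequency-domain vector leaves the PAPR unchanged, so only the relative size of the reserved tones can matter.

```python
import numpy as np

def papr_db(freq_symbols):
    """PAPR in dB of the time-domain signal obtained by an IFFT along the last axis."""
    x = np.fft.ifft(freq_symbols, axis=-1)
    power = np.abs(x) ** 2
    return 10.0 * np.log10(power.max(axis=-1) / power.mean(axis=-1))

n = 64
single_tone = np.zeros(n, dtype=complex)
single_tone[3] = 1.0                      # one active subcarrier: constant envelope
all_equal = np.ones(n, dtype=complex)     # identical symbols: a single time-domain spike

print(papr_db(single_tone))               # close to 0 dB
print(papr_db(all_equal))                 # close to 10*log10(n)

# Scaling every tone by the same factor does not change the ratio.
rng = np.random.default_rng(0)
random_symbols = rng.choice([1, -1, 1j, -1j], size=n)
print(np.isclose(papr_db(random_symbols), papr_db(3.0 * random_symbols)))
```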
Loss and Validation Loss vs Epochs
I am using 80k unique data combinations for training and 20k different combinations for validation.
I am also using dropout after each layer, so I don't think it's an overfitting problem.
When I remove the tanh activation at the output (meaning the output can take any value), I start getting improvements on the validation loss and a better training loss as well. I suspect this happens only because the model is then free to increase the mean power term, which is inversely proportional to the loss; it doesn't learn where the peaks are or how to cancel them. It just increases the mean as much as possible so that the max is no longer large relative to it.
Also, could the model be failing to train because of the concatenation and because I pass the input in a different form as the label? I thought I could get away with this since the input isn't trainable, so it shouldn't matter.
Note: The model doesn't even beat the classical method without deep learning, which just searches a limited candidate set for the combinations that best reduce the peaks. The problem with the classical method is that it is computationally expensive, so even matching its performance with this approach would be very rewarding.
What could be going wrong here? What can I try changing next?
Thanks in advance.
I have a data-set without labels, but I do have a way to get pairs of examples with opposite labels, that is given a pair x,z I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input, and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x,z) and outputs pairs of predictions, f(x), f(z). I then use a custom loss function that drives the model towards the correct direction: Given that a regular binary classifier is trained using the Binary Cross Entropy (BCE) to make the predicted and desired output "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So, the loss function I give the model.compile method is:
from tensorflow import keras
bce = keras.losses.BinaryCrossentropy()
def neg_sym_bce(y1, y2):
    return -0.5 * (bce(y1, y2) + bce(y2, y1))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot (see below) the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the 0.5, 0.5 point. It is also evident that the global minima are, as expected, around the points 0,1 and 1,0.
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
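The plateau can be checked numerically without Keras. The sketch below is a NumPy re-implementation of the loss for a single pair (the helper names `bce` and `grad`, the clipping constant, and the finite-difference step are assumptions made for illustration). It suggests that the gradient is zero at (0.5, 0.5), so a model that starts with both outputs near 0.5 receives almost no signal, while slightly off-centre the gradient pushes the two outputs apart.

```python
import numpy as np

def bce(t, p, eps=1e-7):
    """Binary cross-entropy of prediction p against target t (scalars in [0, 1])."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def neg_sym_bce(y1, y2):
    """Negative symmetrized BCE between the two outputs of a pair."""
    return -0.5 * (bce(y1, y2) + bce(y2, y1))

def grad(f, y1, y2, h=1e-6):
    """Central-difference gradient of f with respect to (y1, y2)."""
    d1 = (f(y1 + h, y2) - f(y1 - h, y2)) / (2 * h)
    d2 = (f(y1, y2 + h) - f(y1, y2 - h)) / (2 * h)
    return np.array([d1, d2])

g_center = grad(neg_sym_bce, 0.5, 0.5)   # stationary point: both components ~0
g_off = grad(neg_sym_bce, 0.6, 0.4)      # negative for y1, positive for y2

print(g_center)
print(g_off)
```

With gradient descent, the signs at (0.6, 0.4) mean y1 is increased and y2 is decreased, which is the desired direction; the difficulty is only in leaving the neighbourhood of the centre.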
I think that if your labels are always either 0,1 or 1,0, you can just use categorical_crossentropy for the loss.
I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 500M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP on all 3 datasets of 5k values, 500k values and 5M values.
Tested with the number of epochs ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer is a linear activation because its a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
df = pd.read_csv('kinova_jaco_data_5k.csv', names = ['state0',
states = np.asarray(
[df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(), df.state3.to_numpy(), df.state4.to_numpy(),
poses = np.asarray(
[df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(), df.pose3.to_numpy(), df.pose4.to_numpy(),
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_units=[48, 48]
inputs = Input(shape=(6,), dtype= 'float32', name = 'input')
x = Dense(units=n_units[0], activation=relu, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
    x = Dense(units=n_units[i], activation=activation, name='dense'+str(i+1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train,
validation_data=(x_test, y_test),
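The normalization step in the snippet above can also be written as a pair of small helpers, which keeps the training-set statistics in one place and avoids modifying the arrays in place. This is only a sketch: `fit_standardizer` and `standardize` are hypothetical names, and the random arrays stand in for the real pose data.

```python
import numpy as np

def fit_standardizer(x_train):
    """Compute per-feature mean and std on the training set only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # guard against constant features
    return mean, std

def standardize(x, mean, std):
    """Apply training-set statistics to any split (train, validation or test)."""
    return (x - mean) / std

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(1000, 6))   # placeholder for the pose inputs
x_val = rng.uniform(-3, 3, size=(200, 6))

mean, std = fit_standardizer(x_train)
x_train_n = standardize(x_train, mean, std)
x_val_n = standardize(x_val, mean, std)

print(x_train_n.mean(axis=0).round(6), x_train_n.std(axis=0).round(6))
```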
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]² and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.
the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -𝜋 to 𝜋 probably with an expected value of 0, so it checks out.
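One way to check whether the network is stuck at this "predict the mean" solution is to compute the MSE of a constant predictor and compare it with the loss the network plateaus at. The sketch below uses random placeholder targets in the range -π to π; `mean_baseline_mse` is an illustrative helper, not part of Keras.

```python
import numpy as np

def mean_baseline_mse(y_train, y_eval):
    """MSE obtained by always predicting the per-output mean of the training targets."""
    y_mean = y_train.mean(axis=0, keepdims=True)
    return float(np.mean((y_eval - y_mean) ** 2))

rng = np.random.default_rng(0)
y_train = rng.uniform(-np.pi, np.pi, size=(5000, 6))   # placeholder joint angles

baseline = mean_baseline_mse(y_train, y_train)
print(baseline)   # equals the average per-output variance of the targets
```

If the training loss reported by Keras sits at roughly this value from the first epoch on, the network is most likely outputting a constant.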
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads to the aforementioned derping to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the output parameters causes a large change of the input. Let's say I'm trying to scratch my back with my left hand - the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good, a very small batch size will make the gradient too noisy
The loss function is not critical, both MSE and MAE should work
The activation functions aren't critical, ReLU is a good default choice.
The default initializers are good enough.
Normalizing is important for Dense layers, so keep it
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
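The early-stopping rule from the list above can be sketched in plain Python as follows. The function name and the example losses are made up for illustration; in Keras the same idea is available as the `keras.callbacks.EarlyStopping` callback with its `monitor` and `patience` arguments.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74, 0.75]
print(early_stopping_epoch(losses, patience=5))   # 7
```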
Further reading: 37 Reasons why your Neural Network is not working
I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.
I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset so that I can flag data points with high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss
But this is really slow. By my estimation, it should take 5 hours to go through the dataset whereas training one epoch occurs in approx 55 mins.
I feel that converting to tensor operation is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes but it does not make much of a difference. I have to use the convert to tensor part because K.eval is throwing an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values,
    y_pred= tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values),
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function( y_true, y_pred)))
I was able to train in 55 mins per epoch. So I feel prediction should not take 5 hours per epoch. encoded_dataset is a variable that has the entire dataset in main memory as a data frame.
I am using Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is there to find the loss for each row of the batch.
So y_true will be of size (batch_size, 2000), and so will y_pred.
K.eval(loss_function(y_true, y_pred)) will give me an output of size (batch_size, 1), evaluating binary cross entropy on each row of y_true and y_pred.
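Since both arrays already live in main memory, one option is to skip the tensor conversion and K.eval entirely and compute the per-row loss in NumPy after obtaining the predictions. The sketch below assumes the loss is binary cross entropy averaged over the 2000 features of each row; `per_row_bce`, the clipping constant, and the random stand-in arrays are illustrative assumptions, and in practice `y_pred` would come from `ae1.predict`.

```python
import numpy as np

def per_row_bce(y_true, y_pred, eps=1e-7):
    """Binary cross entropy averaged over features, one value per row."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return losses.mean(axis=-1)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(8, 2000)).astype(np.float64)
y_pred = rng.uniform(0.01, 0.99, size=(8, 2000))   # stand-in for ae1.predict(...)

row_losses = per_row_bce(y_true, y_pred)
print(row_losses.shape)   # (8,)
```

Because the computation is vectorized over rows, it can be applied to large chunks of the dataset at once rather than to many small batches.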
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict should normally be used to output the loss function value as well as y_pred. When you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict here once to get both y_pred and the loss value in one call.
But I want the loss for each row . Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood. You could average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will perform the averaging of the losses for you if you use the latter approach (see SO articles on taking the per-sample gradient, it's actually hard to do).
If Keras implements the loss with reduce_mean built into the loss, you could just define your own loss. If you're using square loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That would produce the squared error instead of the MSE (no difference to the gradient), but look here for the variant including the mean.
In any case this produces a per sample loss so long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make that an output of the model, also perfectly valid.
I am currently working with Keras using Tensorflow as the backend. I have a LSTM Sequence Prediction model shown below that I am using to predict one step ahead in a data series (input 30 steps [each with 4 features], output predicted step 31).
model = Sequential()
model.compile(loss="mse", optimizer="rmsprop")
return model
The issue I'm having is that after training the model and testing it - even with the same data it trained on - what it outputs is essentially the 30th step in the input. My first thought is the patterns of my data must be too complex to accurately predict, at least with this relatively simple model, so the best answer it can return is essentially the last element of the input. To limit the possibility of over-fitting I've tried turning training epochs down to 1 but the same behavior appears. I've never observed this behavior before though and I have worked with this type of data before with successful results (for context, I'm using vibration data taken from 4 points on a complex physical system that has active stabilizers; the prediction is used in a pid loop for stabilization hence why, at least for now, I'm using a simpler model to keep things fast).
Does that sound like the most likely cause, or does anyone have another idea? Has anyone seen this behavior before? In case it helps with visualization, here is what the prediction looks like for one vibration point compared to the desired output (note: these screenshots are zoomed-in smaller selections of a very large dataset. As @MarcinMożejko noticed, I did not zoom quite the same both times, so any offset between the images is due to that; the intent is to show the horizontal offset between the prediction and true data within each image):
...and compared to the 30th step of the input:
Note: Each data point seen by the Keras model is an average over many actual measurements with the window of the average processed along in time. This is done because the vibration data is extremely chaotic at the smallest resolution I can measure so instead I use this moving average technique to predict the larger movements (which are the more important ones to counteract anyway). That is why the offset in the first image appears as many points off instead of just one, it is 'one average' or 100 individual points of offset.
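For readers who want to reproduce the preprocessing, a moving average of this kind can be sketched with `np.convolve`. The window of 4 and the toy input are only for illustration (the post uses 100 raw points per average), and `moving_average` is a hypothetical helper name.

```python
import numpy as np

def moving_average(x, window):
    """Average over a sliding window; output has len(x) - window + 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

raw = np.arange(10, dtype=float)          # toy stand-in for raw vibration samples
smoothed = moving_average(raw, window=4)
print(smoothed)                           # 1.5, 2.5, ..., 7.5
```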
-----Edit 1, code used to get from the input datasets 'X_test, y_test' to the plots shown above-----
model_1 = lstm.build_model() # The function above, pulled from another file 'lstm'
prediction = model_1.predict(X_test)
temp_predicted_sensor_b = (prediction[:, 0] + 1) * X_b_orig[:, 0]
sensor_b_y = (Y_test[:, 0] + 1) * X_b_orig[:, 0]
plot_results(temp_predicted_sensor_b, sensor_b_y)
plot_results(temp_predicted_sensor_b, X_b_orig[:, 29])
For context:
X_test.shape = (41541, 30, 4)
Y_test.shape = (41541, 4)
X_b_orig is the raw (averaged as described above) data from the b sensor. This is multiplied by the prediction and input data when plotting to undo normalization I do to improve the prediction. It has shape (41541, 30).
----Edit 2----
Here is a link to a complete project setup to demonstrate this behavior:
That is because for your data (stock data?), the best prediction for the 31st value is the 30th value itself. The model is correct and fits the data.
I also have similar experience predicting the stock data.
I feel I should post a follow-up, since it seems this post has been getting more attention than my other questions.
Ferret Zhang's answer is correct (and has been accepted), and I find this discovery is actually quite funny when you understand it in relation to stock / cryptocurrency data which some have commented about. What sequence prediction is ultimately doing is assigning statistical weights to different moves, to pick the highest probability move and 'predict' it will happen. In the case of stock data, in a vacuum it is (at least at this scale) completely random, there is equal probability of moving up or down, and hence the model predicts that it will stay the exact same.
The model, in a sense, learned that the best way to play is to not play at all :)
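A quick way to test for this effect on any series is to compare the model's error against a "persistence" baseline that simply repeats the last input step. The sketch below uses a synthetic random walk as a stand-in for unpredictable data (the series, the window of 30, and the variable names are assumptions for illustration). On such data, repeating step 30 is hard to beat, so a model whose error is no lower than `persistence_mse` has effectively learned to copy its last input.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=20000))   # synthetic random walk

window = 30
# Row i of X holds series[i : i + 30]; y[i] is the value that follows it.
idx = np.arange(window)[None, :] + np.arange(len(series) - window)[:, None]
X = series[idx]
y = series[window:]

persistence_mse = np.mean((y - X[:, -1]) ** 2)        # predict "same as step 30"
window_mean_mse = np.mean((y - X.mean(axis=1)) ** 2)  # predict the window average

print(persistence_mse, window_mean_mse)
```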
I'm trying to use the SSD between two images as loss function for my network.
# h_fc2 is my output layer, y_ is my label image.
ssd = tf.reduce_sum(tf.square(y_ - h_fc2))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(ssd)
Problem is, that the weights then diverge and I get the error
ReluGrad input is not finite. : Tensor had Inf values
Why's that? I did try some other stuff like normalizing the ssd by the image size (did not work) or cropping the output values to 1 (does not crash anymore, but I still need to evaluate this):
ssd_min_1 = tf.reduce_sum(tf.square(y_ - tf.minimum(h_fc2, 1)))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(ssd_min_1)
Are my observations to be expected?
@mdaoust's suggestions proved to be correct. The main point was normalizing by batch size. This can be done independently of the batch size by using this code:
squared_diff_image = tf.square(label_image - output_img)
# Sum over all dimensions except the first (the batch-dimension).
ssd_images = tf.reduce_sum(squared_diff_image, [1, 2, 3])
# Take mean ssd over batch.
error_images = tf.reduce_mean(ssd_images)
With this change, only a slight decrease of the learning rate (to 0.0001) was necessary.
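The effect of this normalization can be illustrated in NumPy: a loss summed over the whole batch grows with the batch size (and so do its gradients), while the per-image SSD averaged over the batch does not. The array shapes and function names below are assumptions for illustration only.

```python
import numpy as np

def ssd_sum(label, output):
    """SSD summed over every element, including the batch dimension."""
    return np.sum((label - output) ** 2)

def ssd_mean_over_batch(label, output):
    """SSD per image (sum over height, width, channels), then mean over the batch."""
    per_image = np.sum((label - output) ** 2, axis=(1, 2, 3))
    return per_image.mean()

rng = np.random.default_rng(0)
label = rng.uniform(size=(4, 8, 8, 1))
output = rng.uniform(size=(4, 8, 8, 1))

# Doubling the batch by repeating the same images.
label2 = np.concatenate([label, label])
output2 = np.concatenate([output, output])

print(ssd_sum(label2, output2) / ssd_sum(label, output))                          # about 2
print(ssd_mean_over_batch(label2, output2) / ssd_mean_over_batch(label, output))  # about 1
```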
There are a lot of ways you can end up with non-finite results.
But optimizers, especially simple ones like gradient descent, can diverge if the learning rate is 'too high'.
Have you tried simply dividing your learning rate by 10/100/1000? Or normalizing by pixels*batch_size to get the average error per pixel?
Or one of the more advanced optimizers? For example tf.train.AdamOptimizer() with default options.
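The learning-rate point can be seen on a toy problem. This sketch minimizes f(w) = w² with plain gradient descent (the function and the step counts are illustrative, not taken from the question): each step multiplies w by (1 - 2·lr), so the iterates shrink when lr is small and blow up once lr exceeds 1.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w      # the gradient of w**2 is 2*w
    return w

print(abs(gradient_descent(lr=0.1)))   # shrinks towards 0
print(abs(gradient_descent(lr=1.1)))   # grows with every step
```

A summed loss over many pixels behaves like a much steeper version of this function, which is why the same nominal learning rate can diverge there.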