I am training a model (VAEGAN) with intermediate outputs, and I have two losses:
KL divergence loss, computed from the output layer.
Similarity (reconstruction) loss, computed from an intermediate layer.
Can I simply sum them up and apply gradients like below?
with tf.GradientTape() as tape:
    z_mean, z_log_sigma, z_encoder_output = self.encoder(real_images, training=True)
    kl_loss = self.kl_loss_fn(z_mean, z_log_sigma) * kl_loss_coeff
    fake_images = self.decoder(z_encoder_output)
    fake_inter_activations, logits_fake = self.discriminator(fake_images, training=True)
    real_inter_activations, logits_real = self.discriminator(real_images, training=True)
    rec_loss = self.rec_loss_fn(fake_inter_activations, real_inter_activations) * rec_loss_coeff
    total_encoder_loss = kl_loss + rec_loss
grads = tape.gradient(total_encoder_loss, self.encoder.trainable_weights)
self.e_optimizer.apply_gradients(zip(grads, self.encoder.trainable_weights))
Or do I need to separate them like below, while keeping the tape persistent?
with tf.GradientTape(persistent=True) as tape:
    z_mean, z_log_sigma, z_encoder_output = self.encoder(real_images, training=True)
    kl_loss = self.kl_loss_fn(z_mean, z_log_sigma) * kl_loss_coeff
    fake_images = self.decoder(z_encoder_output)
    fake_inter_activations, logits_fake = self.discriminator(fake_images, training=True)
    real_inter_activations, logits_real = self.discriminator(real_images, training=True)
    rec_loss = self.rec_loss_fn(fake_inter_activations, real_inter_activations) * rec_loss_coeff
grads_kl_loss = tape.gradient(kl_loss, self.encoder.trainable_weights)
self.e_optimizer.apply_gradients(zip(grads_kl_loss, self.encoder.trainable_weights))
grads_rec_loss = tape.gradient(rec_loss, self.encoder.trainable_weights)
self.e_optimizer.apply_gradients(zip(grads_rec_loss, self.encoder.trainable_weights))
Yes, you can generally sum the losses and compute a single gradient. The gradient of a sum is the sum of the respective gradients, so the step taken by the summed loss is the same as taking both steps one after another. (Strictly speaking, this equivalence holds for a stateless optimizer like plain SGD; a stateful optimizer such as Adam updates its internal moment estimates on every apply_gradients call, so summing first is also the cleaner choice there.)
Here's a simple example: Say you have two weights, and you are currently at the point (1, 3) ("starting point"). The gradient for loss 1 is (2, -4) and the gradient for loss 2 is (1, 2).
If you apply the steps one after the other, you will first move to (3, -1) and then to (4, 1).
If you sum the gradients first, the overall step is (3, -2). Following this direction from the starting point gets you to (4, 1) as well.
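You can also verify this numerically with a toy example (a minimal, self-contained sketch, not your actual model):

import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    loss_a = x ** 2          # d/dx = 2x = 4
    loss_b = 3.0 * x         # d/dx = 3
    total = loss_a + loss_b  # d/dx = 2x + 3 = 7

# The gradient of the sum equals the sum of the gradients.
print(tape.gradient(total, x))                              # 7.0
print(tape.gradient(loss_a, x) + tape.gradient(loss_b, x))  # 7.0
del tape  # release the resources held by the persistent tape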
Related
I have created a Transformer model for time series predictions (a regression problem).
Details about the Dataset
I have hourly data with a single feature (lagged energy use). The model could in principle be improved by increasing the number of lagged energy-use values, which provides more information for predicting the target sequence (the energy consumption of a building). So my input has the shape X.shape = (8783, 168, 1), i.e. 8783 time sequences, each containing one week of lagged energy use (24*7 = 168 hourly entries/vectors), with each vector holding a single lagged energy-use value. My output has the shape Y.shape = (8783, 1), i.e. 8783 sequences, each containing one output value (the building's energy consumption for the next hour).
Model Details
I took as a starting point an example model from the official Keras site. It was created for classification problems, and I modified it for my regression problem by changing the activation of the last output layer from sigmoid to relu. Input shape (train_f) = (8783, 168, 1), output shape (train_p) = (8783, 1). When I trained the model for 100 epochs, it converged in fewer epochs than my reference models (LSTMs and LSTMs with self-attention). After training, when the model is asked to make predictions on the test data, the prediction performance is also good compared to the reference models.
Since the same model predicts well, in order to improve its performance I am now feeding in one month of lagged energy-use data, i.e. 168*4 = 672 hourly entries/vectors, each containing a lagged energy-use value. So the input going into the model now has the shape X.shape = (8783, 672, 1). Both the training and prediction accuracy drop compared to the weekly input data, as seen below.
**Lagged energy use data for 1 week, i.e. X.shape = (8783, 168, 1)**

|               | MSE    | RMSE   | MAE    | R-Score |
|---------------|--------|--------|--------|---------|
| Training data | 1.0489 | 1.0242 | 0.6395 | 0.9707  |
| Testing data  | 0.6221 | 0.7887 | 0.5648 | 0.9171  |

**Lagged energy use data for 1 month, i.e. X.shape = (8783, 672, 1)**

|               | MSE    | RMSE   | MAE    | R-Score |
|---------------|--------|--------|--------|---------|
| Training data | 1.6424 | 1.2816 | 0.7326 | 0.9567  |
| Testing data  | 1.4991 | 1.2244 | 0.9233 | 0.6903  |
I believe that providing more information to the model should result in better predictions. Any suggestions on how to improve the model's prediction/test accuracy? Is there something wrong with the model?
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from math import sqrt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

df_energy = pd.read_excel("/content/drive/MyDrive/Architecture Topology/Building_energy_consumption_record.xlsx")
extract_for_normalization = list(df_energy)[1]  # name of the energy-consumption column
df_data_float = df_energy[extract_for_normalization].astype(float)
df_data_array = df_data_float.to_numpy()
df_data_array_1 = df_data_array.reshape(-1, 1)
train_X, test_X = train_test_split(df_data_array_1, train_size=0.7, shuffle=False)  # chronological split
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train_X = scaler.fit_transform(train_X)
**Converting train_X into the required shape (samples, sequence length, features)**
train_f = []  # feature inputs from the training data
train_p = []  # prediction targets
n_future = 1  # number of time steps to predict into the future
n_past = 672  # number of past time steps used as input
for val in range(n_past, len(scaled_train_X) - n_future + 1):
    train_f.append(scaled_train_X[val - n_past:val, 0:scaled_train_X.shape[1]])
    train_p.append(scaled_train_X[val + n_future - 1:val + n_future, -1])
train_f, train_p = np.array(train_f), np.array(train_p)
**Transformer Model**
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Normalization and attention
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(x, x)
    x = layers.Dropout(dropout)(x)
    res = x + inputs

    # Feed-forward part
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return x + res
def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(train_p.shape[1])(x)
    return keras.Model(inputs, outputs)
input_shape = (train_f.shape[1], train_f.shape[2])
model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)
model.compile(
    loss=tf.keras.losses.mean_absolute_error,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    metrics=["mse"],
)
model.summary()
history = model.fit(train_f, train_p, epochs=100, batch_size=32, validation_split=0.25, verbose=1)
trainYPredict = model.predict(train_f)
**Inverse transform the predictions and keep the last value (output)**
# The scaler expects its original feature width, so tile the predictions to that
# width before inverse-transforming, then keep the last column.
trainYPredict1 = np.repeat(trainYPredict, scaled_train_X.shape[1], axis=-1)
trainYPredict_actual = scaler.inverse_transform(trainYPredict1)[:, -1]
train_p_actual = np.repeat(train_p, scaled_train_X.shape[1], axis=-1)
train_p_actual1 = scaler.inverse_transform(train_p_actual)[:, -1]
Prediction_mse = mean_squared_error(train_p_actual1, trainYPredict_actual)
print("Mean squared error of prediction is:", str(Prediction_mse))
Prediction_rmse = sqrt(Prediction_mse)
print("Root mean squared error of prediction is:", str(Prediction_rmse))
prediction_r2 = r2_score(train_p_actual1, trainYPredict_actual)
print("R2 score of predictions is:", str(prediction_r2))
prediction_mae = mean_absolute_error(train_p_actual1, trainYPredict_actual)
print("Mean absolute error of prediction is:", prediction_mae)
**Testing the model**

scaled_test_X = scaler.transform(test_X)
test_q = []
test_r = []
for val in range(n_past, len(scaled_test_X) - n_future + 1):
    test_q.append(scaled_test_X[val - n_past:val, 0:scaled_test_X.shape[1]])
    test_r.append(scaled_test_X[val + n_future - 1:val + n_future, -1])
test_q, test_r = np.array(test_q), np.array(test_r)
testPredict = model.predict(test_q)
I'm new to deep learning and I'm trying to create a Yolo v1 detection model from scratch with keras and tensorflow.
I need to detect only one class, so I've made the network output size 7x7x5 (a 7-by-7 grid with one bbox per cell: [P(object), x, y, w, h]).
I've created the dataset, resized all the images to 448x448, and assigned the bounding boxes, normalized as in the paper, to the corresponding cells.
My custom loss doesn't take the IoU into account to pick the best box, since there's only one box per cell to choose from.
Furthermore, it considers only the coordinate loss and the object/no-object losses, as there's only one class.
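To make the target encoding concrete, here is a minimal sketch of how one ground-truth box maps into the 7x7x5 target tensor (an illustrative helper, not the exact code from my dataset pipeline), with x, y relative to the cell and w, h relative to the image:

import numpy as np

def encode_target(box, S=7):
    # box = (x, y, w, h), all normalized to [0, 1] relative to the image
    target = np.zeros((S, S, 5), dtype=np.float32)
    x, y, w, h = box
    col, row = int(S * x), int(S * y)          # grid cell containing the box center
    x_cell, y_cell = S * x - col, S * y - row  # center coordinates relative to that cell
    target[row, col] = [1.0, x_cell, y_cell, w, h]  # [P(object), x, y, w, h]
    return target

print(encode_target((0.5, 0.5, 0.2, 0.3))[3, 3])  # [1.  0.5 0.5 0.2 0.3]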
Here's how I implemented the loss with Keras:
def custom_loss_tipregovai(y_true, y_pred):
    mse = tf.keras.losses.MeanSquaredError(reduction="sum")
    predictions = tf.reshape(y_pred, (-1, 7, 7, 5))  # channels: [P(object), x, y, w, h]
    exists_box = tf.expand_dims(y_true[..., 0], 3)   # 1 where the cell contains an object

    # BOX LOSS
    # After stripping the objectness channel, pred_box/target_box hold [x, y, w, h],
    # so x, y live in channels 0:2 and w, h in channels 2:4.
    pred_box = exists_box * predictions[..., 1:5]
    target_box = exists_box * y_true[..., 1:5]
    epsilon = 1e-6  # scalar, so it also works with a dynamic batch dimension
    wh_pred = tf.math.sign(pred_box[..., 2:4]) * tf.math.sqrt(tf.math.abs(pred_box[..., 2:4]) + epsilon)
    wh_targ = tf.math.sqrt(target_box[..., 2:4] + epsilon)
    xy_pred = pred_box[..., 0:2]
    xy_true = target_box[..., 0:2]
    final_pred_box = tf.concat([xy_pred, wh_pred], axis=3)
    final_true_box = tf.concat([xy_true, wh_targ], axis=3)
    box_loss = mse(tf.reshape(final_pred_box, (-1, final_pred_box.shape[-1])),
                   tf.reshape(final_true_box, (-1, final_true_box.shape[-1])))

    # OBJECT LOSS
    pred_obj = predictions[..., 0:1]
    true_obj = y_true[..., 0:1]
    object_loss = mse(tf.reshape(exists_box * pred_obj, (-1,)),
                      tf.reshape(exists_box * true_obj, (-1,)))

    # NO OBJECT LOSS
    non_exists_box = 1 - exists_box
    no_object_loss = mse(tf.reshape(non_exists_box * pred_obj, (-1,)),
                         tf.reshape(non_exists_box * true_obj, (-1,)))

    total_loss = 5 * box_loss + object_loss + 0.5 * no_object_loss
    return total_loss
I really don't understand why :(
Any help is appreciated, thank you!!
I have run Keras Tuner with hundreds of configurations; my (regression) model always seems to produce results close to the target value mean (0 in this case, as I'm predicting price percent change).
I have validated and checked the inputs, outputs, and scaling of the inputs many times.
The dataset is around 3000 rows.
Keras Tuner was run with the following configs:
Epochs tested: 50, 250, 500, 1000
Batch sizes tested: 64, 128, 256, 512
SEQ_LEN tested: 12, 24, 60
lstm_layers_count = hp.Int(name="lstm_layers_count", min_value=1, max_value=4, step=1)
layers_lstm_units = []
layers_dropout = []
for i in range(lstm_layers_count):
    layers_lstm_units.append(hp.Int(name=f"lstm_layer_{i}_units", min_value=32, max_value=288, step=64))
    layers_dropout.append(hp.Float(name=f"layer_{i}_dropout", min_value=0.0, max_value=0.3, step=0.1))
final_dense = hp.Boolean(name="final_dense")
final_dense_units = hp.Int(name="final_dense_units", min_value=20, max_value=150, step=30)
learning_rate = hp.Float(name="learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")
relu_enabled = hp.Boolean(name="relu_enabled")
The LSTM takes as input:
Price (As percent change)
Volume (As percent change)
EMA signal
RSI signal
All inputs are scaled to [-1, 1] using MinMaxScaler(feature_range=(-1, 1)); I also tried StandardScaler().
Output:
Pct change from the index date's price to the next day's price.
LSTM model (simplified for readability):
# All configs and layer counts have been hyper-optimized as mentioned above;
# this is a sample of one of the best config results.
model = tf.keras.Sequential()
model.add(LSTM(92, input_shape=(SEQ_LEN, feature_number), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(92, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(32, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(1))

optimization = tf.keras.optimizers.Adam(learning_rate=learning_rate)

# Compile model
model.compile(
    loss=custom_loss_function,
    optimizer=optimization,
    metrics=[custom_loss_function],
)
# Also tried using the MSE loss function.
# Note: Keras calls losses as loss(y_true, y_pred); this loss happens to be
# symmetric under swapping its arguments, so the original (y_pred, y_true)
# order gave the same values, but the conventional order is used here.
def custom_loss_function(y_true, y_pred):
    alpha = 2.0
    loss = (y_pred - y_true) ** 2.0
    adj = tf.math.multiply(y_pred, y_true)  # > 0 when prediction and target agree in sign
    adj = tf.where(tf.greater(adj, 0.0), tf.constant(1 / alpha), adj)  # down-weight correct direction
    adj = tf.where(tf.less(adj, 0.0), tf.constant(alpha), adj)         # up-weight wrong direction
    loss = loss * adj
    return tf.reduce_mean(loss)
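To make the intent of this loss explicit: the squared error is scaled down by 1/alpha when the prediction and the target share a sign, and scaled up by alpha when they disagree. A quick toy check (illustration only, with hand-picked values):

y_true = tf.constant([0.02, -0.01])
y_pred = tf.constant([0.01, 0.03])
# first pair agrees in sign  -> squared error * 0.5
# second pair disagrees      -> squared error * 2.0
print(custom_loss_function(y_true, y_pred).numpy())
# mean(0.5 * (0.01 - 0.02)**2 + 2.0 * (0.03 + 0.01)**2) / 2 terms ≈ 0.001625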
I also checked whether the predictions, even when very close to 0, match the correct direction of the target (positive or negative pct change), and the results are near random.
I am new to deep learning with PyTorch; I am more experienced with TensorFlow, so I should say I am not new to deep learning itself.
Currently, I am working on a simple ANN classification task. There are only 2 classes, so quite naturally I am using a softmax + BCELoss combination.
The dataset is like this:
Shape of X_train: (891, 7)
Shape of Y_train: (891,)
Shape of x_test: (418, 7)
I transformed X_train and the others into torch tensors (train_data and so on). The next step is:
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(train_data, train_label)
# Define data loader
batch_size = 32
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
I made the model class like:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(7, 32)
        self.bc1 = nn.BatchNorm1d(32)
        self.fc2 = nn.Linear(32, 64)
        self.bc2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 128)
        self.bc3 = nn.BatchNorm1d(128)
        self.fc4 = nn.Linear(128, 32)
        self.bc4 = nn.BatchNorm1d(32)
        self.fc5 = nn.Linear(32, 10)
        self.bc5 = nn.BatchNorm1d(10)
        self.fc6 = nn.Linear(10, 1)
        self.bc6 = nn.BatchNorm1d(1)
        self.drop = nn.Dropout2d(p=0.5)

    def forward(self, x):
        # note: calling an initializer inside forward() re-initializes fc1 on
        # every forward pass; weight init normally belongs in __init__
        torch.nn.init.xavier_uniform(self.fc1.weight)
        x = self.fc1(x)
        x = self.bc1(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.bc2(x)
        x = F.relu(x)
        #x = self.drop(x)
        x = self.fc3(x)
        x = self.bc3(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc4(x)
        x = self.bc4(x)
        x = F.relu(x)
        #x = self.drop(x)
        x = self.fc5(x)
        x = self.bc5(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc6(x)
        x = self.bc6(x)
        x = torch.sigmoid(x)
        return x
model = Net()
The loss function and the optimizer are defined:
loss = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
Finally, the task is to run the forward pass over the epochs:
num_epochs = 1000
# Repeat for given number of epochs
for epoch in range(num_epochs):
    # Train with batches of data
    for xb, yb in train_dl:
        pred = model(xb)
        yb = torch.unsqueeze(yb, 1)
        #print(pred, yb)
        print('grad', model.fc1.weight.grad)
        l = loss(pred, yb)
        #print('loss', l)
        # 3. Compute gradients
        l.backward()
        # 4. Update parameters using gradients
        optimizer.step()
        # 5. Reset the gradients to zero
        optimizer.zero_grad()
    # Print the progress
    if (epoch + 1) % 10 == 0:
        print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch + 1, num_epochs, l.item()))
I can see in the output that, after each iteration over the batches, the gradients of the weights are non-zero before zero_grad is applied.
However, the model is pretty bad. I get an F1 score of around 50% only! And the model is bad even when I ask it to predict on train_dl itself!
I am wondering what the reason is. Are the gradients non-zero but the weights not updating properly? Is the optimizer not optimizing the weights? Or something else?
Can someone please have a look?
I already tried different loss functions and optimizers. I tried with smaller datasets, bigger batches, different hyperparameters.
Thanks! :)
First of all, you don't use a softmax activation with BCE loss unless you have 2 output nodes, which is not the case here. In PyTorch, BCELoss doesn't apply any activation function before calculating the loss, unlike CrossEntropyLoss (CCE), which has a built-in softmax. So if you want to use BCELoss, the output layer has to produce values in [0, 1], i.e. end in a sigmoid (or any function f: R -> [0, 1]) rather than a softmax.
Moreover, you should ideally call optimizer.zero_grad() once per batch if you want mini-batch SGD behaviour (which is the default). If you don't, the gradients accumulate across batches, and you end up doing something closer to full-batch gradient descent, which is quite slow and gets stuck in local minima easily.
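As a minimal sketch of the usual pattern (using BCEWithLogitsLoss, which folds the sigmoid into the loss for better numerical stability; the layer sizes here are illustrative, not your exact architecture):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(7, 32), nn.ReLU(), nn.Linear(32, 1))  # outputs raw logits, no sigmoid
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for xb, yb in train_dl:  # your existing DataLoader
    optimizer.zero_grad()                            # reset gradients for this batch
    l = criterion(model(xb), yb.float().unsqueeze(1))
    l.backward()                                     # compute gradients
    optimizer.step()                                 # update parameters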
I have a neural network with the following code snippets; note that batch_size == 1 and input_dim == output_dim:
net_in = tf.Variable(tf.zeros(shape=[batch_size, input_dim]), dtype=tf.float32)
input_placeholder = tf.compat.v1.placeholder(shape=[batch_size, input_dim], dtype=tf.float32)
assign_input = net_in.assign(input_placeholder)

# Some matmuls, activations, dropouts, normalizations...
net_out = tf.tanh(output_before_activation)

def loss_fn(output, input):
    # input.shape = output.shape = (batch_size, input_dim)
    output = tf.reshape(output, [input_dim,])  # shape them into 1d vectors
    input = tf.reshape(input, [input_dim,])
    return my_fn_that_only_takes_in_vectors(output, input)
# Create session, preprocess data...
for epoch in epoch_num:
    for batch in range(total_example_num // batch_size):
        sess.run(assign_input, feed_dict={input_placeholder: some_appropriate_numpy_array})
        sess.run(optimizer.minimize(loss_fn(net_out, net_in)))
Currently the neural network above works fine, but it is very slow because it updates the gradients for every sample (batch size = 1). I would like to set the batch size > 1, but my_fn_that_only_takes_in_vectors cannot accommodate matrices whose first dimension is not 1. Due to the nature of my custom loss, flattening the batch input into a vector of length (batch_size * input_dim) does not seem to work.
How would I write my new custom loss_fn now that the input and output are N x input_dim, where N > 1? In Keras this would not have been an issue, because Keras somehow takes the average of the gradients of each example in the batch. For my TensorFlow function, should I take each row as a vector individually, pass them to my_fn_that_only_takes_in_vectors, and then take the average of the results?
You can use a function that computes the loss on the whole batch and works independently of the batch size. Basically, the operations are applied along the whole first dimension of the input (the first dimension represents the element's index within the batch). Here is an example; I hope it helps to see how the operations are carried out:
def my_loss(y_true, y_pred):
    dx2 = tf.math.squared_difference(y_true[:, 0], y_true[:, 2])  # shape: (BatchSize,)
    dy2 = tf.math.squared_difference(y_true[:, 1], y_true[:, 3])  # shape: (BatchSize,)
    denominator = dx2 + dy2                                       # shape: (BatchSize,)

    dst_vec = tf.math.squared_difference(y_true, y_pred)          # shape: (BatchSize, n_labels)
    numerator = tf.reduce_sum(dst_vec, axis=-1)                   # shape: (BatchSize,)

    # vector containing the loss of each element of the batch
    loss_vector = tf.cast(numerator / denominator, dtype="float32")  # shape: (BatchSize,)
    loss = tf.reduce_sum(loss_vector)  # if you want to sum the losses
    return loss
I am not sure whether you need to return the sum or the average of the losses over the batch.
If you sum, make sure to use a validation dataset with the same batch size; otherwise the losses are not comparable.
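If you want the batch-size-independent average instead, only the final reduction changes (a minimal variant of the same function):

def my_loss_mean(y_true, y_pred):
    dx2 = tf.math.squared_difference(y_true[:, 0], y_true[:, 2])
    dy2 = tf.math.squared_difference(y_true[:, 1], y_true[:, 3])
    denominator = dx2 + dy2
    numerator = tf.reduce_sum(tf.math.squared_difference(y_true, y_pred), axis=-1)
    loss_vector = tf.cast(numerator / denominator, dtype="float32")
    return tf.reduce_mean(loss_vector)  # average over the batch, independent of batch size

Averaging keeps the loss magnitude comparable across batch sizes, which also makes learning-rate tuning less sensitive to the batch size.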