My network has 2 outputs. I'm trying to have a loss on two terms that is not a linear sum of two losses:
def weightedBCE(y_true, y_pred):
assert y_pred.shape[2] == 2
y_pred_val = y_pred[:,:,0]
stds = y_pred[:,:,1]
bce = K.binary_crossentropy(y_true, y_pred_val)
loss = bce * (1. + LAM*stds )
return loss
The final layers of my model are defined like this (outSall has 3 values):
std = make_std_model()(outSall)
final = Dense(1, activation="sigmoid")(outSall)
output = concatenate([DSAfinal, std ], axis=-1)
But it doesn't work because Kears expects 1 loss function per output. My loss uses both outputs of the network together.
The first output is a standard classification one with Binary Cross Entropy loss, but I want it to be multiplied by (1+ LAM* stds) with a lambda factor multiplying stds. stds are the second output of the network.
How can I do this?
assert y_pred.shape[2] == 2
IndexError: list index out of range
Update:
I had an extra index, now fixed. See below. But I get an error pasted below.
def weightedBCE(y_true, y_pred):
assert y_pred.shape[1] == 2
y_pred_val = y_pred[:,0]
stds = y_pred[:,1]
bce = K.binary_crossentropy(y_true, y_pred_val)
loss = bce * (1. + LAM*stds )
return loss
ValueError: logits and labels must have the same shape ((?,) vs (?, ?)
Update2:
Keras assumes the y_true has same shape as y_pred. Which was the problem. Changed the loss to:
def weightedBCE(y_true, y_pred):
assert y_pred.shape[1] == 2
y_pred_val = y_pred[:,0]
stds = y_pred[:,1]
bce = K.binary_crossentropy(y_true[:,0], y_pred_val)
loss = bce * (1. + LAM*stds )
return loss
There is still some problem with handling two outputs, see Binary Cross Entropy not giving similar results when I have 2 outputs
Instead of creating a Keras model with two outputs, create a Keras model with a single output which is a concatenation of the two tensors (you can use keras.layers.Concatenate for that). Then you can compile the model with a single custom loss function, as the one you wrote above.
Related
I've been back and forth with this for ages, but without being able to find a solution so far anywhere. So, I have a HuggingFace model ('bert-base-cased') that I'm using with TensorFlow and a custom dataset. I've: (1) tokenized my data (2) split the data; (3) converted the data to TF dataset format; (4) instantiated, compiled and fit the model.
During training, it behaves as you'd expect: training and validation accuracy go up. But when I evaluate the model on the test dataset using TF's model.evaluate and model.predict, the results are very different. The accuracy as reported by model.evaluate is higher (and more or less in line with the validation accuracy); the accuracy as reported by model.predict is about 10% lower. (Maybe it's just a coincidence, but it's similar to the reported training accuracy after the single epoch of fine-tuning.)
Can anyone figure out what's causing this? I include snippets of my code below.
# tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-cased",use_fast=False)
def tokenize_function(examples):
return tokenizer(examples['text'], padding = "max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# splitting dataset
trainSize = 0.7
valTestSize = 1 - trainSize
train_testvalid = tokenized_datasets.train_test_split(test_size=valTestSize,stratify_by_column='class')
valid_test = train_testvalid['test'].train_test_split(test_size=0.5,stratify_by_column='class')
# renaming each of the datasets for convenience
train_set = train_testvalid['train']
val_set = valid_test['train']
test_set = valid_test['test']
# converting the tokenized datasets to TensorFlow datasets
data_collator = DefaultDataCollator(return_tensors="tf")
tf_train_dataset = train_set.to_tf_dataset(
columns=["attention_mask", "input_ids", "token_type_ids"],
label_cols=['class'],
shuffle=True,
collate_fn=data_collator,
batch_size=8)
tf_validation_dataset = val_set.to_tf_dataset(
columns=["attention_mask", "input_ids", "token_type_ids"],
label_cols=['class'],
shuffle=False,
collate_fn=data_collator,
batch_size=8)
tf_test_dataset = test_set.to_tf_dataset(
columns=["attention_mask", "input_ids", "token_type_ids"],
label_cols=['class'],
shuffle=False,
collate_fn=data_collator,
batch_size=8)
# loading tensorflow model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=1)
# compiling the model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
loss=tf.keras.losses.BinaryCrossentropy(),
metrics=tf.metrics.BinaryAccuracy())
# fitting model
history = model.fit(tf_train_dataset,
validation_data=tf_validation_dataset,
epochs=1)
# Evaluating the model on the test data using `evaluate`
results = model.evaluate(x=tf_test_dataset,verbose=2) # reports binary_accuracy: 0.9152
# first attempt at using model.predict method
hits = 0
misses = 0
for x, y in tf_test_dataset:
logits = tf.keras.backend.get_value(model(x, training=False).logits)
labels = tf.keras.backend.get_value(y)
for i in range(len(logits)):
if logits[i][0] < 0:
z = 0
else:
z = 1
if z == labels[i]:
hits += 1
else:
misses += 1
print(hits/(hits+misses)) # reports binary_accuracy: 0.8187
# second attempt at using model.predict method
modelPredictions = model.predict(tf_test_dataset).logits
testDataLabels = np.concatenate([y for x, y in tf_test_dataset], axis=0)
hits = 0
misses = 0
for i in range(len(modelPredictions)):
if modelPredictions[i][0] >= 0:
z = 1
else:
z = 0
if z == testDataLabels[i]:
hits += 1
else:
misses += 1
print(hits/(hits+misses)) # reports binary_accuracy: 0.8187
Things I've tried include:
different loss functions (it's a binary classification problem with the label column of the dataset filled with either a zero or a one for each row);
different ways of unpacking the test dataset and feeding it to model.predict;
altering the 'num_labels' parameter between 1 and 2.
I fixed the problem by changing the num_labels parameter to two and the loss function to sparse categorical cross entropy. (I then had to change my model.predict loop by taking the argmax of the two logits produced by the model.)
I have followed the tutorial available at: https://www.tensorflow.org/quantum/tutorials/mnist. I have modified this tutorial to the simplest example I could think of: an input set in which x increases linearly from 0 to 1 and y = x < 0.3. I then use a PQC with a single Rx gate with a symbol, and a readout using a Z gate.
When retrieving the optimized symbol and adjusting it manually, I can easily find a value that provides 100% accuracy, but when I let the Adam optimizer run, it converges to either always predict 1 or always predict -1. Does anybody spot what I do wrong? (and I apologize for not being able to break down the code to a smaller example)
import tensorflow as tf
import tensorflow_quantum as tfq
import cirq
import sympy
import numpy as np
# used to embed classical data in quantum circuits
def convert_to_circuit_cont(image):
"""Encode truncated classical image into quantum datapoint."""
values = np.ndarray.flatten(image)
qubits = cirq.GridQubit.rect(4, 1)
circuit = cirq.Circuit()
for i, value in enumerate(values):
if value:
circuit.append(cirq.rx(value).on(qubits[i]))
return circuit
# define classical dataset
length = 1000
np.random.seed(42)
# create a linearly increasing set for x from 0 to 1 in 1/length steps
x_train_sorted = np.asarray([[x/length] for x in range(0,length)], dtype=np.float32)
# p is used to shuffle x and y similarly
p = np.random.permutation(len(x_train_sorted))
x_train = x_train_sorted[p]
# y = x < 0.3 in {-1, 1} for Hinge loss
y_train_sorted = np.asarray([1 if (x/length)<0.30 else -1 for x in range(0,length)])
y_train = y_train_sorted[p]
# test == train for this example
x_test = x_train_sorted[:]
y_test = y_train_sorted[:]
# convert classical data into quantum circuits
x_train_circ = [convert_to_circuit_cont(x) for x in x_train]
x_test_circ = [convert_to_circuit_cont(x) for x in x_test]
x_train_tfcirc = tfq.convert_to_tensor(x_train_circ)
x_test_tfcirc = tfq.convert_to_tensor(x_test_circ)
# define the PQC circuit, consisting out of 1 qubit with 1 gate (Rx) and 1 parameter
def create_quantum_model():
data_qubits = cirq.GridQubit.rect(1, 1)
circuit = cirq.Circuit()
a = sympy.Symbol("a")
circuit.append(cirq.rx(a).on(data_qubits[0])),
return circuit, cirq.Z(data_qubits[0])
model_circuit, model_readout = create_quantum_model()
# Build the Keras model.
model = tf.keras.Sequential([
# The input is the data-circuit, encoded as a tf.string
tf.keras.layers.Input(shape=(), dtype=tf.string),
# The PQC layer returns the expected value of the readout gate, range [-1,1].
tfq.layers.PQC(model_circuit, model_readout),
])
# used for logging progress during optimization
def hinge_accuracy(y_true, y_pred):
y_true = tf.squeeze(y_true) > 0.0
y_pred = tf.squeeze(y_pred) > 0.0
result = tf.cast(y_true == y_pred, tf.float32)
return tf.reduce_mean(result)
# compile the model with Hinge loss and Adam, as done in the example. Have tried with various learning_rates
model.compile(
loss = tf.keras.losses.Hinge(),
optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
metrics=[hinge_accuracy])
EPOCHS = 20
BATCH_SIZE = 32
NUM_EXAMPLES = 1000
# fit the model
qnn_history = model.fit(
x_train_tfcirc, y_train,
batch_size=32,
epochs=EPOCHS,
verbose=1,
validation_data=(x_test_tfcirc, y_test),
use_multiprocessing=False)
results = model.predict(x_test_tfcirc)
results_mapped = [-1 if x<=0 else 1 for x in results[:,0]]
print(np.sum(np.equal(results_mapped, y_test)))
After 20 epochs of optimization, I get the following:
1000/1000 [==============================] - 0s 410us/sample - loss: 0.5589 - hinge_accuracy: 0.6982 - val_loss: 0.5530 - val_hinge_accuracy: 0.7070
This results in 700 samples out of 1000 predicted correctly. When looking at the mapped results, this is because all results are predicted as -1. When looking at the raw results, they linearly increase from -0.5484014 to -0.99996257.
When retrieving the weight with w = model.layers[0].get_weights(), subtracting 0.8, and setting it again with model.layers[0].set_weights(w), I get 920/1000 correct. Fine-tuning this process allows me to achieve 1000/1000.
Update 1:
I have also printed the update of the weight over the various epochs:
4.916246, 4.242602, 3.3765688, 2.6855211, 2.3405066, 2.206207, 2.1734586, 2.1656137, 2.1510274, 2.1634471, 2.1683235, 2.188944, 2.1510284, 2.1591303, 2.1632445, 2.1542525, 2.1677444, 2.1702878, 2.163104, 2.1635907
I set the weight to 1.36, a value which gives 908/1000 (as opposed to 700/100). The optimizer moves away from it:
1.7992111, 2.0727847, 2.1370323, 2.15711, 2.1686404, 2.1603785, 2.183334, 2.1563332, 2.156857, 2.169908, 2.1658351, 2.170673, 2.1575692, 2.1505954, 2.1561477, 2.1754034, 2.1545155, 2.1635509, 2.1464484, 2.1707492
One thing that I noticed is that the value for the hinge accuracy was 0.75 with the weight 1.36, which is higher than the 0.7 for 2.17. If this is the case, I am either in an unlucky part of the optimization landscape where the global minimum does not correspond to the minimum of the loss landscape, or the loss value is determined incorrectly. This is what I will be investigating next.
The minima of the Hinge loss function for this examples does not correspond with the maxima of number of correctly classified examples. Please see plot of these w.r.t. the value of the parameter. Given that the optimizer works towards the minima of the loss, not the maxima of the number of classified examples, the code (and framework/optimizer) do what they are supposed to do. Alternatively, one could use a different loss function to try to find a better fit. For example binarized l1 loss. This function would have the same global optimum, but would likely have a very flat landscape.
I want to implement word2vec using tensorflow 2.0
I have prepared dataset according to the skip-gramm model and I have got approx. 18 million observations(target and context words).
I have used the followng dataset for my goal:
https://www.kaggle.com/c/quora-question-pairs/notebooks
I have created a new dataset for n-gramm model. I have used windows_size 2 and number of skips equal to 2 as well in order to create for each target word(as our input) create context word(that is what I have to predict). It looks like this:
target context
1 3
1 1
2 1
2 1222
Here is my code:
dataset_train = tf.data.Dataset.from_tensor_slices((target, context))
dataset_train = dataset_train.shuffle(buffer_size=1024).batch(64)
#Parameters:
num_words = len(word_index)#approximately 100000
embed_size = 300
num_sampled = 64
initializer_softmax = tf.keras.initializers.GlorotUniform()
#Variables:
embeddings_weight = tf.Variable(tf.random.uniform([num_words,embed_size],-1.0,1.0))
softmax_weight = tf.Variable(initializer_softmax([num_words,embed_size]))
softmax_bias = tf.Variable(initializer_softmax([num_words]))
optimizer = tf.keras.optimizers.Adam()
#As before, we are supplying a list of integers (that correspond to our validation vocabulary words) to the embedding_lookup() function, which looks up these rows in the normalized_embeddings tensor, and returns the subset of validation normalized embeddings.
#Now that we have the normalized validation tensor, valid_embeddings, we can multiply this by the full normalized vocabulary (normalized_embedding) to finalize our similarity calculation:
#tf.function
def training(X,y):
with tf.GradientTape() as tape:
embed = tf.nn.embedding_lookup(embeddings_weight,X)
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights = softmax_weight, biases = softmax_bias, inputs = embed,
labels = y, num_sampled = num_sampled, num_classes = num_words))
variables = [embeddings_weight,softmax_weight,softmax_bias]
gradients = tape.gradient(loss,variables)
optimizer.apply_gradients(zip(gradients,variables))
EPOCHS = 30
for epoch in range(EPOCHS):
print('Epoch:',epoch)
for X,y in dataset_train:
training(X,y)
#compute similarity of words:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings_weight), 1, keepdims=True))
norm_embed = embeddings_weight/ norm
temp_emb = tf.nn.embedding_lookup(norm_embed,X)
similarity = tf.matmul(temp_emb,tf.transpose(norm_embed))
But the computation of even 1 epoch lasts too long. Is it possible somehow to improve the perfomance of my code?(I am using google colab for the code execution)
EDIT: this is a shape of my train dataset
dataset_train
<BatchDataset shapes: ((None,), (None, 1)), types: (tf.int64, tf.int64)>
I was following the instructions from this guide: https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
This is because softmax function is computationally quite expensive while dealing with possibilities of millions of points in Word2Vec algorithm as explained here. A faster training would be possible with negative sampling.
I have a convolutional neural network with three images as inputs:
x_anchor = tf.placeholder('float', [None, 4900], name='x_anchor')
x_positive = tf.placeholder('float', [None, 4900], name='x_positive')
x_negative = tf.placeholder('float', [None, 4900], name='x_negative')
Within a train function, I feed the placeholders with the actual images:
input1, input2, input3 = training.next_batch(start,end)
....some other operations...
loss_value = sess.run([cost], feed_dict={x_anchor:input1, x_positive:input2, x_negative:input3})
I'm using a triplet loss function on these three inputs (that's actually the cost variable above):
def triplet_loss(d_pos, d_neg):
margin = 0.2
loss = tf.reduce_mean(tf.maximum(0., margin + d_pos - d_neg))
return loss
How can I filter the losses, so only the images with loss_value > 0 will be used to train the network?
How can I implement something like:
if(loss_value for input1, input2, input3 > 0)
use inputs to train network
else
do nothing/try another input
What I have tried so far:
I took the images one by one (input1[0], input2[0], input3[0]), calculated the loss, and if the loss was positive I would calculate (and apply) the gradients. But the problem is I use dropout in my model and I have to apply the model twice on my inputs:
First to calculate the loss and verify whether it's greater than 0
Second to run the optimizer: this is when things go wrong. As I mentioned before, I use dropout, so the results of the model on my inputs are different, so the new loss will sometimes be 0 even if the loss determined at step 1 is greater than 0.
I also tried to use tf.py_func but got stuck.
There's a new TensorFlow feature called “AutoGraph”. AutoGraph converts Python code, including control flow, print() and other Python-native features, into pure TensorFlow graph code. For example:
#autograph.convert()
def huber_loss(a):
if tf.abs(a) <= delta:
loss = a * a / 2
else:
loss = delta * (tf.abs(a) - delta / 2)
return loss
becomes this code at execution time due to the decorator:
def tf__huber_loss(a):
with tf.name_scope('huber_loss'):
def if_true():
with tf.name_scope('if_true'):
loss = a * a / 2
return loss,
def if_false():
with tf.name_scope('if_false'):
loss = delta * (tf.abs(a) - delta / 2)
return loss,
loss = ag__.utils.run_cond(tf.less_equal(tf.abs(a), delta), if_true,
if_false)
return loss
What you wanted to do could have been implemented before using tf.cond().
I found out about this through this medium post.
I started to play with TensorFlow two days ago and I'm wondering if there is the triplet and the contrastive losses implemented.
I've been looking at the documentation, but I haven't found any example or description about these things.
Update (2018/03/19): I wrote a blog post detailing how to implement triplet loss in TensorFlow.
You need to implement yourself the contrastive loss or the triplet loss, but once you know the pairs or triplets this is quite easy.
Contrastive Loss
Suppose you have as input the pairs of data and their label (positive or negative, i.e. same class or different class). For instance you have images as input of size 28x28x1:
left = tf.placeholder(tf.float32, [None, 28, 28, 1])
right = tf.placeholder(tf.float32, [None, 28, 28, 1])
label = tf.placeholder(tf.int32, [None, 1]). # 0 if same, 1 if different
margin = 0.2
left_output = model(left) # shape [None, 128]
right_output = model(right) # shape [None, 128]
d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)
loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d
loss = 0.5 * tf.reduce_mean(loss)
Triplet Loss
Same as with contrastive loss, but with triplets (anchor, positive, negative). You don't need labels here.
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
loss = tf.maximum(0., margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
The real trouble when implementing triplet loss or contrastive loss in TensorFlow is how to sample the triplets or pairs. I will focus on generating triplets because it is harder than generating pairs.
The easiest way is to generate them outside of the Tensorflow graph, i.e. in python and feed them to the network through the placeholders. Basically you select images 3 at a time, with the first two from the same class and the third from another class. We then perform a feedforward on these triplets, and compute the triplet loss.
The issue here is that generating triplets is complicated. We want them to be valid triplets, triplets with a positive loss (otherwise the loss is 0 and the network doesn't learn).
To know whether a triplet is good or not you need to compute its loss, so you already make one feedforward through the network...
Clearly, implementing triplet loss in Tensorflow is hard, and there are ways to make it more efficient than sampling in python but explaining them would require a whole blog post !
Triplet loss with semihard negative mining is now implemented in tf.contrib, as follows:
triplet_semihard_loss(
labels,
embeddings,
margin=1.0
)
where:
Args:
labels: 1-D tf.int32 Tensor with shape [batch_size] of multiclass
integer labels.
embeddings: 2-D float Tensor of embedding vectors.Embeddings should
be l2 normalized.
margin: Float, margin term in theloss definition.
Returns:
triplet_loss: tf.float32 scalar.
For further information, check the link bellow:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/losses/metric_learning/triplet_semihard_loss
Tiago, I don't think you are using the same formula Olivier gave.
Here is the right code (not sure it will work though, just fixing the formula) :
def compute_euclidean_distance(x, y):
"""
Computes the euclidean distance between two tensorflow variables
"""
d = tf.reduce_sum(tf.square(tf.sub(x, y)),1)
return d
def compute_contrastive_loss(left_feature, right_feature, label, margin):
"""
Compute the contrastive loss as in
L = 0.5 * Y * D^2 + 0.5 * (Y-1) * {max(0, margin - D)}^2
**Parameters**
left_feature: First element of the pair
right_feature: Second element of the pair
label: Label of the pair (0 or 1)
margin: Contrastive margin
**Returns**
Return the loss operation
"""
label = tf.to_float(label)
one = tf.constant(1.0)
d = compute_euclidean_distance(left_feature, right_feature)
d_sqrt = tf.sqrt(compute_euclidean_distance(left_feature, right_feature))
first_part = tf.mul(one-label, d)# (Y-1)*(d)
max_part = tf.square(tf.maximum(margin-d_sqrt, 0))
second_part = tf.mul(label, max_part) # (Y) * max(margin - d, 0)
loss = 0.5 * tf.reduce_mean(first_part + second_part)
return loss