How to correctly ignore padded or missing timesteps at decoding time in multi-feature sequences with LSTM autoencoder - TensorFlow

I am trying to learn a latent representation for text sequences (with 3 features per timestep) by reconstructing them with an autoencoder. Since some sequences are shorter than the maximum pad length, i.e. the number of time steps I am considering (seq_length=15), I am not sure whether the reconstruction will learn to ignore the padded timesteps when calculating loss or accuracy.
I followed the suggestions from this answer to crop the outputs, but my losses are NaN, and so are several of the accuracies.
input1 = keras.Input(shape=(seq_length,),name='input_1')
input2 = keras.Input(shape=(seq_length,),name='input_2')
input3 = keras.Input(shape=(seq_length,),name='input_3')
input1_emb = layers.Embedding(70,32,input_length=seq_length,mask_zero=True)(input1)
input2_emb = layers.Embedding(462,192,input_length=seq_length,mask_zero=True)(input2)
input3_emb = layers.Embedding(84,36,input_length=seq_length,mask_zero=True)(input3)
merged = layers.Concatenate()([input1_emb, input2_emb,input3_emb])
activ_func = 'tanh'
encoded = layers.LSTM(120, activation=activ_func, return_sequences=True)(merged)
encoded = layers.LSTM(60,activation=activ_func,return_sequences=True)(encoded)
encoded = layers.LSTM(15,activation=activ_func)(encoded)
# Decoder reconstruct inputs
decoded1 = layers.RepeatVector(seq_length)(encoded)
decoded1 = layers.LSTM(60, activation= activ_func , return_sequences=True)(decoded1)
decoded1 = layers.LSTM(120, activation= activ_func , return_sequences=True,name='decoder1_last')(decoded1)
Decoder one has an output shape of (None, 15, 120).
input_copy_1 = layers.TimeDistributed(layers.Dense(70, activation='softmax'))(decoded1)
input_copy_2 = layers.TimeDistributed(layers.Dense(462, activation='softmax'))(decoded1)
input_copy_3 = layers.TimeDistributed(layers.Dense(84, activation='softmax'))(decoded1)
For each output, I am trying to crop the 0-padded timesteps as suggested by this answer. padding has 0 where the actual input was missing (zero due to padding) and 1 otherwise.
#tf.function
def cropOutputs(x):
    # x[0] is the softmax of the respective feature (time distributed) on top of the decoder
    # x[1] is the actual input feature
    padding = tf.cast(tf.not_equal(x[1][1], 0), dtype=tf.keras.backend.floatx())
    print(padding)
    return x[0] * tf.tile(tf.expand_dims(padding, axis=-1), tf.constant([1, x[0].shape[2]], tf.int32))
Applying crop function to all three outputs.
input_copy_1 = layers.Lambda(cropOutputs, name='input_copy_1', output_shape=(None, 15, 70))([input_copy_1,input1])
input_copy_2 = layers.Lambda(cropOutputs, name='input_copy_2', output_shape=(None, 15, 462))([input_copy_2,input2])
input_copy_3 = layers.Lambda(cropOutputs, name='input_copy_3', output_shape=(None, 15, 84))([input_copy_3,input3])
My logic is to crop the timesteps of each feature (all 3 features of a sequence have the same length, so they miss timesteps together). Since each timestep carries a softmax over its feature's vocabulary size (70, 462, 84), I have to zero out padded timesteps by building a mask of zeros and ones from the padding indicator, broadcast (tiled) to that feature size, and multiplying it with the respective softmax output.
I am not sure whether I am doing this right, as I get NaN losses for these outputs, and other accuracies that I am learning jointly with this task break as well (it only happens with this cropping).
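For reference, here is a minimal standalone sketch (my own addition, not part of the original post) of the broadcasting form of this masking, which avoids the explicit tf.tile:
import tensorflow as tf

# toy shapes: batch of 2 sequences, seq_length = 15, feature vocabulary of 70
softmax_out = tf.random.uniform((2, 15, 70))                         # time-distributed softmax output
int_input = tf.concat([tf.constant([[5, 3, 7], [1, 2, 3]]),
                       tf.zeros((2, 12), dtype=tf.int32)], axis=1)   # 0 = padding token
padding = tf.cast(tf.not_equal(int_input, 0), tf.float32)            # (2, 15), 1 where a real token is present
masked = softmax_out * tf.expand_dims(padding, axis=-1)              # broadcasts against (2, 15, 70)
print(masked.shape)                                                  # (2, 15, 70), padded timesteps zeroed out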

If it helps someone, I ended up masking out the padded entries in the loss directly (taking some Keras code pointers from these answers).
#tf.function
def masked_cc_loss(y_true, y_pred):
    # a timestep is masked when its whole one-hot row equals the masked value encoding
    mask = tf.keras.backend.all(tf.equal(y_true, masked_val_hotencoded), axis=-1)
    mask = 1 - tf.cast(mask, tf.keras.backend.floatx())
    # per-timestep losses (no reduction), so the mask can zero out padded steps
    cce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
    loss = cce(y_true, y_pred) * mask
    return tf.keras.backend.sum(loss) / tf.keras.backend.sum(mask)  # averaging by the number of unmasked entries
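A quick standalone sanity check of this loss (my own illustration; I assume here that a padded timestep's target row is all zeros, so masked_val_hotencoded is a zero vector):
import tensorflow as tf

masked_val_hotencoded = tf.zeros(4)                  # assumed encoding of a padded timestep
y_true = tf.constant([[[0., 1., 0., 0.],
                       [0., 0., 0., 0.]]])           # (batch=1, timesteps=2, classes=4); step 2 is padding
y_pred = tf.constant([[[0.1, 0.7, 0.1, 0.1],
                       [0.25, 0.25, 0.25, 0.25]]])
print(masked_cc_loss(y_true, y_pred))                # averages over the single unmasked timestep only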

Related

ValueError: Dimensions must be equal in Tensorflow/Keras

My code is as follows:
v = tf.Variable(initial_value=v, trainable=True)
v.shape is (1, 768)
In the model:
inputs_sents = keras.Input(shape=(50,3))
inputs_events = keras.Input(shape=(50,768))
x_1 = tf.matmul(v,tf.transpose(inputs_events))
x_2 = tf.matmul(x_1,inputs_sents)
But I got an error:
ValueError: Dimensions must be equal, but are 768 and 50 for
'{{node BatchMatMulV2_3}} = BatchMatMulV2[T=DT_FLOAT, adj_x=false, adj_y=false](BatchMatMulV2_3/ReadVariableOp, Transpose_3)'
with input shapes: [1,768], [768,50,?]
I think it takes the batch dimension into account? But how should I deal with this?
v is a trainable vector (a 2D array whose first dimension is 1), and I want it to be trained during the training process.
PS: This is the result I got using the code provided by the first answer; I think it is incorrect because Keras already accounts for the batch dimension.
Plus, from the keras documentation,
shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.
https://keras.io/api/layers/core_layers/input/
Should I rewrite my codes without keras?
The shape of a batch is denoted by None:
import numpy as np
import tensorflow as tf
from tensorflow import keras

inputs_sents = keras.Input(shape=(None,1,3))
inputs_events = keras.Input(shape=(None,1,768))
v = np.ones(shape=(1,768), dtype=np.float32)
v = tf.Variable(initial_value=v, trainable=True)
x_1 = tf.matmul(v,tf.transpose(inputs_events))
x_2 = tf.matmul(x_1,inputs_sents)
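For what it's worth, here is an alternative sketch of my own (assuming the intended shapes are inputs_events = (batch, 50, 768) and inputs_sents = (batch, 50, 3) as described in the question). Wrapping the trainable (1, 768) vector in a custom layer lets Keras track it, and tf.einsum makes the batched product explicit without transposing the batch dimension; the layer name EventAttention is a hypothetical placeholder:
import tensorflow as tf
from tensorflow import keras

class EventAttention(keras.layers.Layer):
    def build(self, input_shape):
        # trainable vector v of shape (1, 768)
        self.v = self.add_weight(name='v', shape=(1, 768), initializer='ones', trainable=True)

    def call(self, inputs):
        inputs_sents, inputs_events = inputs
        x_1 = tf.einsum('ij,bkj->bik', self.v, inputs_events)   # (batch, 1, 50)
        return tf.matmul(x_1, inputs_sents)                     # (batch, 1, 3)

inputs_sents = keras.Input(shape=(50, 3))
inputs_events = keras.Input(shape=(50, 768))
x_2 = EventAttention()([inputs_sents, inputs_events])
model = keras.Model([inputs_sents, inputs_events], x_2)
model.summary()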

Two input layers for LSTM Neural Network?

I am now building a neural network, and I am facing the task of adding another input layer (until now I just needed one).
In particular, this was the code previously:
###...
if(self.net_embedding==0):
    l_input = Input(shape=self.win_size, dtype='int32', name='input_act')
    emb_input = Embedding(output_dim=params["output_dim_embedding"], input_dim=unique_events + 1, input_length=self.win_size)(l_input)
    toBePassed=emb_input
elif(self.net_embedding==1):
    self.getWord2VecEmbeddings(params['word2vec_size'])
    X_train=self.encodePrefixes(params['word2vec_size'],X_train)
    l_input = Input(shape = (self.win_size, params['word2vec_size']), name = 'input_act')
    toBePassed=l_input

l1 = LSTM(params["shared_lstm_size"],return_sequences=True, kernel_initializer='glorot_uniform',dropout=params['dropout'])(toBePassed)
l1 = BatchNormalization()(l1)
#and so on with the rest of the layers...
The input of the model (X_train) was just an array of arrays (with size = self.win_size) of integers (e.g. [[0 1 2 3] [1 2 3 4]...] if self.win_size = 4), where the integers represent categorical elements.
As you can see, I also have two types of embeddings for this input:
Embedding layer
Word2Vec encoding
Now, I need to add another input to the net, which is likewise an array of arrays of integers (again with size = self.win_size), e.g. [[0 123 334 2212] [123 334 2212 4888]...], but this time I don't think I need to apply any embedding, because the elements here are not categorical (they represent elapsed time in seconds).
I tried by simply changing the net to:
#...
if(self.net_embedding==0):
    l_input = Input(shape=self.win_size, dtype='int32', name='input_act')
    emb_input = Embedding(output_dim=params["output_dim_embedding"], input_dim=unique_events + 1, input_length=self.win_size)(l_input)
    toBePassed=emb_input
elif(self.net_embedding==1):
    self.getWord2VecEmbeddings(params['word2vec_size'])
    X_train=self.encodePrefixes(params['word2vec_size'],X_train)
    l_input = Input(shape = (self.win_size, params['word2vec_size']), name = 'input_act')
    toBePassed=l_input

elapsed_time_input = Input(shape=self.win_size, name='input_time')
input_concat = Concatenate(axis=1)([toBePassed, elapsed_time_input])

l1 = LSTM(params["shared_lstm_size"],return_sequences=True, kernel_initializer='glorot_uniform',dropout=params['dropout'])(input_concat)
l1 = BatchNormalization()(l1)
#and so on with other layers...
but I get the error:
ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 4, 12), (None, 4)]
Do you have any solution for this, please? Any kind of help would be really appreciated, since I have a deadline in a few days and I've been banging my head against this for a long time now! Thanks :)
There are two problems with your approach.
First, inputs to LSTM should have a shape of (batch_size, num_steps, num_feats), yet your elapsed_time_input has shape (None, 4). You need to expand its dimension to get the proper shape (None, 4, 1).
elapsed_time_input = tf.keras.layers.Reshape((-1, 1))(elapsed_time_input)
or
elapsed_time_input = tf.expand_dims(elapsed_time_input, axis=-1)
With this, "elapsed time in seconds" will be seen as just another feature of a timestep.
Secondly, you'll want to concatenate the two inputs in the feature dimension (not the timestep dimension).
input_concat = Concatenate(axis=-1)([toBePassed, elapsed_time_input])
or
input_concat = Concatenate(axis=2)([toBePassed, elapsed_time_input])
After this, you'll get a Keras tensor with a shape of (None, 4, 13). It represents a batch of time series, each having 4 timesteps and 13 features per step (12 original features + elapsed time in seconds for each step).
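Putting both fixes together, a minimal self-contained sketch (my own; the embedding size of 12 and win_size = 4 match the shapes in the error message, while the vocabulary size and LSTM width are placeholders):
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Reshape, Concatenate, LSTM, BatchNormalization

win_size = 4
unique_events = 100                                                           # placeholder vocabulary size

l_input = Input(shape=(win_size,), dtype='int32', name='input_act')
emb_input = Embedding(output_dim=12, input_dim=unique_events + 1)(l_input)    # (None, 4, 12)

elapsed_time_input = Input(shape=(win_size,), name='input_time')
elapsed_time_3d = Reshape((win_size, 1))(elapsed_time_input)                  # (None, 4, 1)

input_concat = Concatenate(axis=-1)([emb_input, elapsed_time_3d])             # (None, 4, 13)
l1 = LSTM(64, return_sequences=True)(input_concat)
l1 = BatchNormalization()(l1)

model = tf.keras.Model(inputs=[l_input, elapsed_time_input], outputs=l1)
model.summary()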

Variable batch size in tensorflow and CNN

I want to feed a 1-D CNN a sequence of fixed length and have it make a prediction (regression), but I want to have a variable batch size during training. The tutorials are not really helpful.
In my input layer I have something like this:
input = tf.placeholder(tf.float32, [None, sequence_length], name="input")
y = tf.placeholder(tf.float32, [None, 1], name="y")
so I assume the None dimension can be a variable batch size of any number. The current input dimension is then batch_size x sequence_length, and I am supposed to feed the network a 2D NumPy array with dimensions any x sequence_length.
tf.nn.conv1d expects 3-D input. Since my input is a single channel (one NumPy array of sequence_length observations per example), the input I need to feed the CNN should be batch_size x sequence_length x 1. If, on the other hand, I had 2 different sequences that I combine to predict a single value in the end, it would be batch_size x sequence_length x 2 and I would also need to concatenate the 2 different channels. So in my case I need
input = tf.expand_dims(input, -1)
and then the filter also follow the same:
filter_size = 5
channel_size = 1
num_filters = 10
filter_shape = [filter_size, channel_size, num_filters]
filters = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="filters")
tf.nn.conv1d(value=input, filters=filters, stride=1, padding="SAME")
After that I add an FC layer, but the network isn't able to learn anything, not even a basic function such as sin(x). Does the code above look correct?
Also, how can I do max pooling?
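For the max-pooling part, here is a minimal TF1-style sketch of my own (the sequence length, filter size and pool size are illustrative choices, not from the question):
import tensorflow as tf

sequence_length = 128                                                          # illustrative
input = tf.placeholder(tf.float32, [None, sequence_length], name="input")
input_3d = tf.expand_dims(input, -1)                                           # (None, sequence_length, 1)

filters = tf.Variable(tf.truncated_normal([5, 1, 10], stddev=0.1), name="filters")
conv_out = tf.nn.conv1d(value=input_3d, filters=filters, stride=1, padding="SAME")   # (None, sequence_length, 10)
pooled = tf.layers.max_pooling1d(conv_out, pool_size=2, strides=2, padding="same")   # halves the length dimension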

How does a 1D multi-channel convolutional layer (Keras) train?

I am working with time series EEG data recorded from 10 individual locations on the body to classify future behavior in terms of increasing heart activity. I would like to better understand how my labeled data corresponds to the training inputs.
So far, several RNN configurations as well as countless combinations of vanilla dense networks have not gotten me great results and I'd figure a 1D convnet is worth a try.
The things I'm having trouble understanding are:
1.) Feeding data into the model.
orig shape = (30000 timesteps, 10 channels)
array fed to layer = (300 slices, 100 timesteps, 10 channels)
Are the slices separated by 1 time step, giving me 300 slices of timesteps at either end of the original array, or are they separated end to end? If the second is true, how could I create an array of (30000 - 100) slices separated by one ts and is also compatible with the 1D CNN layer?
2) Matching labels with the training and testing data
My understanding is that when you feed in a sequence of train_x_shape = (30000, 10), there are 30000 labels with train_y_shape = (30000, 2) (2 classes) associated with the train_x data.
So, when (300 slices of) 100 timesteps of train_x data with shape = (300, 100, 10) are fed into the model, does the label value correspond to the entire 100 ts (one label per 100 ts, equal to the last timestep's label), or is each of the 100 rows/vectors in the slice labeled, one for each ts?
Train input:
train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
n_timesteps = 100
n_channels = 10
layer : model.add(Convolution1D(filters = n_channels * 2, padding = 'same', kernel_size = 3, input_shape = (n_timesteps, n_channels)))
final layer : model.add(Dense(2, activation = 'softmax'))
I use categorical_crossentropy for loss.
Answer 1
This will really depend on "how did you get those slices"?
The answer is totally dependent on what "you're doing". So, what do you want?
If you have simply reshaped (array.reshape(...)) the original array from shape (30000,10) to shape (300,100,10), the model will see:
300 individual (and not connected) sequences
100 timesteps in each sequence
Sequence 1 goes from step 0 to 99;
Sequence 2 goes from step 100 to 199, and so on.
Creating overlapping slices - Sliding window
If you want to create sequences shifted by only one timestep, make a loop for that.
import numpy as np

originalSequence = someArrayWithShape((30000,10))
newSlices = [] #empty list
start = 0
end = start + 100   #window of 100 timesteps, matching the model's input length
while end <= 30000:
    newSlices.append(originalSequence[start:end])
    start+=1
    end+=1
newSlices = np.asarray(newSlices)
Beware: if you do this in the input data, you will have to do a similar thing in your output data as well.
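As a side note of my own (assuming NumPy >= 1.20 is available), the same overlapping windows can be built without a Python loop:
import numpy as np

originalSequence = np.random.rand(30000, 10)
# windows of 100 steps, shifted by 1: shape (29901, 100, 10)
newSlices = np.lib.stride_tricks.sliding_window_view(originalSequence, 100, axis=0).transpose(0, 2, 1)
print(newSlices.shape)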
Answer 2
Again, that's totally up to you. What do you want to achieve?
Convolutional layers will keep the timesteps with these options:
If you use padding='same', the final length will be the same as the input
If you don't, the final length will be reduced depending on the kernel size you choose
Recurrent layers will keep the timesteps or not depending on:
Whether you use return_sequences=True - Output has timesteps
Or you use return_sequences=False - Output has no timesteps
If you want only one output for each sequence (not per timestep):
Recurrent models:
Use LSTM(...., return_sequences=True) until the last LSTM
The last LSTM will be LSTM(..., return_sequences=False)
Convolutional models:
At some point after the convolutions, choose one of these to add:
GlobalMaxPooling1D
GlobalAveragePooling1D
Flatten (but treat the number of channels later with a Dense(2))
Reshape((2,))
I think I'd go with GlobalMaxPooling1D if using convolutions, but recurrent models seem better for this. (Not a rule, though.)
You can choose to use intermediate MaxPooling1D layers to gradually reduce the length from 100 to 50, then to 25 and so on. This will probably lead to a better result.
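For illustration, a minimal Keras sketch (my own, not from the answer) of a 1D conv model that produces one label per 100-step sequence:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Conv1D(filters=20, kernel_size=3, padding='same', activation='relu',
           input_shape=(100, 10)),              # 100 timesteps, 10 channels
    MaxPooling1D(pool_size=2),                  # length 100 -> 50
    Conv1D(filters=40, kernel_size=3, padding='same', activation='relu'),
    GlobalMaxPooling1D(),                       # collapse the time dimension
    Dense(2, activation='softmax'),             # one label per sequence
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()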
Remember to keep X and Y paired:
import numpy as np

train_x = someArrayWithShape((30000,10))
train_y = someArrayWithShape((30000,2))
newXSlices = [] #empty list
newYSlices = [] #empty list
start = 0
end = start + 100   #same 100-step window as above
while end <= 30000:
    newXSlices.append(train_x[start:end])
    newYSlices.append(train_y[end-1:end])   #label of the last timestep in the window
    start+=1
    end+=1
newXSlices = np.asarray(newXSlices)
newYSlices = np.asarray(newYSlices)

Implementing contrastive loss and triplet loss in Tensorflow

I started to play with TensorFlow two days ago and I'm wondering whether the triplet and contrastive losses are implemented.
I've been looking at the documentation, but I haven't found any example or description about these things.
Update (2018/03/19): I wrote a blog post detailing how to implement triplet loss in TensorFlow.
You need to implement the contrastive loss or the triplet loss yourself, but once you know the pairs or triplets this is quite easy.
Contrastive Loss
Suppose you have as input the pairs of data and their label (positive or negative, i.e. same class or different class). For instance you have images as input of size 28x28x1:
left = tf.placeholder(tf.float32, [None, 28, 28, 1])
right = tf.placeholder(tf.float32, [None, 28, 28, 1])
label = tf.placeholder(tf.float32, [None])  # 0 if same, 1 if different; float so it can scale the float distances
margin = 0.2
left_output = model(left) # shape [None, 128]
right_output = model(right) # shape [None, 128]
d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)
loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d
loss = 0.5 * tf.reduce_mean(loss)
Triplet Loss
Same as with contrastive loss, but with triplets (anchor, positive, negative). You don't need labels here.
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
loss = tf.maximum(0., margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
The real trouble when implementing triplet loss or contrastive loss in TensorFlow is how to sample the triplets or pairs. I will focus on generating triplets because it is harder than generating pairs.
The easiest way is to generate them outside of the Tensorflow graph, i.e. in python and feed them to the network through the placeholders. Basically you select images 3 at a time, with the first two from the same class and the third from another class. We then perform a feedforward on these triplets, and compute the triplet loss.
The issue here is that generating triplets is complicated. We want them to be valid triplets, triplets with a positive loss (otherwise the loss is 0 and the network doesn't learn).
To know whether a triplet is good or not you need to compute its loss, so you already have to make one feedforward pass through the network...
Clearly, implementing triplet loss in TensorFlow is hard, and there are ways to make it more efficient than sampling in Python, but explaining them would require a whole blog post!
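For what it's worth, the usual building block for such online mining (a sketch of my own, not from the answer) is the batch-wise matrix of squared distances, obtained from a single forward pass of embeddings:
import tensorflow as tf

embeddings = tf.placeholder(tf.float32, [None, 128])                # one forward pass for the whole batch
dot = tf.matmul(embeddings, embeddings, transpose_b=True)           # [batch, batch] inner products
sq_norms = tf.linalg.diag_part(dot)                                 # ||e_i||^2
pairwise_sq_dist = tf.expand_dims(sq_norms, 1) - 2.0 * dot + tf.expand_dims(sq_norms, 0)
pairwise_sq_dist = tf.maximum(pairwise_sq_dist, 0.0)                # guard against small negative values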
Triplet loss with semihard negative mining is now implemented in tf.contrib, as follows:
triplet_semihard_loss(
    labels,
    embeddings,
    margin=1.0
)
where:
Args:
    labels: 1-D tf.int32 Tensor with shape [batch_size] of multiclass integer labels.
    embeddings: 2-D float Tensor of embedding vectors. Embeddings should be l2 normalized.
    margin: Float, margin term in the loss definition.
Returns:
    triplet_loss: tf.float32 scalar.
For further information, check the link below:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/losses/metric_learning/triplet_semihard_loss
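A hedged usage sketch of that contrib function (TF 1.x era API; the tensor names and sizes here are my own placeholders):
import tensorflow as tf

labels = tf.placeholder(tf.int32, [None])                 # class ids, shape [batch_size]
embeddings = tf.placeholder(tf.float32, [None, 128])      # e.g. the network's embedding output
embeddings = tf.nn.l2_normalize(embeddings, axis=1)       # the loss expects l2-normalized embeddings

loss = tf.contrib.losses.metric_learning.triplet_semihard_loss(
    labels=labels, embeddings=embeddings, margin=1.0)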
Tiago, I don't think you are using the same formula Olivier gave.
Here is the corrected code (not sure it will work though, just fixing the formula):
def compute_euclidean_distance(x, y):
    """
    Computes the squared euclidean distance between two tensorflow variables
    """
    d = tf.reduce_sum(tf.square(tf.sub(x, y)), 1)
    return d


def compute_contrastive_loss(left_feature, right_feature, label, margin):
    """
    Compute the contrastive loss as in
    L = 0.5 * (1-Y) * D^2 + 0.5 * Y * {max(0, margin - D)}^2
    **Parameters**
    left_feature: First element of the pair
    right_feature: Second element of the pair
    label: Label of the pair (0 or 1)
    margin: Contrastive margin
    **Returns**
    Return the loss operation
    """
    label = tf.to_float(label)
    one = tf.constant(1.0)
    d = compute_euclidean_distance(left_feature, right_feature)
    d_sqrt = tf.sqrt(d)
    first_part = tf.mul(one - label, d)       # (1-Y) * D^2
    max_part = tf.square(tf.maximum(margin - d_sqrt, 0))
    second_part = tf.mul(label, max_part)     # Y * {max(0, margin - D)}^2
    loss = 0.5 * tf.reduce_mean(first_part + second_part)
    return loss
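And a short usage sketch of the corrected function (my own; the 128-dimensional features and the margin value are illustrative, in the same pre-1.0 TensorFlow style as the answer):
left_feature = tf.placeholder(tf.float32, [None, 128])
right_feature = tf.placeholder(tf.float32, [None, 128])
label = tf.placeholder(tf.int32, [None])          # 0 = same pair, 1 = different pair
loss_op = compute_contrastive_loss(left_feature, right_feature, label, margin=0.2)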