TensorFlow: how to pad batched text like PyTorch's collate_fn?

I want to pad a batch of text to the same length, generate segment ids and a mask vector, and then feed them to a BERT model.
In PyTorch, I can use a collate_fn like the one below.
def collate_fn(self, batch):
    rows = self.df.iloc[batch]  # take a batch of data
    ids, seg_ids = self.get_ids_segs(rows)  # process data
    attention_mask = (ids > 0)
    return ids, seg_ids, attention_mask
But in TensorFlow, the data is passed as a tuple of matrices, so every text is padded to the maximum length of 512.
# ids.shape = seg_ids.shape = attention_mask.shape = (data_number, max_seq_len)
xs = (ids, seg_ids, attention_mask)
model.fit(xs, ys, batch_size=batch_size)
I found that tf.data.Dataset has a padded_batch function. But as far as I can tell it can only pad one input, and I have three inputs: ids, seg_ids, and attention_mask.

Probably using the apply or map method of tf.data.Dataset after applying the batch method should solve the problem.
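To make the per-batch padding concrete, here is a minimal framework-free sketch of what a collate step does for the three inputs. The function name `collate`, the `pad_id` parameter, and the sample token ids are assumptions made for illustration; they do not come from the question.

```python
def collate(batch, pad_id=0):
    """Pad every sample to the longest sequence in this batch only.

    `batch` is a list of (ids, seg_ids) pairs of equal length per sample.
    Returns padded ids, padded segment ids, and a 0/1 attention mask.
    """
    max_len = max(len(ids) for ids, _ in batch)
    padded_ids, padded_segs, masks = [], [], []
    for ids, seg_ids in batch:
        n_pad = max_len - len(ids)
        padded_ids.append(list(ids) + [pad_id] * n_pad)
        padded_segs.append(list(seg_ids) + [0] * n_pad)
        # 1 marks a real token, 0 marks padding
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded_ids, padded_segs, masks


# Two samples of different length (made-up token ids)
batch = [([101, 7592, 102], [0, 0, 0]),
         ([101, 7592, 2088, 999, 102], [0, 0, 0, 0, 0])]
ids, seg_ids, attention_mask = collate(batch)
```

The batch is padded to length 5 here rather than to 512. In tf.data, padded_batch accepts a nested structure of padded_shapes, so the same per-batch padding can be applied to several components at once, and a map step after batching can then derive the mask from the padded ids.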

Related

Proper masking in MultiHeadAttention layer in Keras

I am new to Transformers and I am trying to create a very simple model (not in the NLP area) for processing data of variable length (not sequence data, because for my problem the order of the data does not matter).
Basically, the max length of the data that I defined (number of vectors) is 10, and each vector has dimension 2. Because of the problem domain, different inputs have different numbers of vectors, and the rest of the input tensor is always padded with some value (e.g. -100000, because 0 has a certain meaning for my data).
Below is an example of a 1-batch input with 4 meaningful vectors, with the remaining vectors set to the -1.0e+5 pad value.
array([[[ 1.7e-01, -2.2e-01],
        [ 1.7e-01,  1.8e-01],
        [-3.7e-01,  3.7e-01],
        [-3.7e-01,  8.0e-02],
        [-1.0e+05, -1.0e+05],
        [-1.0e+05, -1.0e+05],
        [-1.0e+05, -1.0e+05],
        [-1.0e+05, -1.0e+05],
        [-1.0e+05, -1.0e+05],
        [-1.0e+05, -1.0e+05]]])
Now, I am using the Keras MultiHeadAttention layer, which has the option of masking part of the input for the attention weights. The call argument for this option is attention_mask, described in the Keras docs:
a boolean mask of shape (B, T, S), that prevents attention to certain positions. The boolean mask specifies which query elements can attend to which key elements, 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch dimensions and the head dimension
So the mask should be tensor of zeros and ones, with ones at positions for which attention will be calculated.
For my problem queries, keys and values are all the same (input data), and the model looks like this:
from tensorflow.keras.layers import Input, MultiHeadAttention
from tensorflow.keras.models import Model
def build_multihead_attention_model():
    input_layer = Input(shape=(10, 2), name='input')
    mask = ...  # mask somehow calculated from input_layer
    multihead_layer = MultiHeadAttention(num_heads=1, key_dim=3)
    attention_output = multihead_layer(input_layer, input_layer, attention_mask=mask, return_attention_scores=True)
    model = Model(inputs=input_layer, outputs=attention_output)
    return model
I tried to find an easy way to calculate this mask from the input layer (based on the number of input vectors that are not padding), but I wasn't successful.
How should this mask be calculated?
Input data are just numbers, not words or embeddings.
Order in data does not matter, but padded vectors are at the end of the input tensor.
Is there already a layer for this that could be used, like Masking layer in Keras?
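As an illustration of the shape the docs describe, the sketch below builds a (B, T, S) boolean mask in plain Python from the pad value. It assumes a vector counts as padding only when every component equals the pad value; the helper name `padding_attention_mask` is made up for this example and is not a Keras API.

```python
PAD_VALUE = -1.0e5  # pad value from the question's example


def padding_attention_mask(batch, pad_value=PAD_VALUE):
    """Return a nested list of shape (B, T, S) with True where attention is allowed.

    Query position t may attend to key position s only if both are real vectors.
    """
    masks = []
    for sample in batch:
        # a vector is real if at least one component differs from the pad value
        valid = [any(v != pad_value for v in vec) for vec in sample]
        # outer product of the validity vector with itself
        masks.append([[q and k for k in valid] for q in valid])
    return masks


sample = ([[0.17, -0.22], [0.17, 0.18], [-0.37, 0.37], [-0.37, 0.08]]
          + [[PAD_VALUE, PAD_VALUE]] * 6)
mask = padding_attention_mask([sample])
```

With tensors, the same outer product can be written with broadcasting, along the lines of valid[:, :, None] & valid[:, None, :], where valid is a boolean tensor of shape (B, T).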

How to correctly ignore padded or missing timesteps at decoding time in multi-feature sequences with LSTM autoencoder

I am trying to learn a latent representation for text sequences (with multiple features (3)) by reconstructing them with an autoencoder. As some of the sequences are shorter than the maximum pad length, i.e. the number of timesteps I am considering (seq_length=15), I am not sure whether the reconstruction will learn to ignore the padded timesteps when calculating the loss and accuracies.
I followed the suggestions from this answer to crop the outputs, but my losses are NaN, and so are several of the accuracies.
input1 = keras.Input(shape=(seq_length,),name='input_1')
input2 = keras.Input(shape=(seq_length,),name='input_2')
input3 = keras.Input(shape=(seq_length,),name='input_3')
input1_emb = layers.Embedding(70,32,input_length=seq_length,mask_zero=True)(input1)
input2_emb = layers.Embedding(462,192,input_length=seq_length,mask_zero=True)(input2)
input3_emb = layers.Embedding(84,36,input_length=seq_length,mask_zero=True)(input3)
merged = layers.Concatenate()([input1_emb, input2_emb,input3_emb])
activ_func = 'tanh'
encoded = layers.LSTM(120,activation=activ_func,input_shape=(seq_length,),return_sequences=True)(merged) #
encoded = layers.LSTM(60,activation=activ_func,return_sequences=True)(encoded)
encoded = layers.LSTM(15,activation=activ_func)(encoded)
# Decoder reconstruct inputs
decoded1 = layers.RepeatVector(seq_length)(encoded)
decoded1 = layers.LSTM(60, activation= activ_func , return_sequences=True)(decoded1)
decoded1 = layers.LSTM(120, activation= activ_func , return_sequences=True,name='decoder1_last')(decoded1)
Decoder one has an output shape of (None, 15, 120).
input_copy_1 = layers.TimeDistributed(layers.Dense(70, activation='softmax'))(decoded1)
input_copy_2 = layers.TimeDistributed(layers.Dense(462, activation='softmax'))(decoded1)
input_copy_3 = layers.TimeDistributed(layers.Dense(84, activation='softmax'))(decoded1)
For each output, I am trying to crop the 0-padded timesteps, as suggested by this answer. padding is 0 where the actual input was missing (was zero due to padding) and 1 otherwise.
#tf.function
def cropOutputs(x):
    # x[0] is softmax of respective feature (time distributed) on top of decoder
    # x[1] is the actual input feature
    padding = tf.cast( tf.not_equal(x[1][1],0), dtype=tf.keras.backend.floatx())
    print(padding)
    return x[0]*tf.tile(tf.expand_dims(padding, axis=-1),tf.constant([1,x[0].shape[2]], tf.int32))
Applying crop function to all three outputs.
input_copy_1 = layers.Lambda(cropOutputs, name='input_copy_1', output_shape=(None, 15, 70))([input_copy_1,input1])
input_copy_2 = layers.Lambda(cropOutputs, name='input_copy_2', output_shape=(None, 15, 462))([input_copy_2,input2])
input_copy_3 = layers.Lambda(cropOutputs, name='input_copy_3', output_shape=(None, 15, 84))([input_copy_3,input3])
My logic is to crop the padded timesteps of each feature (all 3 features of a sequence have the same length, so they miss the same timesteps). Each timestep has had a softmax applied over its feature size (70, 462, 84), so to zero out a timestep I build a multi-dimensional mask of zeros and ones matching that feature size from the padding mask, and multiply it by the respective softmax output.
I am not sure whether I am doing this right, as I get NaN losses for these outputs, and NaN for some of the accuracies of other tasks that I am learning jointly with this one (it happens only with this cropping).
If it helps someone, I ended up masking the padded entries in the loss directly (taking some Keras code pointers from these answers).
#tf.function
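def masked_cc_loss(y_true, y_pred):
    # masked_val_hotencoded is the one-hot vector that marks a padded entry (defined elsewhere)
    mask = tf.keras.backend.all(tf.equal(y_true, masked_val_hotencoded), axis=-1)
    mask = 1 - tf.cast(mask, tf.keras.backend.floatx())
    # the functional form keeps one loss value per timestep, so the mask can be applied to it
    loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred) * mask
    return tf.keras.backend.sum(loss) / tf.keras.backend.sum(mask)  # averaging by the number of unmasked entries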
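The averaging step in that loss can be checked by hand. Below is a small plain-Python sketch of a masked mean over per-timestep losses; the numbers are made up, and the guard for an all-padding input is an addition of this sketch, not part of the code above.

```python
def masked_mean(per_step_loss, mask):
    """Average only the entries whose mask value is 1."""
    total = sum(l * m for l, m in zip(per_step_loss, mask))
    count = sum(mask)
    # dividing by zero is one way to end up with NaN, so guard the empty case
    return total / count if count else 0.0


losses = [0.5, 1.5, 4.0]   # the last timestep is padding
mask = [1.0, 1.0, 0.0]
result = masked_mean(losses, mask)  # (0.5 + 1.5) / 2
```

The padded timestep's loss of 4.0 does not contribute, and the sum is divided by 2 rather than 3.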

Using One Hot Encodings

Problem definition:
Implement the function below to take one label and the total number of classes 𝐶, and return the one hot encoding in a column-wise matrix. Use tf.one_hot() to do this, and tf.reshape() to reshape your one hot tensor!
tf.reshape(tensor, shape)
def one_hot_matrix(label, depth=6):
    """
    Computes the one hot encoding for a single label

    Arguments:
        label -- (int) Categorical labels
        depth -- (int) Number of different classes that label can take

    Returns:
        one_hot -- tf.Tensor A single-column matrix with the one hot encoding.
    """
    # (approx. 1 line)
    # one_hot = ...
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    return one_hot
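If you take the "# (approx. 1 line)" hint seriously:
one_hot = tf.reshape(tf.one_hot(label, depth, axis=0), [depth, ])
Or in two steps, which returns a single-column matrix of shape (depth, 1):
one_hot = tf.one_hot(label, depth, axis=0)
one_hot = tf.reshape(one_hot, (-1, 1))
Or, as a flat vector of shape (depth,) like the first version:
one_hot = tf.reshape(tf.one_hot(label, depth, axis=0), (depth,))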
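The answers above differ only in the shape they reshape to. As a plain-Python illustration of that difference (the helper names are hypothetical and no TensorFlow is involved), a flat encoding and a single-column encoding look like this:

```python
def one_hot_flat(label, depth):
    """One-hot encoding as a flat list of length depth."""
    return [1.0 if i == label else 0.0 for i in range(depth)]


def one_hot_column(label, depth):
    """The same encoding as a depth x 1 column (a list of one-element rows)."""
    return [[v] for v in one_hot_flat(label, depth)]


flat = one_hot_flat(2, 4)
column = one_hot_column(2, 4)
```

Which of the two shapes the exercise's checker expects is not stated in the text above, so it is worth trying both.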

Keras data generator predict same number of values

I have implemented a CNN-based regression model that uses a data generator to handle the huge amount of data I have. Training and evaluation work well, but there is an issue with prediction. If, for example, I want to predict values for a test dataset of 50 samples, I use model.predict with a batch size of 5. The problem is that model.predict returns 5 values repeated 10 times instead of 50 different values. The same thing happens if I change the batch size to 1: it returns one value 50 times.
To work around this, I used a full batch size (50 in my example), and it worked. But I can't use this method on my whole test data because it is too large.
Do you have any other solution, or what is the problem in my approach?
My data generator code:
import numpy as np
import keras
class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, data_X, data_Z, target_y, batch_size=32, dim1=(120,120),
                 dim2=80, n_channels=1, shuffle=True):
        'Initialization'
        self.dim1 = dim1
        self.dim2 = dim2
        self.batch_size = batch_size
        self.data_X = data_X
        self.data_Z = data_Z
        self.target_y = target_y
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()
    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))
    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)
        return ([X, Z], y)
    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)
    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim1, self.n_channels))
        Z = np.empty((self.batch_size, self.dim2))
        y = np.empty((self.batch_size))
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('data/' + self.data_X + ID + '.npy')
            Z[i,] = np.load('data/' + self.data_Z + ID + '.npy')
            # Store target
            y[i] = np.load('data/' + self.target_y + ID + '.npy')
        return [X, Z], y
How I call model.predict():
# list_IDs is passed positionally below (test_index), so it is not repeated here
predict_params = {'data_X': 'images',
                  'data_Z': 'factors',
                  'target_y': 'True_values',
                  'batch_size': 5,
                  'dim1': (120,120),
                  'dim2': 80,
                  'n_channels': 1,
                  'shuffle': False}
# Prediction generator
prediction_generator = DataGenerator(test_index, **predict_params)
prediction_results = model.predict(prediction_generator, steps = 1, verbose=1)
If we look at your __getitem__ function, we can see this code:
list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
This code always returns the same IDs: len(indexes) is the same for every batch (at least as long as all batches have an equal number of samples), so range(len(indexes)) just loops over the first few positions of list_IDs every time.
You are already extracting the indexes of the current batch beforehand, so use those indexes to look up the IDs instead of range(len(indexes)). The following code should work:
def __getitem__(self, index):
    'Generate one batch of data'
    # Generate indexes of the batch
    indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
    # Find list of IDs
    list_IDs_temp = [self.list_IDs[k] for k in indexes]
    # Generate data
    ([X, Z], y) = self.__data_generation(list_IDs_temp)
    return ([X, Z], y)
See if this code works and gives you different results. Expect poor predictions at first, though: during training, your model also saw only the same few data points, so it needs to be retrained with the fixed generator.
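The effect of the indexing mistake can be reproduced without Keras. The sketch below uses made-up string IDs and a batch size of 5 to compare the question's lookup with one that uses the batch's own indexes; the function names are hypothetical.

```python
list_IDs = ["id_%02d" % i for i in range(10)]   # made-up sample IDs
batch_size = 5
indexes = list(range(len(list_IDs)))            # unshuffled, as with shuffle=False


def buggy_batch(index):
    batch_indexes = indexes[index * batch_size:(index + 1) * batch_size]
    # range(len(...)) is always 0..batch_size-1, whatever the batch number
    return [list_IDs[k] for k in range(len(batch_indexes))]


def fixed_batch(index):
    batch_indexes = indexes[index * batch_size:(index + 1) * batch_size]
    # look the IDs up through the indexes of this particular batch
    return [list_IDs[k] for k in batch_indexes]


first_buggy, second_buggy = buggy_batch(0), buggy_batch(1)
first_fixed, second_fixed = fixed_batch(0), fixed_batch(1)
```

Both buggy batches contain id_00 to id_04, which matches the repeated predictions described in the question, while the second fixed batch contains id_05 to id_09.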
When you use a generator, you specify a batch size, and model.predict produces batch_size predictions per step. If you set steps=1, that is all the predictions you will get. To set steps, take the number of samples you have and divide it by the batch size. For example, if you have 50 images with a batch size of 5, set steps to 10. Ideally you want to go through your test set exactly once. The code below determines a batch size and steps that do that. In the code, b_max is a value you select that limits the maximum batch size; set it based on your memory size to avoid an OOM (out of memory) error. The parameter length is the number of test samples you have.
length=500
b_max=80
batch_size=sorted([int(length/n) for n in range(1,length+1) if length % n ==0 and length/n<=b_max],reverse=True)[0]
steps=int(length/batch_size)
The result will be batch_size=50 and steps=10. Note that if length is a prime number larger than b_max, the result will be batch_size=1 and steps=length.
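The same selection can be wrapped in a small function so the edge cases are easy to check. This is a sketch of the one-liner above with integer division; the function name `batch_size_and_steps` is an assumption for illustration.

```python
def batch_size_and_steps(length, b_max):
    """Largest divisor of length that is <= b_max, plus the matching step count."""
    batch_size = max(length // n for n in range(1, length + 1)
                     if length % n == 0 and length // n <= b_max)
    return batch_size, length // batch_size


example = batch_size_and_steps(500, 80)       # the values used above
large_prime = batch_size_and_steps(97, 80)    # prime larger than b_max
small_prime = batch_size_and_steps(13, 80)    # prime that fits in one batch
```

Because batch_size always divides length exactly, batch_size * steps covers every test sample once.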
According to this solution, you need to change steps so that the generator covers every image you want to test on, i.e. the number of batches rather than 1. Try:
# len(prediction_generator) is the number of batches returned by the generator's __len__
prediction_results = model.predict(prediction_generator, steps = len(prediction_generator), verbose=1)

Variable batch size in tensorflow and CNN

I want to feed in a 1-D CNN a sequence of fixed length and want it to make a prediction (regression), but I want to have a variable batch size during training. The tutorials are not really helpful.
In my input layer I have something like this:
input = tf.placeholder(tf.float32, [None, sequence_length], name="input")
y = tf.placeholder(tf.float32, [None, 1], name="y")
So I assume the None dimension can be a variable batch size of any number, so the current input dimension is batch_size * sequence_length, and I am supposed to feed the network a 2-D np array with dimensions any * sequence_length.
tf.nn.conv1d expects a 3-D input. Since my input is a single channel, that is, one np array of sequence_length observations, I think the input I need to feed to the CNN should be 1 * batch_size * sequence_length. If, on the other hand, I had 2 different sequences that I combine to predict a single value in the end, it would be 2 * batch_size * sequence_length, and I would also need to concatenate the 2 different channels. So in my case I need
input = tf.expand_dims(input, -1)
and then the filter follows the same layout:
filter_size = 5
channel_size = 1
num_filters = 10
filter_shape = [filter_size, channel_size, num_filters]
filters = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="filters")
tf.nn.conv1d(value=input, filters=filters, stride=1)
After that I add an FC layer, but the network isn't able to learn anything, not even a basic function such as sin(x). Does the code above look correct?
Also, how can I do max pooling?
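For checking shapes by hand, here is a naive plain-Python sketch of the data layout involved: a batch expanded to [batch, width, channels], a stride-1 convolution without padding using a [filter_size, in_channels, out_channels] filter, and non-overlapping max pooling along the width. All function names are hypothetical and the numbers are made up; it illustrates the arithmetic only and is not TensorFlow code.

```python
def expand_last_dim(batch):
    """[batch, width] -> [batch, width, 1], like tf.expand_dims(x, -1)."""
    return [[[v] for v in seq] for seq in batch]


def conv1d_valid(batch, filters):
    """Naive stride-1 cross-correlation without padding.

    batch:   [batch, width, in_channels]
    filters: [filter_size, in_channels, out_channels]
    returns: [batch, width - filter_size + 1, out_channels]
    """
    k, in_ch, out_ch = len(filters), len(filters[0]), len(filters[0][0])
    out = []
    for seq in batch:
        rows = []
        for start in range(len(seq) - k + 1):
            rows.append([sum(seq[start + i][c] * filters[i][c][o]
                             for i in range(k) for c in range(in_ch))
                         for o in range(out_ch)])
        out.append(rows)
    return out


def max_pool1d(batch, pool_size):
    """Non-overlapping max pooling along the width axis, per channel."""
    return [[[max(step[c] for step in seq[s:s + pool_size])
              for c in range(len(seq[0]))]
             for s in range(0, len(seq) - pool_size + 1, pool_size)]
            for seq in batch]


x = expand_last_dim([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])  # shape [1, 6, 1]
filters = [[[1.0]], [[1.0]]]                            # shape [2, 1, 1], sums neighbours
y = conv1d_valid(x, filters)                            # shape [1, 5, 1]
p = max_pool1d(y, 2)                                    # shape [1, 2, 1]
```

The channel axis comes last in this layout, so a single-channel batch has shape [batch_size, sequence_length, 1], which is what tf.expand_dims(input, -1) produces.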