How to use the TensorFlow dataset API with unknown shapes properly?

I've been trying for several hours to complete this task with no success.
I have a very large dataset that consists of the following structure:
I want to split this data into X and Y (and pass Y to tf.to_categorical), as in the picture, using the tf.data.Dataset API, but unfortunately every attempt I've made has ended in some kind of error.
How do I use tf.data.Dataset to:
Split each row to x and y.
Convert Y to categorical with tf.to_categorical.
Split the dataset into batches.
Feed my model with the dataset.
My current attempt:
def map_sequence():
    for sequence in input_sequences:
        yield sequence[:-1], keras.utils.to_categorical(sequence[-1], total_words)

dataset = tf.data.Dataset.from_generator(map_sequence,
                                         (tf.int32, tf.int32),
                                         (tf.TensorShape(title_length-1), tf.TensorShape(total_words)))
But when I try to train my model with the following code:
inputs = keras.layers.Input(shape=(title_length-1, ))
x = keras.layers.Embedding(total_words, 32)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
predictions = keras.layers.Dense(total_words, activation='softmax')(x)
model = keras.Model(inputs=inputs, outputs=predictions)
model.compile('Adam', 'categorical_crossentropy', metrics=['acc'])
model.fit(dataset)
I am getting this error: ValueError: Shapes (32954, 1) and (65, 32954) are incompatible

I think you have a similar problem as in this question. Keras expects the dataset that you give it to produce batches, not individual examples. Since you are giving it two one-dimensional vectors at a time, Keras interprets each of them as a batch of examples with one feature. So your X data, which has 65 elements, is interpreted as a batch of 65 examples with a single feature (a 65x1 tensor). This fixes the batch size to 65. The output of the model then has shape 65x32,954 (which I assume is the value of total_words). But your Y vector, with 32,954 elements, is again interpreted as a batch of 32,954 examples with one feature (a 32,954x1 tensor). These two things don't match, hence the error. You should be able to fix it by simply making a new dataset with batch before passing it to fit.
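For illustration, a minimal sketch of that fix (the batch size of 32 here is just an example value):
# Group individual (x, y) examples into batches so Keras sees
# (batch, title_length - 1) inputs and (batch, total_words) targets.
dataset = dataset.batch(32)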
In any case, if your input_sequences is a NumPy array, as it seems to be, your method of producing the dataset is not really good, as using a generator will be really slow. This is a better way to do the same thing:
def map_sequence(sequence):
    # Using tf.one_hot instead of keras.utils.to_categorical
    # because we are working with TensorFlow tensors now
    return sequence[:-1], tf.one_hot(sequence[-1], total_words)

dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
dataset = dataset.map(map_sequence)
dataset = dataset.batch(batch_size)
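The batched dataset can then be passed straight to fit, exactly as in the original attempt (the epoch count here is just an example):
model.fit(dataset, epochs=10)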

Related

How to get labels when using model.predict()

In my project, I have a number of cases where I have a Dataset instance and I need to get predictions from some model on every item in the dataset.
The model.predict() API is optimized perfectly for this, as shown in the documentation. However, there seems to be one major catch. I also happen to need the labels to compare with the predicted values, i.e. the dataset contains x,y pairs, and I'd like to end up with (y_predicted, y) pairs after the prediction is complete. This does not seem to be possible with the predict() API though, and I can't think of a clean way to 'split' the dataset so that the x's are fed into the model and the y's are retained to be joined back up with the predicted y's.
EDIT: I know it's quite simple to do by iterating over the dataset manually and calling the model directly, e.g.
result = []  # collect (label, prediction) pairs
for x, y in dataset:
    y_pred = model(x)
    result.append((y, y_pred))
However, this seems like it will be a fair bit slower than using the built-in predict(), as TensorFlow won't be able to multi-thread/optimize the input pipeline.
Does anyone have a good way to accomplish this?
Given the concerns you mentioned, it may be best to override predict to suit your needs. You don't actually need to override that function itself, though, only predict_step, which is called by it. Just use this class instead of Model:
class MyModel(tf.keras.Model):
    def predict_step(self, data):
        x, y = data
        return self(x, training=False), y
If your model is currently Sequential, inherit from that instead. Basically the only change I made from the default implementation is to add , y to the model call result.
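For instance, a Sequential variant might look something like this (just a sketch adapting the same idea, not code from the original answer):
class MySequential(tf.keras.Sequential):
    def predict_step(self, data):
        x, y = data
        return self(x, training=False), y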
Note that this also makes some assumptions, such as that your dataset consists of (input, label) batch pairs. You may need to adapt it slightly to your needs. Here is a minimal example:
import tensorflow as tf
import numpy as np

(imgs, lbls), (te_imgs, te_lbls) = tf.keras.datasets.mnist.load_data()

imgs = imgs.astype(np.float32).reshape((-1, 784)) / 255.
te_imgs = te_imgs.astype(np.float32).reshape((-1, 784)) / 255.

lbls = lbls.astype(np.int32)
te_lbls = te_lbls.astype(np.int32)

tr_data = tf.data.Dataset.from_tensor_slices((imgs, lbls)).shuffle(60000).batch(128)
te_data = tf.data.Dataset.from_tensor_slices((te_imgs, te_lbls)).batch(128)

class MyModel(tf.keras.Model):
    def predict_step(self, data):
        x, y = data
        return self(x, training=False), y

inp = tf.keras.Input((784,))
logits = tf.keras.layers.Dense(10)(inp)

model = MyModel(inp, logits)

opt = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer=opt)

something = model.predict(te_data)
print(something[0].shape, something[1].shape)
This shows (10000, 10) (10000,): predict now returns a tuple of (outputs, labels). This can be confirmed by inspecting the returned labels and comparing them to the labels in the test set.
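As a small usage example (assuming the (predictions, labels) ordering above), the returned tuple can be used directly, e.g. to compute the test accuracy:
preds, labels = something
accuracy = (preds.argmax(axis=-1) == labels).mean()  # fraction of correct predictions
print('test accuracy:', accuracy)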

Using Tensorflow Dataset from_generator() to create multi Input/Output with Custom Generator and ImageDataGenerator

I am trying to scale up my model, which uses a "cluster loss" extension. The implementation works so far on MNIST, but I would like to benefit from data augmentation and multi-processing for the real dataset.
In short, the network follows work done with the "centre loss", which somewhat resembles a Siamese network. The important part of the architecture is that the model has 2 inputs and 2 outputs. Therefore, I implemented a custom generator in order to feed the model, as follows:
def my_generator(stop):
    i = 0
    while i < stop:
        batch = train_gen.next()
        img = batch[0]
        labels = batch[1]
        labels_size = np.shape(labels)
        cluster = np.zeros(labels_size)
        x = [img, labels]
        y = [labels, cluster]
        yield x, y
        i += 1
which calls the generator ("train_gen") defined as follows:
generator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, horizontal_flip=True)

train_gen = generator.flow_from_dataframe(df, x_col='img_path', y_col='label',
                                          class_mode='categorical',
                                          target_size=(32, 32),
                                          batch_size=batch_size)
The generator works if I set only one worker in the fit function, but obviously that's painfully slow... So I tried to use the recommended tf.data API from TensorFlow (tf.data.Dataset.from_generator) to fit my model, but setting it up as follows,
ds = tf.data.Dataset.from_generator(my_generator,
                                    args=[num_iter],
                                    output_types=([tf.float32, tf.float32], [tf.float32, tf.float32]))
I got the following error:
TypeError: Cannot convert value [tf.float32, tf.float32] to a Tensorflow DType.
From there, I tried multiple things, following this post.
For example, trying to return tuples instead of arrays:
x = (img, labels)
y = (labels, cluster)
But I got:
ValueError: as_list() is not defined on an unknown TensorShape
Does anyone have experience with this? I am not sure I understand the error, and I am thinking that I could perhaps change the "output_types" argument, but TensorFlow has no "list" or "tuple" DType.
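For clarity, here is a sketch of the kind of change I have in mind, using nested tuples of DTypes for output_types plus explicit output_shapes (the shapes and num_classes are placeholders, the generator would then need to yield nested tuples instead of lists, and I have not verified that this works):
ds = tf.data.Dataset.from_generator(
    my_generator,
    args=[num_iter],
    output_types=((tf.float32, tf.float32), (tf.float32, tf.float32)),
    output_shapes=(((None, 32, 32, 3), (None, num_classes)),
                   ((None, num_classes), (None, num_classes))))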
Here is a link to my code, which constructs a small image dataset from CIFAR-10 to feed a toy model.
I do not think your generator works as you expect. Each time it is called, it sets i = 0, and the statement after
yield x, y
namely
i += 1
never executes. Put a print statement in as below:
yield x, y
i += 1
print('the value of i is ', i)
and you will see it never executes.
The above is true if you execute
x, y = next(my_generator(2))
which is how generators are used. However, if you execute
x, y = my_generator(2)
then the i += 1 statement does execute. Normally, you use generators with next(my_generator). model.fit, I believe, gets each batch by calling next() on the generator you specify.
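To make the point about next() concrete, here is a tiny illustrative sketch (not from the original post):
gen = my_generator(2)
x, y = next(gen)  # runs up to the first yield; the i += 1 after it has not executed yet
x, y = next(gen)  # resumes after that yield, so i += 1 runs before the second batch is produced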

Why can't I use my dataset anymore after using InceptionV3?

I'm currently working on video-captioning (frame-sequence to natural language).
I recently started using the tf.data.Dataset class instead of the feed_dict argument in TensorFlow.
My goal is to feed these frames to a pretrained CNN (InceptionV3), extract the feature vectors, and then feed them to my RNN seq2seq network.
I've got a problem with TensorFlow types after mapping my Dataset with the Inception model: the dataset is then totally unusable, whether via dataset.batch() or dataset.take(). I can't even make a one-shot iterator!
Here is how I proceed to build my Dataset:
Step 1: I first extract the same number of frames for every video and store them all in a NumPy array of shape (nb_videos, nb_frames, width, height, channels).
Note that in this dataset, every video has the same size and has 3 color channels.
Step 2: Then I create a tf.data.Dataset object using this big numpy array
Note that printing this dataset in Python gives (with n_videos=2, width=240, height=320, channels=3):
I already don't understand what "DataAdapter" stands for.
At this point, I can create a one-shot iterator, but using dataset.batch(1) returns:
I don't understand why the shape is "?" and not "1".
Step 3: I use the map function on the dataset to resize all the frames of all the videos to 299x299x3 (required by InceptionV3).
At this point, I can use the data in my dataset and make a one-shot iterator.
Step 4: I use the map function again to extract the features of every video using the pretrained InceptionV3 model.
The problem occurs at this point:
Printing the dataset gives:
Ok, looks good.
However, it's now impossible to make a one-shot iterator for this dataset.
Step 1:
X_train_slice, Y_train = build_dataset(number_of_samples)
Step 2:
X_train = tf.data.Dataset.from_tensor_slices(X_train_slice)
Step 3:
def format_video(video):
    frames = tf.image.resize_images(video, (299, 299))
    frames = tf.keras.applications.inception_v3.preprocess_input(frames)
    return frames

X_train = X_train.map(lambda video: format_video(video))
Step 4:
Inception model:
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
For the tf.reduce_mean, see how-to-get-pool3-features-of-inception-v3-model-using-keras (SO).
def extract_video_features(video):
    batch_features = image_features_extract_model(video)
    batch_features = tf.reduce_mean(batch_features, axis=(1, 2))
    return batch_features

X_train = X_train.map(lambda video: extract_video_features(video))
Creating the iterator:
iterator = X_train.make_one_shot_iterator()
Here is the output:
ValueError: Failed to create a one-shot iterator for a dataset.
`Dataset.make_one_shot_iterator()` does not support datasets that capture
stateful objects, such as a `Variable` or `LookupTable`. In these cases, use
`Dataset.make_initializable_iterator()`. (Original error: Cannot capture a
stateful node (name:conv2d/kernel, type:VarHandleOp) by value.)
I don't really get it: it asks me to use an initializable_iterator, but that kind of iterator is meant for placeholders. Here, I've got raw data!
You're using the pipelines wrong.
The idea of tf.data is to provide input pipelines to a model, not to contain the model itself. What you're trying to do is run the model as a step of the pipeline (your step 4), but, as the error shows, this won't work.
What you should do instead is build the model as you are doing and then call model.predict on the input data, to obtain the features you want (as computed values). If you want to add further computation, add it in the model, since the predict call will run the model and return the values of the output layers.
Side note: image_features_extract_model = tf.keras.Model(new_input, hidden_layer) is completely irrelevant, given the choice you made for input and output tensors: the input is image_model's input and the output is image_model's output, so image_features_extract_model is identical to image_model.
The final code should be:
X_train_slice, Y_train = build_dataset(number_of_samples)
X_train = tf.data.Dataset.from_tensor_slices(X_train_slice)

def format_video(video):
    frames = tf.image.resize_images(video, (299, 299))
    frames = tf.keras.applications.inception_v3.preprocess_input(frames)
    return frames

X_train = X_train.map(lambda video: format_video(video))

image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
bottlenecks = image_model.predict(X_train)

# Do something with your bottlenecks
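From there, one natural follow-up (just a sketch of one option, assuming Y_train lines up one-to-one with the extracted bottlenecks and using a batch size of 16 as an example) is to wrap the features back into a tf.data.Dataset for the seq2seq stage:
# Pair each video's feature vector with its target sequence and batch them.
train_ds = tf.data.Dataset.from_tensor_slices((bottlenecks, Y_train)).batch(16)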

Extend word embedding layer for incremental word2vec training with Tensorflow

I'd like to train word vectors/embeddings incrementally. With each incremental run I want to extend the vocabulary of the model and add new rows to the embeddings matrix.
The embeddings matrix is a partitioned variable, so ideally I want to avoid using assign since it's not implemented for partitioned variables.
One way I tried looks like this:
# Set prev_vocab_size and new_vocab_to_add
# according to the corpus/text of the current run
prev_embeddings = tf.get_variable(
    'prev_embeddings',
    shape=[prev_vocab_size, FLAGS.embedding_size],
    dtype=tf.float32,
    initializer=tf.random_uniform_initializer(-1.0, 1.0))

new_embeddings = tf.get_variable(
    'new_embeddings',
    shape=[new_vocab_to_add, FLAGS.embedding_size],
    dtype=tf.float32,
    initializer=tf.random_uniform_initializer(-1.0, 1.0))

combined_embeddings = tf.concat([prev_embeddings, new_embeddings], 0)

embeddings = tf.Variable(
    combined_embeddings,
    expected_shape=[prev_vocab_size + new_vocab_to_add, FLAGS.embedding_size],
    dtype=tf.float32,
    name='embeddings')
Now, this works well for the first run. But if I do a second run, I get an "Assign requires shapes of both tensors to match" error, because the restored original prev_embeddings variable (from the first run) doesn't match the new shape (based on the extended vocab) I declare in the second run.
So I modified the tf.train.Saver to save the new_embeddings as the prev_embeddings like this:
saver = tf.train.Saver({"prev_embeddings": new_embeddings})
Now, in the second run, prev_embeddings has the shape that new_embeddings had in the previous run, and I don't get an error for this.
However, the new_embeddings in the second run now has a different shape than in the first run, and therefore when restoring the variables from the first run, I get another "Assign requires shapes of both tensors to match" error.
What's the best way to extend/expand the embeddings variable incrementally with new vectors for new words in the vocabulary while keeping the old and trained vectors?
Any help would be much appreciated.

How to initialize a keras tensor employed in an API model

I am trying to implement a memory-augmented neural network, in which the memory and the read/write/usage weight vectors are updated according to a combination of their previous values. These weights are different from the classic weight matrices between layers that are automatically updated with the fit() function! My problem is the following: how can I correctly initialize these weights as Keras tensors and use them in the model? I explain it better with the following simplified example.
My API model is something like:
input = Input(shape=(5,6))
controller = LSTM(20, activation='tanh', stateful=False, return_sequences=True)(input)
write_key = Dense(4, activation='tanh')(controller)
read_key = Dense(4, activation='tanh')(controller)
w_w = Add()([w_u, w_r])                              # <---- UPDATE OF WRITE WEIGHTS
to_write = Dot()([w_w, write_key])
M = Add()([M, to_write])
cos_sim = Dot()([M, read_key])
w_r = Lambda(lambda x: softmax(x, axis=1))(cos_sim)  # <---- UPDATE OF READ WEIGHTS
w_u = Add()([w_u, w_r, w_w])                         # <---- UPDATE OF USAGE WEIGHTS
retrieved_memory = Dot()([w_r, M])
controller_output = concatenate([controller, retrieved_memory])
final_output = Dense(6, activation='sigmoid')(controller_output)
You can see that, in order to compute w_w^t, I first have to have defined w_r^{t-1} and w_u^{t-1}. So, at the beginning, I have to provide a valid initialization for these vectors. What is the best way to do it? The initializations I would like to have are:
M = K.variable(numpy.zeros((10, 4)))    # MEMORY
w_r = K.variable(numpy.zeros((1, 10)))  # READ WEIGHTS
w_u = K.variable(numpy.zeros((1, 10)))  # USAGE WEIGHTS
But, analogously to what is said in #2486 (entron), these commands do not return a Keras tensor with all the needed metadata, and so this produces the following error:
AttributeError: 'NoneType' object has no attribute 'inbound_nodes'
I also thought of using the old M, w_r and w_u as further inputs at each iteration and, analogously, getting the same variables back as outputs to close the loop. But this means that I would have to use the fit() function to train the model online with just the target as the final output (Model 1), and use the predict() function on a model with all the secondary outputs (Model 2) to get the variables to use at the next iteration. I would also have to pass the weight matrices from Model 1 to Model 2 using get_weights() and set_weights(). As you can see, it becomes a little bit messy and too slow.
Do you have any suggestions for this problem?
P.S. Please do not focus too much on the API model above, because it is a simplified (almost meaningless) version of the complete one, in which I skipped several key steps.