I am trying transfer learning. I have data of shape (16657, 32, 32, 1), but the models I want to use expect an input of shape (16657, 32, 32, 3). How can I add the 2 extra channels? The data works fine in my own Conv2D model, but I want to feed it to transfer-learning models like VGG19, ResNet50, etc.
You can just duplicate the existing channel into two additional channels.
Use a preprocessing function on the input images before feeding them to the network, and define that function to stack the single channel 3 times.
img = np.array([[12, 16, 19], [124, 25, 19], [76, 8, 78]])  # shape (3, 3)
stacked_img = np.stack((img,) * 3, axis=0)                  # shape (3, 3, 3)
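Applied to the shapes in the question, a minimal sketch might look like this (X is just a stand-in name for your grayscale array; repeating along the last axis is equivalent to stacking the channel three times):
import numpy as np

X = np.random.rand(16657, 32, 32, 1).astype("float32")  # stand-in for your grayscale data
X_rgb = np.repeat(X, 3, axis=-1)  # shape (16657, 32, 32, 3)
# equivalently: np.concatenate([X, X, X], axis=-1)
print(X_rgb.shape)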
I'm working on the "trigger word detection" model, and I decided to deploy the model to my phone.
The input shape of the model is (None, 5511, 101).
The output shape is (None, 1375, 1).
But in a real deployed app, the model can't get all 5511 timesteps at once; instead, the audio frames produced by the phone's sensor arrive one by one.
How can I feed these pieces of data to the model one by one and get the output at each timestep?
The model is a recurrent one, but model.predict() takes a first parameter of shape (None, 5511, 101), whereas what I intend to do is:
output = []
for i in range(5511):
    a = model.func(i, (None, 1, 101))
    output.append(a)
structure of the model:
This problem can be solved by making the timesteps axis dynamic. In other words, when you define the model, the number of timesteps should be set to None. Here is an example illustrating how it would work for a simplified version of your model:
from keras.layers import GRU, Input, Conv1D
from keras.models import Model
import numpy as np
x = Input(shape=(None, 101))
h = Conv1D(196, 15, strides=4)(x)
h = GRU(1, return_sequences=True)(h)
model = Model(x, h)
# The model works for the original number of timesteps (5511)
batch_size = 2
out = model.predict(np.random.rand(batch_size, 5511, 101))
print(out.shape)
# ... but also for fewer timesteps (say 32)
out = model.predict(np.random.rand(batch_size, 32, 101))
print(out.shape)
# However, it will not work if timesteps < Conv1D filter_size (15)!
out = model.predict(np.random.rand(batch_size, 14, 101))
print(out.shape)
Note, however, that you will not be able to feed less than 15 timesteps (dimension of the Conv1D filters) unless you pad the input sequences to 15.
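If you do need to run on shorter chunks, one option is to zero-pad the time axis up to the filter size; a quick sketch, reusing model, batch_size and np from the snippet above:
short = np.random.rand(batch_size, 10, 101)  # only 10 timesteps
padded = np.pad(short, ((0, 0), (0, 15 - short.shape[1]), (0, 0)), mode="constant")
out = model.predict(padded)  # works now: at least one Conv1D window fits
print(out.shape)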
You should either change your model into a recurrent one, where you can feed pieces of data one at a time, or consider a model that works on (overlapping) windows in time, where you apply the model every few pieces of data and get a partial output.
Depending on the model, you might still only get the output you want at the very end, so you should design it accordingly.
Here is an example: https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/
For passing inputs step by step, you need recurrent layers with stateful=True.
The convolutional layer will certainly prevent you from achieving what you want. Either you remove it, or you pass inputs in groups of 15 steps (where 15 is your kernel size for the convolution).
You would need to coordinate those 15 steps with the stride of 4, and might need padding. If I may suggest, to avoid mathematical difficulties, you could use kernel_size=16, strides=4 and input_steps=5512, which is a multiple of your stride of 4. This avoids padding and keeps the calculations simple, and your output will have exactly (5512 - 16)/4 + 1 = 1375 steps.
Then your model would look like this:
inputs = Input(batch_shape=(batch_size, None, 101))  # you will always feed input shapes of (batch_size, 16, 101)
out = Conv1D(196, 16, strides=4)(inputs)
...
...
out = GRU(..., stateful=True)(out)
...
out = GRU(..., stateful=True)(out)
...
...
model = Model(inputs, out)
A stateful=True model requires a fixed batch size. It can be 1, but to optimize processing speed, use a bigger batch size if you have more than one sequence to process in parallel (and independently of each other).
To work step by step, you first of all need to reset states: whenever you use a stateful=True model, you must reset states every time you are going to feed a new sequence or a new batch of parallel sequences.
So:
#will start a new batch containing a number of sequences equal to batch_size:
model.reset_states()
#received 16 steps from batch_size sequences:
steps = an_array_shaped((batch_size, 16, 101))
#for training
model.train_on_batch(steps, something_for_y_shaped((batch_size, 1, 1)), ...)
#I don't recommend training like this because of the batch normalizations.
#If you can train on the entire length at once, do it.
#Never forget: for full-length training, you would still need model.reset_states() before every batch.
#for predicting:
predictions = model.predict_on_batch(steps, ...)
#received 4 new steps from the same batch_size sequences:
steps = np.concatenate([steps[:, 4:], new_steps], axis=1)
#these new steps belong to the "same" batch_size sequences! Don't call reset_states!
#repeat one of the above for training or predicting
new_predictions = model.predict_on_batch(steps, ...)
predictions = np.concatenate([predictions, new_predictions], axis=1)
#keep repeating this loop until you reach the last step
Finally, when you reach the last step, call `model.reset_states()` again for safety, so that everything you input afterwards is treated as new sequences, not as new steps of the previous sequences.
------------
# Training hint
If you are able to train with the full sequences (not step by step), use a `stateful=False` model, train normally with `model.fit(...)`, then recreate exactly the same model but with `stateful=True`, copy the weights with `new_model.set_weights(old_model.get_weights())`, and use the new model for predicting as described above.
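A rough sketch of that workflow on a simplified model (the layer sizes below are placeholders, not the exact architecture from the question):
from keras.layers import GRU, Input, Conv1D
from keras.models import Model

def build(stateful, batch_size=None):
    # same architecture both times; only the statefulness (and batch shape) changes
    if stateful:
        inp = Input(batch_shape=(batch_size, None, 101))
    else:
        inp = Input(shape=(None, 101))
    h = Conv1D(196, 16, strides=4)(inp)
    h = GRU(128, return_sequences=True, stateful=stateful)(h)
    out = GRU(1, return_sequences=True, stateful=stateful)(h)
    return Model(inp, out)

train_model = build(stateful=False)
# ... train_model.fit(...) on full-length sequences ...

predict_model = build(stateful=True, batch_size=1)
predict_model.set_weights(train_model.get_weights())
# now feed predict_model window by window as described above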
I am new to Keras and I am trying to implement a model for an image captioning project.
I am trying to reproduce the model from Image captioning pre-inject architecture (The picture is taken from this paper: Where to put the image in an image captioning generator) (but with a minor difference: generating a word at each time step instead of only generating a single word at the end), in which the inputs for the LSTM at the first time step are the embedded CNN features. The LSTM should support variable input length and in order to do this I padded all the sequences with zeros so that all of them have maxlen time steps.
The code for the model I have right now is the following:
from keras.layers import (Input, Dense, BatchNormalization, RepeatVector,
                          Embedding, LSTM, TimeDistributed, concatenate)
from keras.models import Model
from keras.optimizers import Adam

def get_model(model_name, batch_size, maxlen, voc_size, embed_size,
              cnn_feats_size, dropout_rate):
    # create input layer for the cnn features
    cnn_feats_input = Input(shape=(cnn_feats_size,))

    # normalize CNN features
    normalized_cnn_feats = BatchNormalization(axis=-1)(cnn_feats_input)

    # embed CNN features to have the same dimension as the word embeddings
    embedded_cnn_feats = Dense(embed_size)(normalized_cnn_feats)

    # add a time dimension so that this layer's output shape is (None, 1, embed_size)
    final_cnn_feats = RepeatVector(1)(embedded_cnn_feats)

    # create input layer for the captions (each caption has at most maxlen words)
    caption_input = Input(shape=(maxlen,))

    # embed the captions
    embedded_caption = Embedding(input_dim=voc_size,
                                 output_dim=embed_size,
                                 input_length=maxlen)(caption_input)

    # concatenate CNN features and the captions.
    # Output shape should be (None, maxlen + 1, embed_size)
    img_caption_concat = concatenate([final_cnn_feats, embedded_caption], axis=1)

    # now feed the concatenation into an LSTM layer (many-to-many)
    lstm_layer = LSTM(units=embed_size,
                      input_shape=(maxlen + 1, embed_size),  # one additional time step for the image features
                      return_sequences=True,
                      dropout=dropout_rate)(img_caption_concat)

    # create a fully connected layer to make the predictions
    pred_layer = TimeDistributed(Dense(units=voc_size))(lstm_layer)

    # build the model with CNN features and captions as input and
    # predictions as output
    model = Model(inputs=[cnn_feats_input, caption_input],
                  outputs=pred_layer)

    optimizer = Adam(lr=0.0001,
                     beta_1=0.9,
                     beta_2=0.999,
                     epsilon=1e-8)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    model.summary()

    return model
The model (as it is above) compiles without any errors (see: model summary) and I managed to train it using my data. However, it doesn't take into account the fact that my sequences are zero-padded and the results won't be accurate because of this. When I try to change the Embedding layer in order to support masking (also making sure that I use voc_size + 1 instead of voc_size, as it's mentioned in the documentation) like this:
embedded_caption = Embedding(input_dim=voc_size + 1,
                             output_dim=embed_size,
                             input_length=maxlen,
                             mask_zero=True)(caption_input)
I get the following error:
Traceback (most recent call last):
File "/export/home/.../py3_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1567, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 200 and 1. Shapes are [200] and [1]. for 'concatenate_1/concat_1' (op: 'ConcatV2') with input shapes: [?,1,200], [?,25,1], [] and with computed input tensors: input[2] = <1>
I don't know why it says the shape of the second array is [?, 25, 1], as I am printing its shape before the concatenation and it's [?, 25, 200] (as it should be).
I don't understand why there'd be an issue with a model that compiles and works fine without that parameter, but I assume there's something I am missing.
I have also been thinking about using a Masking layer instead of mask_zero=True, but it should be before the Embedding and the documentation says that the Embedding layer should be the first layer in a model (after the input).
Is there anything I could change in order to fix this, or is there a workaround?
The non-equal shape error refers to the masks rather than the tensors/inputs. Since concatenate supports masking, it needs to handle mask propagation. Your final_cnn_feats doesn't have a mask (it is None), while your embedded_caption has a mask of shape (?, 25). You can verify this with:
print(embedded_caption._keras_history[0].compute_mask(caption_input))
Since final_cnn_feats has no mask, concatenate will give it an all non-zero mask for proper mask propagation. While this is correct, that mask has the same shape as final_cnn_feats, which is (?, 1, 200) rather than (?, 1), i.e. it masks every feature at every time step rather than just every time step. This is where the non-equal shape error comes from ((?, 1, 200) vs (?, 25)).
To fix it, you need to give final_cnn_feats a correct/matching mask. I'm not familiar with your project, but one option is to apply a Masking layer to final_cnn_feats, since it is designed to mask time steps:
final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))
This is correct only when not all 200 features of final_cnn_feats are zero, i.e. there is always at least one non-zero value in final_cnn_feats. Under that condition, the Masking layer will produce a (?, 1) mask and will not mask out the single time step of final_cnn_feats.
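Putting the pieces together, a minimal sketch of the fixed graph might look like this (embed_size=200 and maxlen=25 match the shapes in the error message; voc_size and cnn_feats_size are placeholder values):
from keras.layers import (Input, Dense, BatchNormalization, RepeatVector,
                          Embedding, Masking, LSTM, TimeDistributed, concatenate)
from keras.models import Model

voc_size, embed_size, maxlen, cnn_feats_size = 1000, 200, 25, 2048  # placeholders

cnn_feats_input = Input(shape=(cnn_feats_size,))
embedded_cnn_feats = Dense(embed_size)(BatchNormalization(axis=-1)(cnn_feats_input))
final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))  # now carries a (?, 1) mask

caption_input = Input(shape=(maxlen,))
embedded_caption = Embedding(input_dim=voc_size + 1, output_dim=embed_size,
                             input_length=maxlen, mask_zero=True)(caption_input)  # (?, 25) mask

x = concatenate([final_cnn_feats, embedded_caption], axis=1)  # masks now concatenate to (?, 26)
x = LSTM(embed_size, return_sequences=True)(x)
preds = TimeDistributed(Dense(voc_size + 1))(x)

model = Model([cnn_feats_input, caption_input], preds)
model.summary()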
I have a question about tflearn and using a CNN with it. I have a classification problem with n data variables (floats) and m classes. I tried to implement this by using
https://github.com/tflearn/tflearn/blob/master/examples/nlp/cnn_sentence_classification.py with my own dataset. But that example uses an embedding, which doesn't work for me (my input values are floats, so there are uncountably many possible values). If I just drop the line network = tflearn.embedding(network, input_dim=10000, output_dim=128), I no longer have a 3-D tensor as input for the following conv_1d layer. Could anyone help me? How can I bring my data into the right shape to apply some convolutions?
Thanks!!
Add an extra dimension at the end; conv_1d needs a channel dimension. Like so (I changed the second line; the first and third are copied for easier understanding):
network = input_data(shape=[None, 100], name='input')
network = tf.expand_dims(network, 2)
branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
Recently, I have been trying to use TensorFlow to implement a CNN+CTC network based on the paper Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks.
I try to feed a batch of spectrogram data (shape: (10, 120, 155, 3), batch_size is 10) into 10 convolution layers and 3 fully connected layers, so the output before the CTC layer is 2-D data (shape: (10, 1024)).
Here is my problem: I want to use the tf.nn.ctc_loss function from the TensorFlow library, but it generates ValueError: Dimension must be 2 but is 3 for 'transpose' (op: 'Transpose') with input shapes: [?,1024], [3].
I guess the error is related to the dimension of my 2-D input data. The description of the ctc_loss function on the TensorFlow site says it requires a 3-D input with the shape (batch_size x max_time x num_classes).
So, what is the extra 'num_classes' dimension, and what should I change the shape of my CNN+FC output data to?
The fully connected layers should be applied per time step, just like applying the same dense layer at each time step in a recurrent neural network. For the output of the convolution layer, the time step axis is the width.
So, for example, the output shapes would be:
convolution: (10, 120, 155, 3) = (batch, height, width, channels)
flatten: (10, 155, 120*3) = (batch, max_time, features)
fully connected: (10, 155, 1024) (the same dense layer applied per time step)
fully connected: (10, 155, num_classes)
This is the shape expected by ctc_loss in TensorFlow.
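A minimal sketch of that reshaping in TensorFlow (num_classes is a placeholder, and the 1024-unit dense layer mirrors the one described in the question):
import tensorflow as tf

batch, height, width, channels = 10, 120, 155, 3
num_classes = 30  # placeholder: vocabulary size + 1 for the CTC blank

conv_out = tf.random.normal((batch, height, width, channels))  # stand-in for the last conv output

x = tf.transpose(conv_out, perm=[0, 2, 1, 3])          # (10, 155, 120, 3): width becomes the time axis
x = tf.reshape(x, (batch, width, height * channels))   # (10, 155, 360) = (batch, max_time, features)
x = tf.keras.layers.Dense(1024, activation='relu')(x)  # (10, 155, 1024): same dense weights at every time step
logits = tf.keras.layers.Dense(num_classes)(x)         # (10, 155, num_classes), as ctc_loss expects
print(logits.shape)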
Would the code below represent one or two layers? I'm confused because isn't there also supposed to be an input layer in a neural net?
input_layer = slim.fully_connected(input, 6000, activation_fn=tf.nn.relu)
output = slim.fully_connected(input_layer, num_output)
Does that contain a hidden layer? I'm just trying to be able to visualize the net. Thanks in advance!
You have a neural network with one hidden layer. In your code, input is the input layer, input_layer is the hidden layer, and output is the output layer.
Remember that the "input layer" of a neural network isn't a traditional fully-connected layer, since it's just the raw data without an activation; the name is a bit of a misnomer. The units of the input layer are not the same kind of thing as the neurons in the hidden layer or the output layer.
From tensorflow-slim:
Furthermore, TF-Slim's slim.stack operator allows a caller to repeatedly apply the same operation with different arguments to create a stack or tower of layers. slim.stack also creates a new tf.variable_scope for each operation created. For example, a simple way to create a Multi-Layer Perceptron (MLP):
# Verbose way:
x = slim.fully_connected(x, 32, scope='fc/fc_1')
x = slim.fully_connected(x, 64, scope='fc/fc_2')
x = slim.fully_connected(x, 128, scope='fc/fc_3')
# Equivalent, TF-Slim way using slim.stack:
slim.stack(x, slim.fully_connected, [32, 64, 128], scope='fc')
So the network built here is a [32, 64, 128] stack: three fully connected layers of 32, 64, and 128 units.