How does transfer learning on EfficientNets work for grayscale images? - tensorflow

My question is more about how the algorithm works. I have successfully built and trained an EfficientNet model on grayscale images, and now I want to understand why it works.
The most important aspect here is that grayscale images have a single channel. When I set channels=1, the model doesn't work because, if I understand correctly, it was built for 3-channel images. When I set channels=3, it works perfectly.
So my question is: when I set channels=3 and feed the model preprocessed images with a single channel, why does it still work?
Code for EfficientNetB5:
# Variable assignments
num_classes = 9
img_height = 84
img_width = 112
channels = 3
batch_size = 32
# Make the input layer
new_input = Input(shape=(img_height, img_width, channels),
# Download and use EfficientNetB5
tmp = tf.keras.applications.EfficientNetB5(include_top=False,
model = Sequential()
model.add(tmp) # adding EfficientNetB5
Code for preprocessing into grayscale:
data_generator = ImageDataGenerator(
train_generator = data_generator.flow_from_directory(
target_size=(img_height, img_width),
color_mode="grayscale", ###################################

I dug into what happens when you give grayscale images to EfficientNet models with three-channel inputs.
Here are the first layers of EfficientNet B5 with input_shape (128, 128, 3):
Layer (type)                      Output Shape            Param #  Connected to
input_7 (InputLayer)              [(None, 128, 128, 3)]   0        []
rescaling_7 (Rescaling)           (None, 128, 128, 3)     0        ['input_7[0][0]']
normalization_13 (Normalization)  (None, 128, 128, 3)     7        ['rescaling_7[0][0]']
tf.math.truediv_4 (TFOpLambda)    (None, 128, 128, 3)     0        ['normalization_13[0][0]']
stem_conv_pad (ZeroPadding2D)     (None, 129, 129, 3)     0        ['tf.math.truediv_4[0][0]']
And here is the shape of the output of each of these layers when the model has as input a grayscale image:
input_7 (128, 128, 1)
rescaling_7 (128, 128, 1)
normalization_13 (128, 128, 3)
tf.math.truediv_4 (128, 128, 3)
stem_conv_pad (129, 129, 3)
As you can see, the number of channels in the output tensor switches from 1 to 3 at the normalization_13 layer, so let's see what this layer is actually doing.
The Normalization layer performs this operation on the input tensor:
(input_tensor - self.mean) / sqrt(self.var)
The number of channels changes after the subtraction. As a matter of fact, self.mean looks like this:
<tf.Tensor: shape=(1, 1, 1, 3), dtype=float32, numpy=array([[[[0.485, 0.456, 0.406]]]], dtype=float32)>
So self.mean has three channels. When a tensor with three channels is subtracted from a tensor with one channel, broadcasting expands the single channel, and the output looks like this: [firstTensor - secondTensorFirstChannel, firstTensor - secondTensorSecondChannel, firstTensor - secondTensorThirdChannel]
And this is how the magic happens, and this is why the model can take grayscale images as input!
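To make the broadcasting step concrete, here is a minimal NumPy sketch of the same subtraction. It is not the Keras layer itself: the random `gray` batch is a made-up stand-in for a rescaled grayscale image, and only the mean values are taken from the tensor printed above.

```python
import numpy as np

# Hypothetical batch holding one grayscale image: (batch, height, width, 1).
gray = np.random.rand(1, 128, 128, 1).astype(np.float32)

# Per-channel means shaped like the layer's self.mean: (1, 1, 1, 3).
mean = np.array([[[[0.485, 0.456, 0.406]]]], dtype=np.float32)

# Broadcasting stretches the size-1 channel axis of `gray` to match `mean`.
centered = gray - mean
print(centered.shape)  # (1, 128, 128, 3)

# Each output channel is the same grayscale image minus that channel's mean.
for c in range(3):
    assert np.allclose(centered[..., c], gray[..., 0] - mean[0, 0, 0, c])
```

In other words, the single gray channel is copied three times and each copy is shifted by a different constant, which is why the downstream 3-channel convolution receives a valid input.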
I have checked this with EfficientNet B5 and with EfficientNet B2V2. Even though they declare the Normalization layer differently, the process is the same. I suppose this is also the case for the other EfficientNet models.
I hope it was clear enough!

This is interesting. If training still works with channels = 3 even though the input is grayscale, I would check the batch shape of the train_generator (maybe print a couple of batches to get a feel for it). Here is a code snippet to quickly check the batch shape. (plotImages() is available in the TensorFlow docs.)
imgs,labels = next(train_generator)
print('Batch shape: ',imgs.shape)


How to create joint loss with paired Dataset samples in Tensorflow Keras API?

I'm trying to train an autoencoder, with constraints that force one or more of the hidden/encoded nodes/neurons to have an interpretable value. My training approach uses paired images (though after training the model should operate on a single image) and utilizes a joint loss function that includes (1) the reconstruction loss for each of the images and (2) a comparison between values of the hidden/encoded vector, from each of the two images.
I've created an analogous simple toy problem and model to make this clearer. In the toy problem, the autoencoder is given a vector of length 3 as input. The encoding uses one dense layer to compute the mean (a scalar) and another dense layer to compute some other representation of the vector (given my construction, it will likely just learn an identity matrix, i.e., copy the input vector). See the figure below. The lowest node of the hidden layer is intended to compute the mean of the input vector. The rest of the hidden nodes are unconstrained aside from having to accommodate a reconstruction that matches the input.
The figure below exhibits how I wish to train the model, using paired images. "MSE" is mean-squared-error, although the identity of the actual function is not important for the question I'm asking here. The loss function is the sum of the reconstruction loss and the mean-estimation loss.
I've tried creating (1) a tf.data.Dataset to generate paired vectors, (2) a Keras model, and (3) a custom loss function. However, I'm failing to understand how to do this correctly for this particular situation.
I can't get model.fit() to run correctly, or to associate the model outputs with the Dataset targets as intended. See code and errors below. Can anyone help? I've done many Google and Stack Overflow searches and still don't understand how I can implement this.
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
DTYPE = tf.dtypes.float32
N_VEC = 3
def my_generator(n):
    while True:
        # Create two identical vectors of length n, except with different means.
        # An internal layer (single neuron) of the model should predict the
        # mean of the input vector. To train it to do so, with paired
        # vector inputs, use a loss function that penalizes incorrect
        # predictions of the difference of the means of two input vectors.
        input_vec1 = tf.random.normal((n,), dtype=DTYPE)
        target_mean_diff = tf.random.normal((1,), dtype=DTYPE)
        input_vec2 = input_vec1 + target_mean_diff
        # Model is a constrained autoencoder. Output targets are
        # identical to the input vectors. Including them as explicit
        # targets in this generator, for generalization.
        target_vec1 = tf.identity(input_vec1)
        target_vec2 = tf.identity(input_vec2)
        yield ({'input_vec1':input_vec1,
                'input_vec2':input_vec2},
               {'target_vec1':target_vec1,
                'target_vec2':target_vec2,
                'target_mean_diff':target_mean_diff})
def my_dataset(n, batch_size=4):
    ds = tf.data.Dataset.from_generator(
        lambda: my_generator(n),
        output_signature=({'input_vec1':tf.TensorSpec(shape=(n,), dtype=DTYPE),
                           'input_vec2':tf.TensorSpec(shape=(n,), dtype=DTYPE)},
                          {'target_vec1':tf.TensorSpec(shape=(n,), dtype=DTYPE),
                           'target_vec2':tf.TensorSpec(shape=(n,), dtype=DTYPE),
                           'target_mean_diff':tf.TensorSpec(shape=(1,), dtype=DTYPE)}))
    ds = ds.batch(batch_size)
    return ds
## Do a brief test using the Dataset
ds = my_dataset(N_VEC, batch_size=4)
ds_iter = iter(ds)
dict_inputs, dict_targets = next(ds_iter)
## Define the Model
layer_encode_vec = tf.keras.layers.Dense(N_VEC, activation=None, name='encode_vec')
layer_decode_vec = tf.keras.layers.Dense(N_VEC, activation=None, name='decode_vec')
layer_encode_mean = tf.keras.layers.Dense(1, activation=None, name='encode_mean')
layer_decode_mean = tf.keras.layers.Dense(N_VEC, activation=None, name='decode_mean')
input1 = tf.keras.Input(shape=(N_VEC,), name='input_vec1')
input2 = tf.keras.Input(shape=(N_VEC,), name='input_vec2')
vec_encoded1 = layer_encode_vec(input1)
vec_encoded2 = layer_encode_vec(input2)
mean_encoded1 = layer_encode_mean(input1)
mean_encoded2 = layer_encode_mean(input2)
mean_diff = mean_encoded2 - mean_encoded1
pred_vec1 = layer_decode_vec(vec_encoded1) + layer_decode_mean(mean_encoded1)
pred_vec2 = layer_decode_vec(vec_encoded2) + layer_decode_mean(mean_encoded2)
model = tf.keras.Model(inputs=[input1, input2], outputs=[pred_vec1, pred_vec2, mean_diff])
## Define the joint loss function
def loss_total(y_true, y_pred):
    loss_reconstruct = tf.reduce_mean(tf.keras.losses.MSE(y_true[0], y_pred[0]))/2 + \
                       tf.reduce_mean(tf.keras.losses.MSE(y_true[1], y_pred[1]))/2
    loss_mean = tf.reduce_mean(tf.keras.losses.MSE(y_true[2], y_pred[2]))
    return loss_reconstruct + loss_mean
## Compile model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss_total)
## Train model
history = model.fit(ds, epochs=10, steps_per_epoch=10)
Output: Example batch from the Dataset:
{'input_vec1': <tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[-0.53022575, -0.02389329, 0.32843253],
[-0.61793506, -0.8276422 , -1.3469328 ],
[-0.5401968 , 0.3141346 , -1.3638284 ],
[-1.2189807 , 0.23848908, 0.75108534]], dtype=float32)>, 'input_vec2': <tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[-0.23415083, 0.27218163, 0.6245074 ],
[-0.57636774, -0.7860749 , -1.3053654 ],
[ 0.65463066, 1.508962 , -0.16900098],
[-0.49326736, 0.9642024 , 1.4767987 ]], dtype=float32)>}
{'target_vec1': <tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[-0.53022575, -0.02389329, 0.32843253],
[-0.61793506, -0.8276422 , -1.3469328 ],
[-0.5401968 , 0.3141346 , -1.3638284 ],
[-1.2189807 , 0.23848908, 0.75108534]], dtype=float32)>, 'target_vec2': <tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[-0.23415083, 0.27218163, 0.6245074 ],
[-0.57636774, -0.7860749 , -1.3053654 ],
[ 0.65463066, 1.508962 , -0.16900098],
[-0.49326736, 0.9642024 , 1.4767987 ]], dtype=float32)>, 'target_mean_diff': <tf.Tensor: shape=(4, 1), dtype=float32, numpy=
[1.1948274 ],
[0.7257133 ]], dtype=float32)>}
Output: The model summary:
Model: "model"
Layer (type) Output Shape Param # Connected to
input_vec1 (InputLayer) [(None, 3)] 0
input_vec2 (InputLayer) [(None, 3)] 0
encode_vec (Dense) (None, 3) 12 input_vec1[0][0]
encode_mean (Dense) (None, 1) 4 input_vec1[0][0]
decode_vec (Dense) (None, 3) 12 encode_vec[0][0]
decode_mean (Dense) (None, 3) 6 encode_mean[0][0]
tf.__operators__.add (TFOpLambd (None, 3) 0 decode_vec[0][0]
tf.__operators__.add_1 (TFOpLam (None, 3) 0 decode_vec[1][0]
tf.math.subtract (TFOpLambda) (None, 1) 0 encode_mean[1][0]
Total params: 34
Trainable params: 34
Non-trainable params: 0
Output: The error message when calling model.fit():
Epoch 1/10
ValueError Traceback (most recent call last)
ValueError: Found unexpected keys that do not correspond to any
Model output: dict_keys(['target_vec1', 'target_vec2', 'target_mean_diff']).
Expected: ['tf.__operators__.add', 'tf.__operators__.add_1', 'tf.math.subtract']
You can pass a dict to Model for both inputs and outputs like so:
model = tf.keras.Model(
    inputs={"input_vec1": input1, "input_vec2": input2},
    outputs={
        "target_vec1": pred_vec1,
        "target_vec2": pred_vec2,
        "target_mean_diff": mean_diff,
    },
)
which avoids having to name the output layers.
For the losses, Keras currently applies loss_total to each of the 3 outputs individually and sums the results to get the final loss, which is not what you want. So you can either specify each of the losses individually:
model.compile(
    optimizer=optimizer,
    loss={"target_vec1": "mse", "target_vec2": "mse", "target_mean_diff": "mse"},
    loss_weights={"target_vec1": 0.5, "target_vec2": 0.5, "target_mean_diff": 1},
)
or you can manually train the model using a modified loss function that takes dict input. Something like:
def loss_total(y_true, y_pred):
    loss_reconstruct = (
        tf.reduce_mean(tf.keras.losses.MSE(y_true["target_vec1"], y_pred["target_vec1"])) / 2
        + tf.reduce_mean(tf.keras.losses.MSE(y_true["target_vec2"], y_pred["target_vec2"])) / 2
    )
    loss_mean = tf.reduce_mean(tf.keras.losses.MSE(y_true["target_mean_diff"], y_pred["target_mean_diff"]))
    return loss_reconstruct + loss_mean
for epoch in range(10):
    # zip with range(10) limits each epoch to 10 batches of the infinite dataset
    for batch, (x, y) in zip(range(10), ds):
        with tf.GradientTape() as tape:
            outputs = model(x, training=True)
            loss = loss_total(y, outputs)
        trainable_vars = model.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        optimizer.apply_gradients(zip(gradients, trainable_vars))
        print(f"Batch: {batch}, loss: {loss.numpy()}")
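The two options in this answer should produce the same loss value, since the joint loss is just a weighted sum of per-output mean squared errors. The NumPy sketch below checks that arithmetic on made-up targets and predictions. It is only an illustration: the random arrays and the `mse` helper are assumptions, and regularization terms are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets and predictions for a batch of 4, keyed like the model outputs.
shapes = {"target_vec1": (4, 3), "target_vec2": (4, 3), "target_mean_diff": (4, 1)}
y_true = {k: rng.normal(size=s) for k, s in shapes.items()}
y_pred = {k: rng.normal(size=s) for k, s in shapes.items()}

def mse(a, b):
    # Mean squared error over every element, mirroring reduce_mean(MSE(...)).
    return float(np.mean((a - b) ** 2))

# Joint loss written out the way loss_total combines its three terms.
joint = (
    mse(y_true["target_vec1"], y_pred["target_vec1"]) / 2
    + mse(y_true["target_vec2"], y_pred["target_vec2"]) / 2
    + mse(y_true["target_mean_diff"], y_pred["target_mean_diff"])
)

# The same quantity expressed as per-output losses with loss weights.
weights = {"target_vec1": 0.5, "target_vec2": 0.5, "target_mean_diff": 1.0}
weighted = sum(w * mse(y_true[k], y_pred[k]) for k, w in weights.items())

assert np.isclose(joint, weighted)
```

The custom training loop is therefore mainly worth the extra code when the terms interact in a way that per-output losses cannot express.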

How to replace the input channel shape from (224, 224, 3) to (224, 224, 1) in VGG16?

I am using VGG16 for transfer learning. My images are grayscale, so I need to change the input shape of VGG16 from (224, 224, 3) to (224, 224, 1). I tried the following code and got this error:
TypeError: build() takes from 1 to 2 positional arguments but 4 were given
Can anyone help me see where I am going wrong?
vgg16_model = load_model('Fetched_VGG.h5')
# transform the model to Sequential
model = Sequential()
for layer in vgg16_model.layers[1:-1]:
    model.add(layer)
# Freezing the layers (prevents the weights from being updated)
for layer in model.layers:
    layer.trainable = False
model.build(224,224,1)
model.add(Dense(2, activation='softmax', name='predictions'))
You can't. Even if you get rid of the input layer, this model has a graph that has already been compiled, and your first conv layer expects an input with 3 channels. I don't think there is an easy workaround to make it accept 1 channel, if there is one at all.
Instead, repeat your data along the third dimension so that the same grayscale image fills all 3 bands in place of RGB. That works just fine.
If your image has the shape (224, 224, 1):
import numpy as np
gray_image_3band = np.repeat(gray_img, repeats = 3, axis = -1)
If your image has the shape (224, 224):
gray_image_3band = np.repeat(gray_img[..., np.newaxis], repeats = 3, axis = -1)
This way you don't need to call model.build() anymore; keep the input layer. But if you ever do want to call it, you need to pass the shape as a single tuple, like this:
model.build((224, 224, 1)) # this is correct, notice the parentheses
(Keras treats the first entry of that tuple as the batch size, so (None, 224, 224, 1) is the usual form.)
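As a quick sanity check of the channel-repeat approach, the following sketch runs `np.repeat` on both single-image layouts and on a whole batch, then confirms that the three bands are identical. The random arrays are placeholders for real grayscale data, and the batch size of 8 is arbitrary.

```python
import numpy as np

# Placeholder grayscale images in the two layouts discussed above.
gray_hwc = np.random.rand(224, 224, 1)  # (H, W, 1)
gray_hw = np.random.rand(224, 224)      # (H, W)

rgb_a = np.repeat(gray_hwc, repeats=3, axis=-1)
rgb_b = np.repeat(gray_hw[..., np.newaxis], repeats=3, axis=-1)
print(rgb_a.shape, rgb_b.shape)  # (224, 224, 3) (224, 224, 3)

# The same call works on a whole batch shaped (N, H, W, 1).
batch = np.random.rand(8, 224, 224, 1)
batch_rgb = np.repeat(batch, repeats=3, axis=-1)
print(batch_rgb.shape)  # (8, 224, 224, 3)

# All three bands carry the same grayscale values.
assert np.array_equal(rgb_a[..., 0], rgb_a[..., 2])
assert np.array_equal(batch_rgb[..., 1], batch[..., 0])
```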

How to Feed Batched Sequences of Images through Tensorflow conv2d

This seems like a trivial question, but I've been unable to find the answer.
I have batched sequences of images of shape:
[batch_size, number_of_frames, frame_height, frame_width, number_of_channels]
and I would like to pass each frame through a few convolutional and pooling layers. However, TensorFlow's conv2d layer accepts 4D inputs of shape:
[batch_size, frame_height, frame_width, number_of_channels]
My first attempt was to use tf.map_fn over axis=1, but I discovered that this function does not propagate gradients.
My second attempt was to use tf.unstack over the first dimension and then use tf.while_loop. However, my batch_size and number_of_frames are dynamically determined (i.e. both are None), and tf.unstack raises {ValueError} Cannot infer num from shape (?, ?, 30, 30, 3) if num is unspecified. I tried specifying num=tf.shape(self.observations)[1], but this raises {TypeError} Expected int for argument 'num' not <tf.Tensor 'A2C/infer/strided_slice:0' shape=() dtype=int32>.
Since all the images (number_of_frames) are passed to the same convolutional model, you can merge the batch and frame dimensions and apply an ordinary 2D convolution. This can be achieved with tf.reshape, as shown below:
import numpy as np
import tensorflow as tf
# input with size [batch_size, number_of_frames, frame_height, frame_width, number_of_channels]
x = tf.placeholder(tf.float32,[None, None,32,32,3])
# reshape for the conv input
x_reshapped = tf.reshape(x,[-1, 32, 32, 3])
# x_reshapped output size will be (50, 32, 32, 3) for a (10, 5, 32, 32, 3) input
# define your conv network
y = tf.layers.conv2d(x_reshapped,5,kernel_size=(3,3),padding='SAME')
# (50, 32, 32, 5): the channel axis now holds the 5 filters
# Get back the batch and frame dimensions (reshape y, not x)
out = tf.reshape(y,[-1, tf.shape(x)[1], 32, 32, 5])
# The leading dimensions match the input again: (10, 5, 32, 32, 5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(out, {x:np.random.normal(size=(10,5,32,32,3))}).shape)
#(10, 5, 32, 32, 5)
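The reshape trick relies on the fold and unfold steps keeping frames in order. The NumPy sketch below illustrates that bookkeeping with small, made-up dimensions. The `features` line is only a stand-in for a real per-frame convolution, chosen so the example runs without TensorFlow.

```python
import numpy as np

batch_size, num_frames, h, w, c = 2, 3, 4, 4, 3
videos = np.arange(batch_size * num_frames * h * w * c, dtype=np.float32)
videos = videos.reshape(batch_size, num_frames, h, w, c)

# Fold batch and frame axes together so a 2D conv sees one image per row.
frames = videos.reshape(-1, h, w, c)  # (6, 4, 4, 3)

# Frame t of video b lands at index b * num_frames + t.
assert np.array_equal(frames[1 * num_frames + 2], videos[1, 2])

# Stand-in for a conv stack mapping (N, h, w, c) to (N, h, w, filters).
filters = 5
features = np.repeat(frames.mean(axis=-1, keepdims=True), filters, axis=-1)

# Unfold back to (batch, frames, h, w, filters); note the new channel count.
restored = features.reshape(batch_size, num_frames, h, w, filters)
print(restored.shape)  # (2, 3, 4, 4, 5)
assert np.array_equal(restored[1, 2], features[1 * num_frames + 2])
```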

Data Preprocessing - Input Shape for TimeDistributed CNN (LRCN) & ConvLSTM2D for Video Classification

I'm trying to do binary classification on labeled data for 300+ videos. The goal is to extract features using a ConvNet and feed them into an LSTM for sequencing, with a binary output after evaluating all the frames in the video. I've preprocessed each video to have exactly 200 frames, with each image being 256 x 256, so that it would be easier to feed into a DNN, and split the dataset into two folders as labels (e.g. dog and cat).
However, after searching Stack Overflow for hours, I'm still unsure how to reshape the dataset of video frames so that the model accounts for the number of frames. I'm trying to feed the video frames into a 3D ConvNet and a TimeDistributed (2D ConvNet) + LSTM model (e.g. with shape (300, 200, 256, 256, 3)) with no luck. I'm able to perform 2D ConvNet classification pretty easily (the data is a 4D tensor, and I need to add a time step dimension to make it a 5D tensor), but I'm now having issues wrangling the temporal aspect.
I've been using Keras ImageDataGenerator and train_datagen.flow_from_directory to read in the images and have been running into shape mismatch errors when I attempt to feed them to a TimeDistributed ConvNet. I know that, hypothetically, if I have an X_train dataset I can potentially do X_train = X_train.reshape(...). Any example code would be very much appreciated.
I think you could use ConvLSTM2D in Keras for your purpose. ImageDataGenerator is very good for CNNs with images, but may not be convenient for a CRNN with videos.
You have already transformed your 300 videos into the same shape (200, 256, 256, 3): each video has 200 frames, and each frame is 256x256 RGB. Next, you need to load them into a numpy array of shape (300, 200, 256, 256, 3). For reading videos into numpy arrays see this answer.
Then you can feed the data into a CRNN. Its first ConvLSTM2D layer should have input_shape = (200, 256, 256, 3). Keras adds the batch dimension itself, so the full input is (None, 200, 256, 256, 3).
A sample according to your data (only illustrative and not tested):
from keras.models import Sequential
from keras.layers import Dense, ConvLSTM2D
model = Sequential()
# input_shape excludes the batch dimension: (frames, height, width, channels)
model.add(ConvLSTM2D(filters = 32, kernel_size = (5, 5), input_shape = (200, 256, 256, 3)))
### model.add(...more layers, ending with Flatten() so that Dense sees a 2D tensor)
model.add(Dense(units = num_of_categories, # num of your video categories
                kernel_initializer = 'Orthogonal', activation = 'softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
# then train it
model.fit(video_data, # placeholder name for your array of shape (300, 200, 256, 256, 3)
          [list of categories],
          batch_size = 20,
          epochs = 50,
          validation_split = 0.1)
I hope this is a little helpful.
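Loading the videos into a single array, as described above, might look like the sketch below. It uses tiny made-up dimensions and zero-filled arrays in place of decoded frames. The names `videos`, `X`, and `y` and the labels are hypothetical.

```python
import numpy as np

# Small stand-in sizes; the real data would be 300 videos of 200 x 256 x 256 x 3.
num_videos, num_frames, height, width, channels = 4, 5, 8, 8, 3

# Hypothetical per-video arrays, one (frames, H, W, C) array per video.
videos = [
    np.zeros((num_frames, height, width, channels), dtype=np.uint8)
    for _ in range(num_videos)
]

# Stack along a new leading axis to get (videos, frames, H, W, C).
X = np.stack(videos, axis=0)
y = np.array([0, 1, 0, 1])  # hypothetical binary labels, one per video

print(X.shape)  # (4, 5, 8, 8, 3)
```

At full size, 300 x 200 x 256 x 256 x 3 values take roughly 11.8 GB as uint8 and four times that as float32, so in practice the frames may need to be downsampled or fed in batches from disk.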

TFLearn LSTM Time Series Classification

I am trying to build an LSTM network which takes a sequence and classifies the last time step in each sequence.
This is what I have so far:
net = tf.input_data(shape=[None, 64, 17])
net = tf.lstm(net, 128, dropout=[.2,.8], return_seq=True)
net = tf.lstm(net, 128, dropout=[.2,.8], return_seq=True)
net = tf.lstm(net, 128, dropout=[.2,.8])
net = tf.fully_connected(net, 3, activation='softmax')
net = tf.regression(net, optimizer='adam', learning_rate=0.01, loss='categorical_crossentropy')
model = tf.DNN(net, tensorboard_verbose=0)
model.fit(trainX, trainY, validation_set=(testX,testY), show_metric=True, batch_size=None)
My data has been shaped into a large number of sequences, each 64 timesteps long. Each timestep has 17 features. The first sequence covers timesteps 0 to 63, the second covers timesteps 1 to 64, etc.
The network builds just fine, but in the fit method I get this error:
ValueError: Cannot feed value of shape (64,17) for Tensor 'InputData/X:0', which has shape (?,64,17)
Does anyone have a suggestion as to what my problem is?
It's not in your snippet, but it looks like trainX has the shape (64, 17). If so, you should reshape it to a batch of size 1:
trainX = np.expand_dims(trainX, 0) # now it's [1, 64, 17]
The same for testX.
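For the overlapping sequences described in the question (timesteps 0 to 63, then 1 to 64, and so on), the input would normally be a 3D array with one window per row. Below is a minimal NumPy sketch, assuming a made-up series of 100 timesteps. The names `series` and `windows` are illustrative only.

```python
import numpy as np

num_timesteps, window, num_features = 100, 64, 17
series = np.random.rand(num_timesteps, num_features)  # placeholder data

# One window per starting timestep: rows 0..63, then 1..64, and so on.
windows = np.stack(
    [series[i:i + window] for i in range(num_timesteps - window + 1)]
)
print(windows.shape)  # (37, 64, 17)

# A single (64, 17) sequence still needs a leading batch axis.
single = series[:window]
batched = np.expand_dims(single, 0)
print(batched.shape)  # (1, 64, 17)
```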