I am building a model for human face segmentation into skin and non-skin areas. As a starting point I am using the model/method shown here, with a dense layer added at the end with sigmoid activation. The model works very well for my purpose, giving a good Dice score. It uses 2 pre-trained layers from ResNet50 as the backbone for feature extraction. I have read several articles, books and code but couldn't find any information on how to determine which layers to choose for feature extraction.
I compared the ResNet50 architecture with Xception, picked two similar layers, replaced the layers in the original network (here) and ran the training. I got similar results, neither better nor worse.
I have the following questions:
How do I determine which layer is responsible for low-level/high-level features?
Is using only a few pre-trained layers any better than using the full pre-trained network, in terms of training time and the number of trainable parameters?
Where can I find more information about using only some layers from pre-trained networks?
Here is the code for a quick overview:
def DeeplabV3Plus(image_size, num_classes):
    model_input = keras.Input(shape=(image_size, image_size, 3))
    resnet50 = keras.applications.ResNet50(
        weights="imagenet", include_top=False, input_tensor=model_input)
    x = resnet50.get_layer("conv4_block6_2_relu").output
    x = DilatedSpatialPyramidPooling(x)
    input_a = layers.UpSampling2D(size=(image_size // 4 // x.shape[1], image_size // 4 // x.shape[2]), interpolation="bilinear")(x)
    input_b = resnet50.get_layer("conv2_block3_2_relu").output
    input_b = convolution_block(input_b, num_filters=48, kernel_size=1)
    x = layers.Concatenate(axis=-1)([input_a, input_b])
    x = convolution_block(x)
    x = convolution_block(x)
    x = layers.UpSampling2D(size=(image_size // x.shape[1], image_size // x.shape[2]), interpolation="bilinear")(x)
    model_output = layers.Conv2D(num_classes, kernel_size=(1, 1), padding="same")(x)
    return keras.Model(inputs=model_input, outputs=model_output)
And here is my modified code using Xception layers as the backbone:
def DeeplabV3Plus(image_size, num_classes):
    model_input = keras.Input(shape=(image_size, image_size, 3))
    Xception_model = keras.applications.Xception(
        weights="imagenet", include_top=False, input_tensor=model_input)
    xception_x1 = Xception_model.get_layer("block9_sepconv3_act").output
    x = DilatedSpatialPyramidPooling(xception_x1)
    input_a = layers.UpSampling2D(size=(image_size // 4 // x.shape[1], image_size // 4 // x.shape[2]), interpolation="bilinear")(x)
    input_a = layers.AveragePooling2D(pool_size=(2, 2))(input_a)
    xception_x2 = Xception_model.get_layer("block4_sepconv1_act").output
    input_b = convolution_block(xception_x2, num_filters=256, kernel_size=1)
    x = layers.Concatenate(axis=-1)([input_a, input_b])
    x = convolution_block(x)
    x = convolution_block(x)
    x = layers.UpSampling2D(size=(image_size // x.shape[1], image_size // x.shape[2]), interpolation="bilinear")(x)
    x = layers.Conv2D(num_classes, kernel_size=(1, 1), padding="same")(x)
    model_output = layers.Dense(x.shape[2], activation='sigmoid')(x)
    return keras.Model(inputs=model_input, outputs=model_output)
Thanks in advance!
In general, the first layers (the ones closer to the input) are the ones responsible for learning low-level, generic features (edges, textures, colors), whereas the last layers learn high-level, dataset/task-specific features. This is the reason why, when doing transfer learning, you usually want to remove only the last few layers and replace them with others that can deal with your specific problem.
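If you want to see where a given layer sits in that hierarchy, one practical trick is to list the backbone's layer names together with their output shapes: layers with large spatial resolution are low-level taps, and the resolution shrinks (while the channel count grows) as features become more high-level. A minimal sketch (the input size and the "relu" name filter are just illustrative choices):

import tensorflow as tf
from tensorflow import keras

# Build the backbone once, without the classification head.
backbone = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                       input_shape=(512, 512, 3))

# Large spatial size -> low-level features; small spatial size -> high-level.
for layer in backbone.layers:
    if "relu" in layer.name:
        print(layer.name, layer.output.shape)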
It depends. Transferring the whole network, without deleting or adding any layer and with everything frozen, basically means the network won't learn anything new (if you leave the layers unfrozen, you are fine-tuning instead). On the other hand, if you delete some layers and add a few new ones, then the number of trainable parameters depends only on the new layers you just added.
What I suggest you do is the following (see the sketch below):
Delete the last few layers from the pre-trained network, freeze the remaining layers, and add a few new layers (even just one)
Train the new network with a certain learning rate (usually this learning rate is not very low)
Fine-tune: unfreeze all the layers, lower the learning rate, and re-train the whole network
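A minimal Keras sketch of that recipe, assuming a binary task; the backbone, head size, learning rates and epoch counts are placeholder choices:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

# Add a small task-specific head.
inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)  # keep BatchNorm in inference mode
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

# Step 1: train only the head with a normal learning rate.
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="binary_crossentropy")
# model.fit(train_ds, epochs=5)  # train_ds is a placeholder dataset

# Step 2: unfreeze everything, lower the learning rate, re-train.
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
# model.fit(train_ds, epochs=5)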
I'm trying to use LSTM networks on a simple dataset that has multiple different sequences of numbers representing musical data. The data is just a bunch of numpy arrays of floating-point numbers, with each song being one array. The data looks like this:
Song 1: [0.00013487907, 0.0002517006, 0.00021654845, ...]
Song 2: [-0.007279772, -0.011207076, -0.010082608, ...]
Song 3: [-0.00060827745, -0.00082834775, -0.0006534484, ...]
...and so on
I have done this before for MIDI files, but those require embeddings of the different characters; however, this is continuous data as opposed to discrete data, so I'm not sure what the input model should look like and how the data can be loaded for this particular task. For example, for the MIDI file project the input had an embedding layer to the model:
batch_size = 16
seq_length = 64
num_epochs = 100
optimizer_ = tf.keras.optimizers.Adam()
model = Sequential()
model.add(Embedding(input_dim = num_unique_chars, output_dim = 512, batch_input_shape = (batch_size, seq_length)))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(num_unique_chars)))
model.add(Activation("softmax"))
model.compile(loss = "categorical_crossentropy", optimizer = optimizer_, metrics = ["accuracy"])
I want to know how to do the same without tokenization/embedding, feed each song into the model separately, and then be able to generate samples from it.
I've tried looking for examples of this but everything related to LSTM networks seems to be text-based. Would appreciate any help/guidance with this!
Thanks
If you already have continuous values, you do not need an Embedding layer. Either you pass the data directly into the LSTMs, or you put a Dense layer in between. Additionally, you can add a Masking layer, depending on your data (see the sketch at the end of this answer).
You also have to adjust the shape of your data to (batch_size, seq_len, 1): you only have one feature, but the input still has to be a 3-D (batch, time, features) tensor so the time series is recognizable as such.
Here is a minimum working example with a Dense layer in place of the Embedding layer (which doesn't apply to continuous data):
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import Sequential
batch_size = 16
seq_length = 64
num_epochs = 100
num_unique_chars = 55 # I just picked any number
optimizer_ = tf.keras.optimizers.Adam()
model = Sequential()
model.add(layers.Dense(256, use_bias=False))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.TimeDistributed(layers.Dense(num_unique_chars)))
model.add(layers.Activation("softmax"))
model.compile(loss = "categorical_crossentropy", optimizer = optimizer_, metrics = ["accuracy"])
test_data = tf.random.normal(shape=(batch_size, seq_length, 1))
test_out = model(test_data)
print(test_out.shape)
Output: (16, 64, 55)
P. S.: With Dense layers, the TimeDistributed layer is optional; a Dense layer just manipulates the last dimension of its input tensor anyway.
P. P. S.: I think that for your limited number of features, three LSTM layers with a dimension of 256 might easily result in over-fitting or other unpleasant effects, so it might be useful to reduce the number of layers and their dimension. (Of course, this does not address your initial question.)
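Regarding the Masking layer mentioned above: if your songs have different lengths and you pad them to a common length, a Masking layer tells the LSTMs to skip the padded steps. A minimal sketch (the padding value 0.0 is just a convention you would have to reserve):

import tensorflow as tf
from tensorflow.keras import Sequential, layers

model = Sequential()
# Time steps whose features all equal mask_value are skipped by the LSTM.
model.add(layers.Masking(mask_value=0.0, input_shape=(64, 1)))
model.add(layers.LSTM(64, return_sequences=True))
model.add(layers.TimeDistributed(layers.Dense(1)))

padded = tf.random.normal((4, 64, 1)).numpy()
padded[:, 50:, :] = 0.0  # pretend the last 14 steps are padding
print(model(padded).shape)  # (4, 64, 1)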
I want to understand how to train a Hugging Face transformer model (like BERT, DistilBERT, etc.) for a question-answering system with TensorFlow as the backend. Following is the logic I am currently using (but I am not sure whether it is the right approach):
I am using the SQuAD v1.1 dataset.
In the SQuAD dataset, the answer to any question is always present in the context. So, to put it in simple words, I am trying to predict the start index and the end index of the answer.
I have transformed the dataset for this purpose: after performing tokenization, I added the start index and end index on the word level. Here is how my dataset looks.
Next, I encode the question and context as per the Hugging Face docs and return input_ids, attention_mask and token_type_ids, which will be used as inputs to the model.
def tokenize(questions, contexts):
    input_ids, input_masks, input_segments = [], [], []
    for question, context in tqdm_notebook(zip(questions, contexts)):
        inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, max_length=512, pad_to_max_length=True, return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])
    return [np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')]
Finally, I define a Keras model which takes these three inputs and predicts two values: the start and end word index of the answer in the given context.
input_ids_in = tf.keras.layers.Input(shape=(512,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(512,), name='masked_token', dtype='int32')
input_segment_in = tf.keras.layers.Input(shape=(512,), name='segment_token', dtype='int32')
embedding_layer = transformer_model({'inputs': input_ids_in, 'attention_mask': input_masks_in, 'token_type_ids': input_segment_in})[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
start_branch = tf.keras.layers.Dense(1024, activation='relu')(X)
start_branch = tf.keras.layers.Dropout(0.3)(start_branch)
start_branch_output = tf.keras.layers.Dense(512, activation='softmax', name='start_branch')(start_branch)
end_branch = tf.keras.layers.Dense(1024, activation='relu')(X)
end_branch = tf.keras.layers.Dropout(0.3)(end_branch)
end_branch_output = tf.keras.layers.Dense(512, activation='softmax', name='end_branch')(end_branch)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in, input_segment_in], outputs = [start_branch_output, end_branch_output])
I am using a final softmax layer with 512 units because 512 is my maximum number of tokens, and my aim is to predict the index from it.
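For reference, a minimal sketch of how this two-headed model could be compiled and trained; start_idx and end_idx are assumed to be integer arrays holding the answer's start/end token positions (illustrative names):

# Assumed inputs: the three arrays returned by tokenize(), plus integer
# label arrays start_idx / end_idx (hypothetical names).
model.compile(
    optimizer=tf.keras.optimizers.Adam(3e-5),
    loss={'start_branch': 'sparse_categorical_crossentropy',
          'end_branch': 'sparse_categorical_crossentropy'},
    metrics=['accuracy'])
model.fit([input_ids, input_masks, input_segments],
          {'start_branch': start_idx, 'end_branch': end_idx},
          batch_size=8, epochs=2)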
I am developing an autoencoder for clustering certain groups of images.
input_images->...->bottleneck->...->output_images
I have calibrated the autoencoder to my satisfaction and saved the model; everything has been developed using keras.tensorflow on python3.
The next step is to apply the autoencoder to a ton of images and cluster them according to cosine distance in the bottleneck layer. Oops, I just realized that I don't know the syntax in keras.tf for running the model on a batch up to a specific layer rather than to the output layer. Thus the question:
How do I run something like Model.predict_on_batch or Model.predict_generator up to the certain "bottleneck" layer and retrieve the values on that layer rather than the values on the output layer?
You need to define a new model (if you didn't define the encoder and decoder as separate models initially, which is usually the easiest option).
If your model was defined without reusing layers, it's just:
inputs = model.input
outputs = model.get_layer('bottleneck').output
encoder = Model(inputs, outputs)
Use the encoder model as any other model.
The full code would look like this:
from tensorflow.keras import models
from tensorflow.keras.layers import Input, Dense

# ENCODER
encoding_dim = 37310
input_layer = Input(shape=(encoding_dim,))
encoder = Dense(500, activation='tanh')(input_layer)
encoder = Dense(100, activation='tanh')(encoder)
encoder = Dense(50, activation='tanh', name='bottleneck_layer')(encoder)
decoder = Dense(100, activation='tanh')(encoder)
decoder = Dense(500, activation='tanh')(decoder)
decoder = Dense(37310, activation='sigmoid')(decoder)
# full model
model_full = models.Model(input_layer, decoder)
model_full.compile(optimizer='adam', loss='mse')
model_full.fit(x, y, epochs=20, batch_size=16)
# bottleneck model
bottleneck_output = model_full.get_layer('bottleneck_layer').output
model_bottleneck = models.Model(inputs = model_full.input, outputs = bottleneck_output)
bottleneck_predictions = model_bottleneck.predict(X_test)
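To then cluster by cosine distance in the bottleneck space, one option (an assumption on my part, using scikit-learn) is to L2-normalize the embeddings, so that euclidean distance is monotonic in cosine distance, and run any standard clusterer; the number of clusters is a placeholder:

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# L2-normalize so euclidean distance corresponds to cosine distance.
embeddings = normalize(bottleneck_predictions)
cluster_labels = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings)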
I am trying to train an object detection model as described in this paper.
There are 3 fully connected layers with 512, 512, and 25 neurons. The 16x55x55 feature map from the last convolutional layer is fed into the fully connected layers to retrieve the appropriate class. At this stage, every grid cell, described by a 16x1x1 vector, is fed into the fully connected layers to classify it as belonging to one of the 25 classes. The structure can be seen in the picture below
fully connected layers
I am trying to adapt the code from the TF MNIST classification tutorial, and I would like to know if it is okay to just sum the losses from each grid, as in the code snippet below, and use that sum to train the model weights.
flat_fmap = tf.reshape(last_conv_layer, [-1, 16*55*55])

total_loss = 0
for grid in flat_fmap:
    dense1 = tf.layers.dense(inputs=grid, units=512, activation=tf.nn.relu)
    dense2 = tf.layers.dense(inputs=dense1, units=512, activation=tf.nn.relu)
    logits = tf.layers.dense(inputs=dense2, units=25)
    total_loss += tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(
    loss=total_loss,
    global_step=tf.train.get_global_step())

return tf.estimator.EstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=total_loss, train_op=train_op)
In the code above, I think at every iteration 3 new layers are being created. However, I would like the weights to be preserved when classifying one grid and then another.
Adding to total_loss should be OK.
tf.losses.sparse_softmax_cross_entropy is also adding losses together: it calculates a sparse softmax from the logits and then reduces the resulting array through a sum using math_ops.reduce_sum, as you can see in its source.
So you are adding them together, one way or another.
The for loop in the network declaration seems unusual; it probably makes more sense to declare the layers once and pass each grid through the feed_dict at run time:
# Placeholders for one grid's features and labels, fed at run time.
X = tf.placeholder(tf.float32, shape=[None, 16])
labels = tf.placeholder(tf.int32, shape=[None])

dense1 = tf.layers.dense(inputs=X, units=512, activation=tf.nn.relu)
dense2 = tf.layers.dense(inputs=dense1, units=512, activation=tf.nn.relu)
logits = tf.layers.dense(inputs=dense2, units=25)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
init = tf.global_variables_initializer()

total_loss = 0
with tf.Session() as sess:
    sess.run(init)
    for grid in flat_fmap:
        # grid_labels: the labels belonging to this grid (placeholder name)
        _, l = sess.run([optimizer, loss], feed_dict={X: grid, labels: grid_labels})
        total_loss += l
I'm trying to use VGG16 network to do image classification. I've tried two different ways to do it which should be approximately equivalent as far as I understand, yet the results are very different.
Method 1: Extract features using VGG16 and fit these features using a custom fully connected network. Here is the code:
model = vgg16.VGG16(include_top=False, weights='imagenet',
                    input_shape=(imsize, imsize, 3),
                    pooling='avg')

model_pred = keras.Sequential()
model_pred.add(keras.layers.Dense(1024, input_dim=512, activation='sigmoid'))
model_pred.add(keras.layers.Dropout(0.5))
model_pred.add(keras.layers.Dense(512, activation='sigmoid'))
model_pred.add(keras.layers.Dropout(0.5))
model_pred.add(keras.layers.Dense(num_categories, activation='sigmoid'))
model_pred.compile(loss=keras.losses.categorical_crossentropy,
                   optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])

(xtr, ytr) = tools.extract_features(model, 3000, imsize, datagen,
                                    rootdir + '/train',
                                    pickle_name=rootdir + '/testpredstrain.pickle')
(xv, yv) = tools.extract_features(model, 300, imsize, datagen,
                                  rootdir + '/valid1',
                                  pickle_name=rootdir + '/testpredsvalid.pickle')

model_pred.fit(xtr, ytr, epochs=10, validation_data=(xv, yv), verbose=1)
(The function extract_features() simply uses the Keras ImageDataGenerator to generate sample images and returns the backbone's output after calling model.predict() on those images.)
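extract_features() is not shown in the question; a hypothetical reconstruction consistent with that description might look like this (all names and defaults are guesses, and the pickle caching is omitted):

import numpy as np

def extract_features(model, n_samples, imsize, datagen, directory, pickle_name=None):
    # Draw batches from the directory and collect the backbone's pooled
    # features together with the one-hot labels.
    gen = datagen.flow_from_directory(directory, target_size=(imsize, imsize),
                                      batch_size=32)
    xs, ys, seen = [], [], 0
    for batch_x, batch_y in gen:
        xs.append(model.predict(batch_x))
        ys.append(batch_y)
        seen += len(batch_x)
        if seen >= n_samples:
            break
    return np.concatenate(xs)[:n_samples], np.concatenate(ys)[:n_samples]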
Method 2: Take the VGG16 network without the top part, set all the convolutional layers to non-trainable, and add a few densely connected layers that are trainable. Then fit using Keras fit_generator(). Here is the code:
model2 = vgg16.VGG16(include_top=False, weights='imagenet',
                     input_shape=(imsize, imsize, 3),
                     pooling='avg')

for ll in model2.layers:
    ll.trainable = False

out1 = keras.layers.Dense(1024, activation='softmax')(model2.layers[-1].output)
out1 = keras.layers.Dropout(0.4)(out1)
out1 = keras.layers.Dense(512, activation='softmax')(out1)
out1 = keras.layers.Dropout(0.4)(out1)
out1 = keras.layers.Dense(num_categories, activation='softmax')(out1)

model2 = keras.Model(inputs=model2.input, outputs=out1)
model2.compile(loss=keras.losses.categorical_crossentropy,
               optimizer=keras.optimizers.Adadelta(),
               metrics=['accuracy'])

model2.fit_generator(train_gen,
                     steps_per_epoch=100,
                     epochs=10,
                     validation_data=valid_gen,
                     validation_steps=10)
The number of epochs, samples, etc. are not exactly the same in both methods, but they don't need to be to notice the inconsistency: method 1 yields a validation accuracy of 0.47 after just one epoch and gets as high as 0.7-0.8, and even better when I use a larger number of samples to fit. Method 2, however, gets stuck at a validation accuracy of 0.1-0.15 and never improves no matter how much I train.
Also, method 2 is considerably slower than method 1, even though it seems to me that they should be approximately as fast (when taking into account the time it takes to extract the features in method 1).
With your first method you extract the features with the pre-trained VGG16 model once, and then you train/fine-tune your small network on those cached features, whereas in your second approach you are constantly passing your images through every layer, including VGG's layers, at every epoch. That is why your model runs slower with the second method.