Is sigmoid function only applicable after dense() layer? - tensorflow

I am building a network with Keras that is similar to SE-Net (https://github.com/titu1994/keras-squeeze-excite-network/blob/master/se.py), though it differs in some details.
Suppose I want to build a layer sequence like this:
import keras
# None for height/width allows variable-sized RGB inputs
inputs = keras.Input((None, None, 3))
x1 = keras.layers.Conv2D(filters=32, kernel_size=(3, 3))(inputs)
# keepdims=True (available in recent Keras/TF versions) keeps a (batch, 1, 1, channels) tensor so Conv2D can follow
x_gp = keras.layers.GlobalAveragePooling2D(keepdims=True)(x1)
x2 = keras.layers.Conv2D(filters=32, kernel_size=(1, 1))(x_gp)
x3 = keras.layers.Conv2D(filters=8, kernel_size=(1, 1))(x2)
x2_ = keras.layers.Conv2D(filters=32, kernel_size=(1, 1))(x3)
x_se = keras.layers.Activation("sigmoid")(x2_)
I would like to know whether sigmoid can be applied to x2_ like this, that is, after a convolutional layer rather than a Dense layer. Please tell me if I am doing something wrong.

You can certainly experiment with sigmoid as an activation for CNN layers too, but there are reasons why sigmoid is rarely used with CNN layers:
1. The sigmoid function is monotonic but its derivative is not: it peaks at 0.25 at x = 0 and approaches 0 for large |x|, so gradients can vanish and training can get stuck.
2. The sigmoid output range is (0, 1), so the outputs are not zero-centered.
If you are experimenting with sigmoid in CNN layers, I would suggest using it for only a few layers.
You can also give swish a try.
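To make the two points above concrete, here is a small standard-library sketch (not Keras code) that evaluates sigmoid, its derivative, and swish at a few inputs. The helper names are made up for illustration, and swish is taken here as x * sigmoid(x).

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s * (1 - s), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def swish(x: float) -> float:
    """Swish activation, assumed here to be x * sigmoid(x)."""
    return x * sigmoid(x)

# The gradient is at most 0.25 and shrinks quickly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  "
          f"grad={sigmoid_grad(x):.5f}  swish={swish(x):.5f}")
```

Stacking many sigmoid layers multiplies these small gradients together, which is one reason to keep sigmoid to a few layers (for example, the gating output of an SE block).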

Related

RNN LSTM network for inputting a sequence of numbers

I'm trying to use LSTM networks to input a simple dataset that has multiple different sequences of numbers that represent musical data. The data is just a bunch of numpy arrays of floating point numbers with each song being one array. The data looks like this:
Song 1: [0.00013487907, 0.0002517006, 0.00021654845, ...]
Song 2: [-0.007279772, -0.011207076, -0.010082608, ...]
Song 3: [-0.00060827745, -0.00082834775, -0.0006534484, ...]
..and so on
I have done this for MIDI files before, but those required embeddings of the different characters. This data is continuous rather than discrete, so I'm not sure what the input layers of the model should look like or how the data should be loaded for this task. For example, in the MIDI project the model started with an embedding layer:
batch_size = 16
seq_length = 64
num_epochs = 100
optimizer_ = tf.keras.optimizers.Adam()
model = Sequential()
model.add(Embedding(input_dim = num_unique_chars, output_dim = 512, batch_input_shape = (batch_size, seq_length)))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(num_unique_chars)))
model.add(Activation("softmax"))
model.compile(loss = "categorical_crossentropy", optimizer = optimizer_, metrics = ["accuracy"])
I want to know how to do the same without tokenization/embedding, feeding each song into the model separately, and then be able to generate samples from the trained model.
I've tried looking for examples of this but everything related to LSTM networks seems to be text-based. Would appreciate any help/guidance with this!
Thanks
If you already have continuous values, you do not need an Embedding layer. You can either pass the data directly into the LSTMs or put a Dense layer in between. You can also add a Masking layer (depending on your data).
You also have to reshape your data to (batch_size, seq_len, 1): you only have one feature, but the LSTM still expects a 3D input with explicit time and feature axes.
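As a rough illustration of that reshaping, the sketch below cuts one song into non-overlapping windows of shape (n_windows, seq_length, 1) using plain Python lists. The function name make_windows and the toy song are assumptions for this example, not part of the original code.

```python
from typing import List

def make_windows(song: List[float], seq_length: int) -> List[List[List[float]]]:
    """Cut one song into non-overlapping windows shaped (n_windows, seq_length, 1)."""
    n_windows = len(song) // seq_length  # the incomplete tail is dropped
    return [
        # wrap each value in its own list to create the trailing feature axis
        [[value] for value in song[i * seq_length:(i + 1) * seq_length]]
        for i in range(n_windows)
    ]

song = [0.001 * i for i in range(10)]  # stand-in for one song array
windows = make_windows(song, seq_length=4)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 2 4 1
```

With NumPy arrays the same result can be obtained with song[:n_windows * seq_length].reshape(-1, seq_length, 1).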
Here is a minimal working example with a Dense layer in place of the Embedding layer, which does not work for continuous input:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import Sequential
batch_size = 16
seq_length = 64
num_epochs = 100
num_unique_chars = 55 # I just picked any number
optimizer_ = tf.keras.optimizers.Adam()
model = Sequential()
model.add(layers.Dense(256, use_bias=False))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.TimeDistributed(layers.Dense(num_unique_chars)))
model.add(layers.Activation("softmax"))
model.compile(loss = "categorical_crossentropy", optimizer = optimizer_, metrics = ["accuracy"])
test_data = tf.random.normal(shape=(batch_size, seq_length, 1))
test_out = model(test_data)
print(test_out.shape)
Output: (16, 64, 55)
P. S.: With Dense layers the TimeDistributed-layer is optional. The Dense layer will just manipulate the last dimension of its input tensor.
P. P. S.: I think for your limited amount of features, three LSTM-layers with a dimension of 256 might easily result in over-fitting or some other unpleasant effects. So it might be useful to reduce the number of layers and their dimension. (Of course, this does not target your initial question)

Keras accuracy not increasing

I am trying to perform sentiment classification using Keras. I am trying to do this using a basic neural network (no RNN or other more complex type). However when I run the script I see no increase in accuracy during training/evaluation. I am guessing I am setting up the output layer incorrectly but I am not sure of that. y_train is a list [1,2,3,1,2,4,5] (5 different labels) containing the targets belonging to the features in X_train_seq_padded. The setup is as follows:
padding_len = 24 # len of each tokenized sentence
neurons = 16 # 2/3 the length of the text that is padded
model = Sequential()
model.add(Dense(neurons, input_dim = padding_len, activation = 'relu', name = 'hidden-1'))
model.add(Dense(neurons, activation = 'relu', name = 'hidden-2'))
model.add(Dense(neurons, activation = 'relu', name = 'hidden-3'))
model.add(Dense(1, activation = 'sigmoid', name = 'output_layer'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
callbacks = [EarlyStopping(monitor = 'accuracy', patience = 5, mode = 'max')]
history = model.fit(X_train_seq_padded, y_train, epochs = 100, batch_size = 64, callbacks = callbacks)
First of all, in the setup above you use sigmoid as the activation of the last layer. Sigmoid is generally used for binary or multi-label classification, and in that case the loss function should be binary_crossentropy.
If your labels are multi-class and one-hot encoded, the last layer should be Dense(num_classes, activation='softmax') and the loss function should be categorical_crossentropy.
If you keep the multi-class labels as integers instead of one-hot encoding them, the last layer and loss function should be
Dense(num_classes) # with logits
SparseCategoricalCrossentropy(from_logits=True)
Or, (#Frightera)
Dense(num_classes, activation='softmax') # with probabilities
SparseCategoricalCrossentropy(from_logits=False)
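The two sparse variants above compute the same quantity. The following standard-library sketch (not the Keras implementation) shows this for one sample, with made-up logits. It also shifts the labels from the question (1..5) to 0..4, since integer labels for a sparse loss are class indices in the range 0..num_classes-1.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cce_from_probs(label: int, probs: List[float]) -> float:
    """Cross-entropy for an integer label given class probabilities."""
    return -math.log(probs[label])

def sparse_cce_from_logits(label: int, logits: List[float]) -> float:
    """Same loss computed directly from logits via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

y_train = [1, 2, 3, 1, 2, 4, 5]          # labels as given in the question
labels = [y - 1 for y in y_train]        # shift to class indices 0..4
logits = [0.5, 2.0, -1.0, 0.0, 1.0]      # made-up output of Dense(5) for one sample

loss_a = sparse_cce_from_logits(labels[0], logits)
loss_b = sparse_cce_from_probs(labels[0], softmax(logits))
print(labels)
print(abs(loss_a - loss_b) < 1e-9)
```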

What does recurrent_initializer do?

I am experimenting with recurrent neural network layers in tensorflow & keras and I am having a look at the recurrent_initializer. I wanted to know more about its influence on the layer, so I created a SimpleRNN layer as follows:
rnn_layer = keras.layers.SimpleRNN(1, return_sequences=True, kernel_initializer = keras.initializers.ones, recurrent_initializer=keras.initializers.zeros, activation="linear")
Running this code makes the addition in the recurrent net visible:
inp = np.zeros(shape=(1,1,20), dtype=np.float32)
for i in range(20):
    inp[0][0][:i] = 5
    #inp[0][0][i:] = 0
    print(f"i:{i} {rnn_layer(inp)}")
output:
i:0 [[[0.]]]
i:1 [[[5.]]]
i:2 [[[10.]]]
i:3 [[[15.]]]
i:4 [[[20.]]]
i:5 [[[25.]]]
i:6 [[[30.]]]
i:7 [[[35.]]]
i:8 [[[40.]]]
i:9 [[[45.]]]
i:10 [[[50.]]]
i:11 [[[55.]]]
i:12 [[[60.]]]
i:13 [[[65.]]]
i:14 [[[70.]]]
i:15 [[[75.]]]
i:16 [[[80.]]]
i:17 [[[85.]]]
i:18 [[[90.]]]
i:19 [[[95.]]]
Now I change the recurrent_initializer to something different, like a glorot_normal distribution:
rnn_layer = keras.layers.SimpleRNN(1, return_sequences=True, kernel_initializer = keras.initializers.ones, recurrent_initializer=keras.initializers.glorot_normal(seed=0), activation="linear")
But I still get the same results. I thought it might depend on some logic that an RNN lacks but an LSTM has, so I tried it with an LSTM, but the results were still the same. I guess there is something about the recurrent logic that I am still missing. Can someone explain what the purpose of recurrent_initializer is and how it affects the recurrent layer?
Thanks a lot!
Your input to the RNN layer has shape (1, 1, 20), which means one sample with a single timestep and 20 features. By default an RNN layer starts every batch from a zero state, and the recurrent weights only multiply the state carried over from the previous timestep, so with a single timestep you can't see the effect of the recurrent ops (and hence of recurrent_initializer).
You have to increase the sequence length of your input:
import numpy as np
import tensorflow as tf
inp = np.ones(shape=(5, 4, 1), dtype=np.float32) # sequence length == 4
rnn_layer1 = tf.keras.layers.LSTM(1, return_state=True, return_sequences=False,
    kernel_initializer=tf.keras.initializers.ones,
    recurrent_initializer=tf.keras.initializers.zeros, activation="linear")
rnn_layer2 = tf.keras.layers.LSTM(1, return_state=True, return_sequences=False,
    kernel_initializer=tf.keras.initializers.ones,
    recurrent_initializer=tf.keras.initializers.glorot_normal(seed=0),
    activation="linear")
first_sample = inp[0:1, :, :] # shape (1, 4, 1)
# with return_state=True each call returns [output, hidden_state, cell_state]
print(rnn_layer1(first_sample))
print(rnn_layer2(first_sample))
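The same effect can be seen without Keras in a stripped-down scalar recurrence. This is only a sketch under simplifying assumptions (one unit, one feature, linear activation, no bias, zero initial state); the function name simple_rnn_linear is made up for illustration.

```python
from typing import List

def simple_rnn_linear(xs: List[float], w_in: float, w_rec: float) -> float:
    """Scalar linear RNN: h_t = w_in * x_t + w_rec * h_{t-1}, starting from h = 0."""
    h = 0.0
    for x in xs:
        h = w_in * x + w_rec * h
    return h

one_step = [5.0]                  # sequence length 1, as in the question
four_steps = [1.0, 1.0, 1.0, 1.0]  # sequence length 4, as in the answer

for w_rec in (0.0, 0.5):
    print(w_rec,
          simple_rnn_linear(one_step, 1.0, w_rec),
          simple_rnn_linear(four_steps, 1.0, w_rec))
```

With a single timestep the recurrent weight only ever multiplies the zero initial state, so both values of w_rec give the same output. With four timesteps the outputs differ.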

How to compute saliency map using keras backend

I am trying to construct a basic "vanilla gradient" saliency heatmap (gradient-based feature attribution) for MNIST using keras. I know there are libraries such as this one to compute saliency heatmaps, but I would like to construct this from scratch since the vanilla gradient approach seems conceptually straightforward to implement. I have trained the following digit classifier in Keras using functional model definition:
input = layers.Input(shape=(28,28,1), name='input')
conv2d_1 = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(input)
maxpooling2d_1 = layers.MaxPooling2D(pool_size=(2, 2), name='maxpooling2d_1')(conv2d_1)
conv2d_2 = layers.Conv2D(64, kernel_size=(3, 3), activation='relu')(maxpooling2d_1)
maxpooling2d_2 = layers.MaxPooling2D(pool_size=(2, 2))(conv2d_2)
flatten = layers.Flatten(name='flatten')(maxpooling2d_2)
dropout = layers.Dropout(0.5, name='dropout')(flatten)
dense = layers.Dense(num_classes, activation='softmax', name='dense')(dropout)
model = keras.models.Model(inputs=input, outputs=dense)
Now, I want to compute the saliency map for a single MNIST image. Since the final layer has a softmax activation and the denominator is a normalization term (so that the output nodes add up to 1), I believe that I need to either take the pre-softmax output or change the activation of the trained model to linear for computing saliency maps. I will do the latter.
model.layers[-1].activation = tf.keras.activations.linear # swap activation to linear
input = loaded_model.layers[0].input
output = loaded_model.layers[-1].output
input_image = x_test[0] # shape is (28, 28, 1)
pred = np.argmax(loaded_model.predict(np.expand_dims(input_image, axis=0))) # predicted class
However, I am not sure what to do beyond this. I know I can use the following K.gradients(output, input) to compute gradients. That being said, I believe I should compute the gradient of the predicted class with respect to the input image, versus computing the gradient of the entire output. How would I do this? Also, I'm not sure how to evaluate the saliency heatmap for a specific image/prediction. I imagine I will have to use sess = tf.keras.backend.get_session() and sess.run(), but not sure exactly. I would greatly appreciate any help with completing the saliency heatmap code. Thanks!
If you add the activation as a separate layer after the last dense layer (which then has no activation of its own) with:
keras.layers.Activation('softmax')
you can do:
linear_model = keras.Model(inputs=model.inputs, outputs=model.layers[-2].output)
and then compute the gradients like this:
def get_saliency_map(model, image, class_idx):
    # image needs a batch axis, e.g. shape (1, 28, 28, 1)
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(image)
        predictions = model(image)
        loss = predictions[:, class_idx]
    # Get the gradients of the loss w.r.t. the input image.
    gradient = tape.gradient(loss, image)
    # take maximum across channels
    gradient = tf.reduce_max(gradient, axis=-1)
    # convert to numpy
    gradient = gradient.numpy()
    # normalize between 0 and 1
    min_val, max_val = np.min(gradient), np.max(gradient)
    smap = (gradient - min_val) / (max_val - min_val + keras.backend.epsilon())
    return smap
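To illustrate what "gradient of one class score with respect to the input" means, here is a toy standard-library sketch. It replaces the CNN with a made-up linear model (2 classes, 3 "pixels") and estimates the gradient by finite differences instead of tf.GradientTape. All names and numbers are assumptions for illustration only.

```python
from typing import List

def class_score(weights: List[List[float]], x: List[float], class_idx: int) -> float:
    """Pre-softmax score of one class for a linear model: dot(weights[class_idx], x)."""
    return sum(w * v for w, v in zip(weights[class_idx], x))

def numeric_gradient(weights, x, class_idx, eps=1e-6):
    """Central finite-difference gradient of a single class score w.r.t. the input."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        diff = class_score(weights, up, class_idx) - class_score(weights, down, class_idx)
        grads.append(diff / (2 * eps))
    return grads

def min_max_normalize(values, eps=1e-7):
    """Scale values into [0, 1]; eps guards against division by zero."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]

weights = [[0.2, -0.5, 0.1], [1.0, 0.0, -2.0]]  # made-up model weights
x = [0.3, 0.7, 0.9]                              # made-up input "image"
grad = numeric_gradient(weights, x, class_idx=1)
smap = min_max_normalize(grad)
print([round(g, 4) for g in grad])
print([round(s, 4) for s in smap])
```

For a linear model the gradient of class 1's score is simply that class's weight row, which is why only the selected class (not the whole output vector) determines the map.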

Keras Conv2DTranspose layers in Convolutional GAN

I'm trying to train a Convolutional GAN in Keras with Tensorflow backend for generating faces. Having read several examples there seem to be two ways to build the generator, you can either use the Conv2DTranspose layer with strides to upsample, like so:
def build_generator(seed_size, channels):
    inputs_rand = Input(shape=(seed_size,))
    inputs_feat = Input(shape=(NUM_FEATS,))
    inputs = Concatenate()([inputs_rand, inputs_feat])
    dense1 = Dense(4*4*64, activation='relu')(inputs)
    reshape1 = Reshape((4,4,64))(dense1)
    conv_trans1 = Conv2DTranspose(64, kernel_size=5, strides=2*GENERATE_RES, padding='same')(reshape1)
    batch_norm1 = BatchNormalization(momentum=0.8)(conv_trans1)
    leaky_relu1 = ReLU()(batch_norm1)
    conv_trans2 = Conv2DTranspose(64, kernel_size=5, strides=2, padding='same')(leaky_relu1)
    batch_norm2 = BatchNormalization(momentum=0.8)(conv_trans2)
    leaky_relu2 = ReLU()(batch_norm2)
    conv_trans3 = Conv2DTranspose(64, kernel_size=5, strides=2, padding='same')(leaky_relu2)
    batch_norm3 = BatchNormalization(momentum=0.8)(conv_trans3)
    leaky_relu3 = ReLU()(batch_norm3)
    output = Conv2DTranspose(channels, kernel_size=3, padding='same', activation='tanh')(leaky_relu3)
    generator = Model(inputs=[inputs_rand, inputs_feat], outputs=[output, inputs_feat])
    return generator
or use the UpSampling2D layer along with Conv2D layers, like so:
def build_generator(seed_size, channels):
    inputs_rand = Input(shape=(seed_size,))
    inputs_feat = Input(shape=(NUM_FEATS,))
    inputs = Concatenate()([inputs_rand, inputs_feat])
    dense1 = Dense(4*4*64, activation='relu')(inputs)
    reshape1 = Reshape((4,4,64))(dense1)
    upsamp1 = UpSampling2D(2*GENERATE_RES)(reshape1)
    conv_trans1 = Conv2D(64, kernel_size=5, padding='same')(upsamp1)
    batch_norm1 = BatchNormalization(momentum=0.8)(conv_trans1)
    leaky_relu1 = ReLU()(batch_norm1)
    upsamp2 = UpSampling2D()(leaky_relu1)
    conv_trans2 = Conv2D(64, kernel_size=5, padding='same')(upsamp2)
    batch_norm2 = BatchNormalization(momentum=0.8)(conv_trans2)
    leaky_relu2 = ReLU()(batch_norm2)
    upsamp3 = UpSampling2D()(leaky_relu2)
    conv_trans3 = Conv2D(64, kernel_size=5, padding='same')(upsamp3)
    batch_norm3 = BatchNormalization(momentum=0.8)(conv_trans3)
    leaky_relu3 = ReLU()(batch_norm3)
    output = Conv2D(channels, kernel_size=3, padding='same', activation='tanh')(leaky_relu3)
    generator = Model(inputs=[inputs_rand, inputs_feat], outputs=[output, inputs_feat])
    return generator
I have read in a few places that Conv2DTranspose is preferable, however I can't seem to get it working. It just produces a repeating noise pattern according to the strides, and then no matter how long I leave it to train, it stays the same. Meanwhile the other method seems to work well enough, but I would like to get both methods working (just to satisfy my own curiosity). I think I must be doing something wrong, but my code looks pretty much the same as other examples I've found and I can't find anyone else having this sort of problem.
I have tried a few tweaks to the model, for example adding dropout and removing the batch normalisation, just in case there was a simple fix, but nothing seems to work. I haven't included the rest of my code to keep things tidy, but I can add it if that would help.
This is the noise obtained when using the Conv2DTranspose layers.
Meanwhile the Upsampling with Conv2D layers produce these, for example.
Any comments and suggestions on how to improve my results would be welcomed too.
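One thing that may be worth checking in the Conv2DTranspose version is the relation between stride and kernel size in the first transposed convolution. The value of GENERATE_RES is not shown, so this is only a possible explanation, not a confirmed diagnosis. The 1D sketch below (plain Python, hypothetical helper name) counts how many input positions contribute to each output position over the span covered by the kernel copies. It ignores the cropping that padding='same' applies.

```python
from typing import List

def transposed_conv_coverage(input_len: int, kernel_size: int, stride: int) -> List[int]:
    """Count how many input positions contribute to each output position (1D)."""
    # span covered when each input paints a kernel-sized patch every `stride` steps
    out_len = (input_len - 1) * stride + kernel_size
    coverage = [0] * out_len
    for i in range(input_len):
        start = i * stride
        for k in range(kernel_size):
            coverage[start + k] += 1
    return coverage

# stride 2 with kernel 5: every output position is covered, though unevenly
print(transposed_conv_coverage(4, kernel_size=5, stride=2))
# stride 8 with kernel 5 (e.g. if GENERATE_RES were 4): zeros mark uncovered gaps
print(transposed_conv_coverage(4, kernel_size=5, stride=8))
```

When the stride exceeds the kernel size, some output positions receive no contribution from any input and can only carry the bias, which produces a grid-like pattern repeating at the stride. Even with a smaller stride, uneven coverage is a known source of checkerboard artifacts, and choosing a kernel size divisible by the stride makes the coverage uniform away from the borders.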