Custom linear transformation in keras - tensorflow

I want to build a custom layer in Keras that applies a linear transformation to the output of the last layer.
For example, given an output X from the last layer, my new layer should output X.dot(W)+b.
The shape of W is (49,10), the shape of X should be (64,49), and the shape of b is (10,).
However, the shape of X is (?, 7, 7, 64), and when I try to reshape it, it becomes shape=(64, ?). What does the question mark mean? What is a proper way to apply a linear transformation to the output of the last layer?

The question mark generally represents the batch size, which has no effect on the model architecture.
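You should be able to reshape your X with keras.layers.Reshape((64,49))(X). Note that Reshape only regroups the elements in their existing order and does not transpose; if each of the 64 rows should hold the 49 spatial values of one channel, use keras.layers.Reshape((49,64)) followed by keras.layers.Permute((2,1)) instead.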
You can wrap arbitrary tensorflow operations such as tf.matmul in a Lambda layer to include custom layers in your Keras model. Minimal working example that does the trick:
import tensorflow as tf
from keras.layers import Dense, Lambda, Input
from keras.models import Model
W = tf.random_normal(shape=(128,20))
b = tf.random_normal(shape=(20,))
inp = Input(shape=(10,))
x = Dense(128)(inp)
y = Lambda(lambda x: tf.matmul(x, W) + b)(x)
model = Model(inp, y)
Finally: refer to the Keras documentation on how to write custom layers with trainable weights.
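To make the shapes concrete, here is a minimal NumPy sketch of the same computation outside Keras. The batch size of 2 and the random values are assumptions for illustration only; the `?` in the Keras shape stands for that batch axis.

```python
import numpy as np

rng = np.random.default_rng(0)

batch = 2                                 # the "?" dimension; any value works
X = rng.normal(size=(batch, 7, 7, 64))    # stand-in for the previous layer's output
W = rng.normal(size=(49, 10))
b = rng.normal(size=(10,))

# Flatten the 7x7 grid to 49 positions, then move channels in front of positions.
X_flat = X.reshape(batch, 49, 64)         # (batch, 49, 64)
X_t = X_flat.transpose(0, 2, 1)           # (batch, 64, 49)

# matmul broadcasts over the batch axis: (batch, 64, 49) @ (49, 10) -> (batch, 64, 10)
Y = X_t @ W + b
print(Y.shape)  # (2, 64, 10)
```

Each sample gets its own (64, 10) result, so the batch axis never has to appear in W or b.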

Related

What would be the output from tensorflow dense layer if we assign itself as input and output while making a neural network?

I have been going through the implementation of neural network in openAI code for any Vanilla Policy Gradient (As a matter of fact, this part is used nearly everywhere). The code looks something like this :
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes) + [act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)
    return pi, logp, logp_pi
and this multi-layered perceptron network is defined as follows :
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(inputs=x, units=h, activation=activation)
    return tf.layers.dense(inputs=x, units=hidden_sizes[-1], activation=output_activation)
My question is: what does this mlp function return? I mean the structure or shape. Is it an N-dimensional tensor? If so, how is it given as an input to tf.random.categorical? If not, and it just has the shape [hidden_layer2, output], then what happened to the other layers? As per their website description, tf.random.categorical only takes a 2-D input. The complete code of OpenAI's VPG algorithm can be found here. The mlp is implemented here. I would be highly grateful if someone could tell me what this mlp_categorical_policy() is doing.
Note: The hidden size is [64, 64], the action dimension is 3
Thanks and cheers
Note that this is a discrete action space - there are action_space.n different possible actions at every step, and the agent chooses one.
To do this the MLP returns the logits (unnormalized log-probabilities) of the different actions. This is specified in the code by + [act_dim], which appends the number of actions as the size of the final MLP layer. Note that the last layer of an MLP is the output layer. The input layer is not specified in tensorflow; it is inferred from the inputs.
tf.random.categorical takes the logits and samples a policy action pi from them, which is returned as a number.
mlp_categorical_policy also returns logp, the log probability of the action a (used to assign credit), and logp_pi, the log probability of the policy action pi.
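The three return values can be reproduced with a small NumPy sketch. This is an illustration, not the OpenAI code: the random `logits` stand in for the MLP output, and the batch of 4 and the actions in `a` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

act_dim = 3
logits = rng.normal(size=(4, act_dim))   # stand-in for the MLP output, batch of 4
a = np.array([0, 2, 1, 1])               # actions that were actually taken

logp_all = log_softmax(logits)           # (4, 3): log-probability of every action

# Sample one action per row, similar to tf.random.categorical(logits, 1) + squeeze.
pi = np.array([rng.choice(act_dim, p=np.exp(row)) for row in logp_all])

one_hot = np.eye(act_dim)
logp = (one_hot[a] * logp_all).sum(axis=1)      # log-prob of the given actions
logp_pi = (one_hot[pi] * logp_all).sum(axis=1)  # log-prob of the sampled actions

print(pi.shape, logp.shape, logp_pi.shape)  # (4,) (4,) (4,)
```

Multiplying by the one-hot vector and summing simply picks out one entry of `logp_all` per row, which is why all three outputs have one value per batch element.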
It seems your question is more about the return from the mlp.
The mlp creates a series of fully connected layers in a loop. In each iteration of the loop, the mlp creates a new layer using the previous layer x as an input and assigns its output to overwrite x, with this line: x = tf.layers.dense(inputs=x, units=h, activation=activation).
So the output is not the same as the input: on each iteration x is overwritten with the value of the new layer. This is the same kind of coding trick as x = x + 1, which increments x by 1. This effectively chains the layers together.
The output of tf.layers.dense is a tensor of size [:,h] where : is the batch dimension (and can usually be ignored). The creation of the last layer happens outside the loop; the number of nodes in this layer is act_dim (so the shape is [:,3]). You can check the shape by doing this:
import tensorflow.compat.v1 as tf
import numpy as np
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)
obs = np.array([[1.0,2.0]])
logits = mlp(obs, [64, 64, 3], tf.nn.relu, None)
print(logits.shape)
result: TensorShape([1, 3])
Note that the observation in this case is [1.,2.], it is nested inside a batch of size 1.
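To see every intermediate shape rather than just the last one, the same chaining can be sketched in plain NumPy. The `dense` helper and `mlp_shapes` below are hypothetical stand-ins with random weights, written only to trace shapes; they are not the TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, units, activation=None):
    # A fully connected layer with freshly drawn random weights (illustration only).
    w = rng.normal(scale=0.1, size=(x.shape[-1], units))
    b = np.zeros(units)
    out = x @ w + b
    return activation(out) if activation is not None else out

def mlp_shapes(x, hidden_sizes, activation=np.tanh, output_activation=None):
    shapes = [x.shape]
    for h in hidden_sizes[:-1]:
        x = dense(x, h, activation)      # x now refers to the new layer's output
        shapes.append(x.shape)
    x = dense(x, hidden_sizes[-1], output_activation)
    shapes.append(x.shape)
    return x, shapes

obs = np.array([[1.0, 2.0]])             # batch of one observation with two features
logits, shapes = mlp_shapes(obs, [64, 64, 3])
print(shapes)  # [(1, 2), (1, 64), (1, 64), (1, 3)]
```

The hidden layers are still there; only the final 2-D tensor of shape (batch, act_dim) is returned, which is what tf.random.categorical expects.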

Batch normalization layer for CNN-LSTM

Suppose that I have a model like this (this is a model for time series forecasting):
ipt = Input((data.shape[1] ,data.shape[2])) # 1
x = Conv1D(filters = 10, kernel_size = 3, padding = 'causal', activation = 'relu')(ipt) # 2
x = LSTM(15, return_sequences = False)(x) # 3
x = BatchNormalization()(x) # 4
out = Dense(1, activation = 'relu')(x) # 5
Now I want to add batch normalization layer to this network. Considering the fact that batch normalization doesn't work with LSTM, Can I add it before Conv1D layer? I think it's rational to have a batch normalization layer after LSTM.
Also, where can I add Dropout in this network? The same places? (after or before batch normalization?)
What about adding AveragePooling1D between Conv1D and LSTM? Is it possible to add batch normalization between Conv1D and AveragePooling1D in this case without any effect on LSTM layer?
Update: the LayerNormalization implementation I was using was inter-layer, not recurrent as in the original paper; results with latter may prove superior.
BatchNormalization can work with LSTMs - the linked SO gives false advice; in fact, in my application of EEG classification, it dominated LayerNormalization. Now to your case:
"Can I add it before Conv1D"? Don't - instead, standardize your data beforehand, else you're employing an inferior variant to do the same thing
Try both: BatchNormalization before an activation, and after - apply to both Conv1D and LSTM
If your model is exactly as you show it, BN after LSTM may be counterproductive per ability to introduce noise, which can confuse the classifier layer - but this is about being one layer before output, not LSTM
If you aren't using stacked LSTM with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere - before LSTM, after, or both
Spatial Dropout: drop units / channels instead of random activations (see bottom); was shown more effective at reducing coadaptation in CNNs in paper by LeCun, et al, w/ ideas applicable to RNNs. Can considerably increase convergence time, but also improve performance
recurrent_dropout is still preferable to Dropout for LSTM - however, you can do both; just do not use it with activation='relu', for which LSTM is unstable per a bug
For data of your dimensionality, any sort of Pooling is redundant and may harm performance; scarce data is better transformed via a non-linearity than simple averaging ops
I strongly recommend a SqueezeExcite block after your Conv; it's a form of self-attention - see paper; my implementation for 1D below
I also recommend trying activation='selu' with AlphaDropout and 'lecun_normal' initialization, per paper Self Normalizing Neural Networks
Disclaimer: above advice may not apply to NLP and embed-like tasks
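As a sketch of the "standardize your data beforehand" advice, per-feature standardization of a (samples, timesteps, features) array could look like the following in NumPy. The array sizes and the toy data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data shaped (samples, timesteps, features); the sizes are arbitrary.
train = rng.normal(loc=5.0, scale=3.0, size=(100, 21, 4))
test = rng.normal(loc=5.0, scale=3.0, size=(20, 21, 4))

# Per-feature statistics, computed over samples and timesteps of the training set only.
mean = train.mean(axis=(0, 1), keepdims=True)
std = train.std(axis=(0, 1), keepdims=True)

train_std = (train - mean) / std
test_std = (test - mean) / std   # reuse the training statistics, do not refit on test data

print(train_std.mean(axis=(0, 1)).round(6))
print(train_std.std(axis=(0, 1)).round(6))
```

Computing the statistics on the training split only avoids leaking information from the test split into the model's inputs.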
Below is an example template you can use as a starting point; I also recommend the following SO's for further reading: Regularizing RNNs, and Visualizing RNN gradients
from keras.layers import Input, Dense, LSTM, Conv1D, Activation
from keras.layers import AlphaDropout, BatchNormalization
from keras.layers import GlobalAveragePooling1D, Reshape, multiply
from keras.models import Model
import keras.backend as K
import numpy as np
def make_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = ConvBlock(ipt)
    x = LSTM(16, return_sequences=False, recurrent_dropout=0.2)(x)
    # x = BatchNormalization()(x)  # may or may not work well
    out = Dense(1, activation='relu')(x)  # the layer must be called on x to produce a tensor
    model = Model(ipt, out)
    model.compile('nadam', 'mse')
    return model
def make_data(batch_shape):  # toy data
    return (np.random.randn(*batch_shape),
            np.random.uniform(0, 2, (batch_shape[0], 1)))
batch_shape = (32, 21, 20)
model = make_model(batch_shape)
x, y = make_data(batch_shape)
model.train_on_batch(x, y)
Functions used:
def ConvBlock(_input):  # cleaner code
    x = Conv1D(filters=10, kernel_size=3, padding='causal', use_bias=False,
               kernel_initializer='lecun_normal')(_input)
    x = BatchNormalization(scale=False)(x)
    x = Activation('selu')(x)
    x = AlphaDropout(0.1)(x)
    out = SqueezeExcite(x)
    return out
def SqueezeExcite(_input, r=4):  # r == "reduction factor"; see paper
    filters = K.int_shape(_input)[-1]
    se = GlobalAveragePooling1D()(_input)
    se = Reshape((1, filters))(se)
    se = Dense(filters//r, activation='relu', use_bias=False,
               kernel_initializer='he_normal')(se)
    se = Dense(filters, activation='sigmoid', use_bias=False,
               kernel_initializer='he_normal')(se)
    return multiply([_input, se])
Spatial Dropout: pass noise_shape = (batch_size, 1, channels) to Dropout, so that whole channels are dropped across all timesteps; see Git gist for code.
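What a noise shape of (batch_size, 1, channels) does can be sketched in NumPy. The function `spatial_dropout_1d` is a hypothetical helper written for this illustration, not the Keras layer; it assumes the usual inverted-dropout rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_dropout_1d(x, rate):
    # x has shape (batch_size, timesteps, channels).
    batch_size, _, channels = x.shape
    # One keep/drop decision per (sample, channel), shared by all timesteps.
    keep = rng.random((batch_size, 1, channels)) >= rate
    # Inverted dropout: rescale the kept channels so the expected value is unchanged.
    return x * keep / (1.0 - rate)

x = np.ones((2, 5, 8))
y = spatial_dropout_1d(x, rate=0.5)
print(y[0])   # each column is either all zeros or all 2.0
```

Because the mask has size 1 along the time axis, broadcasting applies the same decision to every timestep of a channel, in contrast to ordinary dropout, which decides per activation.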

Unable to track record by record processing in LSTM algorithm for text classification?

We are working on multi-class text classification, and the following is the process we have used.
1) We created 300-dimensional word vectors with word2vec on our own data and then passed them as weights to the embedding layer in front of the LSTM.
2) We then used one LSTM layer and one dense layer.
Here below is my code:
input_layer = layers.Input((train_seq_x.shape[1], ))
embedding_layer = layers.Embedding(len(word_index)+1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)
lstm_layer1 = layers.LSTM(300,return_sequences=True,activation="relu")(embedding_layer)
lstm_layer1 = layers.Dropout(0.5)(lstm_layer1)
flat_layer = layers.Flatten()(lstm_layer1)
output_layer = layers.Dense(33, activation="sigmoid")(flat_layer)
model = models.Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy',metrics=['accuracy'])
Please help me out on the below questions:
Q1) Why did we pass the word embedding vectors (300 dimensions) as weights to the embedding layer?
Q2) How can we know the optimal number of neurons in the LSTM layer?
Q3) Can you please explain how a single record is processed by the LSTM?
Please let me know if you require more information.
Q1) Why did we pass the word embedding vectors (300 dimensions) as weights to the embedding layer?
In a very simplistic way, you can think of an embedding layer as a lookup table which converts a word (represented by its index in a dictionary) to a vector. It is a trainable layer. Since you already have trained word embeddings, instead of initializing the embedding layer with random weights you initialize it with the vectors you have learned.
Embedding(len(word_index)+1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
So here you are:
creating an embedding layer, i.e. a lookup table which can look up word indices 0 to len(word_index).
Each looked-up word maps to a vector of size 300.
This lookup table is loaded with the vectors from "embedding_matrix" (which comes from your pretrained model).
trainable=False freezes the weights in this layer.
You have passed 300 because it is the vector size of your pretrained model (embedding_matrix).
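The lookup-table view can be illustrated with a few lines of NumPy. The toy vocabulary, the embedding size of 4 (instead of 300) and the random matrix are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

word_index = {"it": 1, "was": 2, "a": 3, "good": 4, "movie": 5}   # toy vocabulary
embedding_dim = 4                                                  # 300 in the question

# One row per index; row 0 is reserved for padding, hence len(word_index) + 1 rows.
embedding_matrix = rng.normal(size=(len(word_index) + 1, embedding_dim))

sentence = "it was a good movie".split()
indices = np.array([word_index[w] for w in sentence])   # [1 2 3 4 5]

# The "layer" is plain row indexing into the table.
vectors = embedding_matrix[indices]
print(vectors.shape)  # (5, 4)
```

With trainable=False the rows of this table simply stay fixed during training.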
Q2) How can we know the optimal number of neurons in the LSTM layer?
You have created an LSTM layer which takes a vector of size 300 as input and returns a vector of size 300. The output size and the number of stacked LSTMs are hyperparameters, which are tuned manually (usually using K-fold cross-validation).
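A rough sketch of how such a K-fold search could be organized is shown below. `kfold_indices` and the list `candidate_units` are hypothetical names introduced for this illustration; building and scoring the actual model is left out.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

candidate_units = [50, 100, 300]   # hypothetical LSTM sizes to compare
for units in candidate_units:
    for train_idx, val_idx in kfold_indices(n_samples=10, k=3):
        # Here one would build a model with `units` LSTM units, fit it on train_idx,
        # and record its score on val_idx; that part is omitted in this sketch.
        pass

folds = list(kfold_indices(10, 3))
print([len(val) for _, val in folds])  # [4, 3, 3]
```

The LSTM size with the best average validation score across the folds would then be kept.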
Q3) Can you please explain how a single record is processed by the LSTM?
A single record (one or more sentences) is converted into indices of the vocabulary, so for every sentence you have an array of indices.
A batch of these sentences is created and fed as input to the model.
The LSTM is unrolled by passing in the embedding of one index at a time as input at each timestep.
Finally, the output of the LSTM is forward propagated by a final dense layer to size 33. So it looks like each input is mapped to one of 33 classes in your case.
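The per-timestep processing can be traced with a small NumPy sketch of an LSTM forward pass. This is an assumption-laden illustration: weights are random, sizes are tiny, and the standard tanh/sigmoid gates are used rather than the activation="relu" from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(inputs, units):
    """Run one sequence of shape (timesteps, input_dim) through a randomly initialised LSTM."""
    timesteps, input_dim = inputs.shape
    # Weights for the four gates (input, forget, cell candidate, output), stacked side by side.
    W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
    U = rng.normal(scale=0.1, size=(units, 4 * units))
    b = np.zeros(4 * units)
    h = np.zeros(units)   # hidden state
    c = np.zeros(units)   # cell state
    outputs = []
    for t in range(timesteps):            # one embedded word per timestep
        z = inputs[t] @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)              # (timesteps, units), as with return_sequences=True

indices = np.array([1, 2, 3, 4, 5])                  # one sentence as vocabulary indices
embedding_matrix = rng.normal(size=(6, 8))           # toy table: 6 words, 8 dimensions
embedded = embedding_matrix[indices]                 # (5, 8)

hidden = lstm_forward(embedded, units=10)            # (5, 10)
flat = hidden.reshape(-1)                            # Flatten: (50,)
W_out = rng.normal(scale=0.1, size=(flat.size, 33))  # final Dense layer to 33 classes
scores = flat @ W_out
print(hidden.shape, flat.shape, scores.shape)  # (5, 10) (50,) (33,)
```

The hidden state h carries information from earlier words into later timesteps, which is what distinguishes this from applying a dense layer to each word independently.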
Simple example
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.utils import to_categorical
training_data = ["it was a good movie".split(), "it was a bad movie".split()]
training_target = [1, 0]
# Map every word to its own integer index; index 0 is left free for padding
words = sorted({word for s in training_data for word in s})
word_index = {word: i + 1 for i, word in enumerate(words)}
model = Sequential()
model.add(Embedding(len(word_index) + 1, 50, input_length=5))
# Keras 2 argument names (Keras 1 called them dropout_W and dropout_U)
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
# Convert each sentence to an array of word indices
x = np.array([[word_index[word] for word in s] for s in training_data])
y = to_categorical(training_target)
model.fit(x, y)

Keras fails to set dynamic shape of layer properly

I am using keras==2.0.8 with tensorflow==1.3.0 backend.
Here is the example which I am confused with:
from keras.layers import Input, Reshape, Conv2DTranspose
x = Input((5000,))
y = Reshape((25, 25, 8))(x)
y = Conv2DTranspose(10, 5, padding='same', strides=2)(y)
print(y)
It's just part of my model and after these lines I use y in some tensorflow operations, but code above prints node of shape (?, ?, ?, 10). I have no idea why TF cannot deduce height and width of resulting tensor statically. (I know that keras can, but I want TF node with proper shape)
If you intend to use these tensorflow operations in a keras model, you have to use them inside Lambda layers.
In the function you create for the lambda layer, you can use the given tensor normally. Unless you have a very specific reason for tensorflow to have this fixed size explicit, there won't be any problem. Is there any special need that demands you to have the tensorflow tensor with explicit shape?
In Keras, you can always use K.shape() in a keras tensor to get its shape. Many keras backend functions can take this shape (mostly with tensorflow) as input. If you can use the keras backend functions instead of pure tensorflow functions, your code may be portable to other backends later.
Example of function:
import keras.backend as K
def tensorflowPart(x):
    # do tensorflow operations with the tensor x
    shape = K.shape(x)  # use the shape of the tensor, as a tensor
    # more tensorflow operations
    result = x  # placeholder: replace with the actual computation
    return result
Use the lambda layer in your model:
y = Lambda(tensorflowPart)(y)
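If the concrete height and width are needed, they can be worked out by hand. The helper below is a hypothetical function written for this sketch; it assumes the usual output-size rule for a transposed convolution without output padding or dilation.

```python
def conv_transpose_output_length(input_length, kernel_size, stride, padding):
    """Spatial output size of a transposed convolution without output padding or dilation."""
    if padding == "same":
        return input_length * stride
    if padding == "valid":
        return input_length * stride + max(kernel_size - stride, 0)
    raise ValueError("padding must be 'same' or 'valid'")

# Values from the question: 25x25 input, kernel 5, strides 2, padding='same'.
side = conv_transpose_output_length(25, kernel_size=5, stride=2, padding="same")
print((None, side, side, 10))  # (None, 50, 50, 10)
```

So the tensor in the question is expected to have shape (?, 50, 50, 10) at run time, even though the static TensorFlow shape prints as (?, ?, ?, 10).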

How do I flip a Tensor in Keras?

For example: I have a tensor with shape (5,10) and I want back a tensor with shape (5,10), but the first element should now be the last element. So [1,2,3,4,5] becomes [5,4,3,2,1] and [[1,2,3,4,5],[2,3,4,5,6]] becomes [[2,3,4,5,6],[1,2,3,4,5]].
If it matters, I am using the tensorflow backend.
Using the Keras backend, there is the reverse function.
import keras.backend as K
flipped = K.reverse(x,axes=0)
For using it in a layer, you can create a Lambda layer:
from keras.layers import Lambda
layer = Lambda(lambda x: K.reverse(x, axes=0))  # output_shape can be omitted with the TensorFlow backend, which infers it
(For a Sequential model, use model.add(layer); for a functional API model, use output = layer(input).)
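The effect of reversing along axis 0 can be checked on the two examples from the question with NumPy, used here as a stand-in for the backend call. Keep in mind that inside a Keras model axis 0 is the batch axis, so a different axis may be wanted there.

```python
import numpy as np

x1 = np.array([1, 2, 3, 4, 5])
x2 = np.array([[1, 2, 3, 4, 5],
               [2, 3, 4, 5, 6]])

# Reversing along axis 0 flips the order of the outermost entries only.
print(np.flip(x1, axis=0))  # [5 4 3 2 1]
print(np.flip(x2, axis=0))
```

For the 2-D input the rows swap places while the values inside each row keep their order.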