What would be the output from tensorflow dense layer if we assign itself as input and output while making a neural network? - tensorflow

I have been going through the neural network implementation in OpenAI's code for Vanilla Policy Gradient (as a matter of fact, this part is used nearly everywhere). The code looks something like this:
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes) + [act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)
    return pi, logp, logp_pi
and this multi-layered perceptron network is defined as follows :
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(inputs=x, units=h, activation=activation)
    return tf.layers.dense(inputs=x, units=hidden_sizes[-1], activation=output_activation)
My question is: what does this mlp function return? I mean its structure or shape. Is it an N-dimensional tensor? If so, how can it be given as an input to tf.random.categorical? If not, and it just has the shape [hidden_layer2, output], then what happened to the other layers? According to the documentation for tf.random.categorical, it only takes a 2-D input. The complete code of OpenAI's VPG algorithm can be found here. The mlp is implemented here. I would be highly grateful if someone could tell me what this mlp_categorical_policy() is doing.
Note: The hidden size is [64, 64], the action dimension is 3
Thanks and cheers

Note that this is a discrete action space - there are action_space.n different possible actions at every step, and the agent chooses one.
To do this, the MLP returns the logits (unnormalized log-probabilities) of the different actions. This is specified in the code by + [act_dim], which appends the number of actions in the action_space as the size of the final MLP layer. Note that the last layer of an MLP is the output layer. The input layer is not specified in TensorFlow; it is inferred from the inputs.
tf.random.categorical takes the logits and samples a policy action pi from them, which is returned as a number.
mlp_categorical_policy also returns logp, the log probability of the action a (used to assign credit), and logp_pi, the log probability of the policy action pi.
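To make the three return values concrete, here is a minimal NumPy sketch of the same computation. It is not the OpenAI code: the helper names log_softmax and categorical_policy are made up for illustration, and sampling uses a NumPy generator in place of tf.random.categorical.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def categorical_policy(logits, a, rng):
    """Hypothetical NumPy stand-in for the tail of mlp_categorical_policy.

    logits: float array of shape [batch, act_dim]
    a:      int array of shape [batch], the actions that were actually taken
    """
    act_dim = logits.shape[1]
    logp_all = log_softmax(logits)                        # [batch, act_dim]
    probs = np.exp(logp_all)
    # One sampled action per row, the role played by tf.random.categorical + squeeze.
    pi = np.array([rng.choice(act_dim, p=p) for p in probs])
    one_hot = np.eye(act_dim)
    logp = (one_hot[a] * logp_all).sum(axis=1)            # log-prob of the given actions
    logp_pi = (one_hot[pi] * logp_all).sum(axis=1)        # log-prob of the sampled actions
    return pi, logp, logp_pi

rng = np.random.default_rng(0)
logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])    # batch of 2, act_dim = 3
a = np.array([0, 2])
pi, logp, logp_pi = categorical_policy(logits, a, rng)
print(pi.shape, logp.shape, logp_pi.shape)
```

All three results have one entry per row of the batch; the one-hot multiplication simply picks out the log-probability of a single action from each row of logp_all.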
It seems your question is more about the return from the mlp.
The mlp creates a series of fully connected layers in a loop. In each iteration of the loop, the mlp creates a new layer using the previous layer x as an input and assigns its output to overwrite x, with this line: x = tf.layers.dense(inputs=x, units=h, activation=activation).
So the output is not the same as the input; on each iteration x is overwritten with the value of the new layer. This is the same kind of coding trick as x = x + 1, which increments x by 1. This effectively chains the layers together.
The output of tf.layers.dense is a tensor of size [:,h] where : is the batch dimension (and can usually be ignored). The creation of the last layer happens outside the loop; it can be seen that the number of nodes in this layer is act_dim (so the shape is [:,3]). You can check the shape by doing this:
import tensorflow.compat.v1 as tf
import numpy as np
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)
obs = np.array([[1.0,2.0]])
logits = mlp(obs, [64, 64, 3], tf.nn.relu, None)
print(logits.shape)
result: TensorShape([1, 3])
Note that the observation in this case is [1.,2.], it is nested inside a batch of size 1.
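If you want to see the intermediate shapes as well, the same chaining can be sketched in plain NumPy. This is only an illustration under assumptions: dense here is a hypothetical stand-in for tf.layers.dense with random weights and a zero bias, and the shapes list is added purely for inspection.

```python
import numpy as np

def dense(x, units, activation, rng):
    # Hypothetical stand-in for tf.layers.dense: random weights, zero bias.
    w = rng.normal(scale=0.1, size=(x.shape[1], units))
    b = np.zeros(units)
    y = x @ w + b
    return activation(y) if activation is not None else y

def mlp(x, hidden_sizes, activation=np.tanh, output_activation=None, rng=None):
    rng = rng or np.random.default_rng(0)
    shapes = [x.shape]
    for h in hidden_sizes[:-1]:
        x = dense(x, h, activation, rng)    # x now refers to the new layer's output
        shapes.append(x.shape)
    x = dense(x, hidden_sizes[-1], output_activation, rng)
    shapes.append(x.shape)
    return x, shapes

obs = np.array([[1.0, 2.0]])                # batch of 1, two features
logits, shapes = mlp(obs, [64, 64, 3])
print(shapes)   # [(1, 2), (1, 64), (1, 64), (1, 3)]
```

Every intermediate result is 2-D ([batch, units]); only the last one is returned, which is why it can be passed straight to a function that expects a 2-D logits input.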

Related

I can't understand LSTM implementation in tensorflow 1

I have been looking at an implementation of LSTM layers in a neural network architecture. An LSTM layer has been defined in it as given below. I am having trouble understanding this code. I have listed my doubts after the code snippet.
code source:https://gist.github.com/awjuliani/66e8f477fc1ad000b1314809d8523455#file-a3c-py
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(RNN_SIZE,state_is_tuple=True)
c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
state_init = [c_init, h_init]
c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
state_in = (c_in, h_in)
rnn_in = tf.expand_dims(self.h3, [0])
step_size = tf.shape(inputs)[:1]
state_in = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)
lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
    lstm_cell, rnn_in, initial_state=state_in, sequence_length=step_size,
    time_major=False)
lstm_c, lstm_h = lstm_state
state_out = (lstm_c[:1, :], lstm_h[:1, :])
self.rnn_out = tf.reshape(lstm_outputs, [-1, RNN_SIZE])
Here are my doubts:
I understand we need to initialize a random Context and hidden vectors to pass to our first LSTM cell. But why do we initialize both c_init, h_init and then c_in, h_in? What purpose do they serve?
How are they different from each other? (same for state_in and state_init?)
Why do we use LSTMStateTuple?
def work(self, max_episode_length, gamma, sess, coord, saver, dep):
    ........
    rnn_state = self.local_AC.state_init
def train(self, rollout, sess, gamma, bootstrap_value):
    ......
    rnn_state = self.local_AC.state_init
    feed_dict = {self.local_AC.target_v: discounted_rewards,
                 self.local_AC.inputs: np.vstack(observations),
                 self.local_AC.actions: actions,
                 self.local_AC.advantages: advantages,
                 self.local_AC.state_in[0]: rnn_state[0],
                 self.local_AC.state_in[1]: rnn_state[1]}
At the beginning of work, and then before training a new batch, the network state is filled with zeros.
I understand we need to initialize a random Context and hidden vectors to pass to our first LSTM cell. But why do initialize both c_init, h_init, and then c_in, h_in. What purpose do they serve? How are they different from each other? (same for state_in and state_init?)
To start using an LSTM, one should initialise its cell state and hidden state, named c and h respectively. At the start of every new input sequence these states are considered 'empty' and should be initialised with zeros. So here we have
c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
state_in = (c_in, h_in)
state_in = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)
Why are there two variables, state_in and state_init? The first is just a pair of placeholders that will be filled with the second at the evaluation stage (i.e., session.run), because state_in doesn't contain any actual values. In other words, NumPy arrays are used during the training phase, and tf.placeholders are used during the phase when one defines the architecture of the network.
TL;DR
Why so? Well, tf1.x (was?) is quite a low-level system. It has the following entities:
tf.Session, aka a computational session: the thing that contains the computational graph(s) and allows the user to provide inputs to the graph(s) via session.run.
tf.Graph, which is a representation of a computational graph. Usually an engineer defines a graph using tf.placeholders and tf.Variables. One can connect them 'just like' math operations:
with tf.Session() as sess:
    a = tf.placeholder(tf.float32, (1,))
    b = tf.Variable(1.0, dtype=tf.float32)
    tf.global_variables_initializer()
    c = a * b
    # ...and so on
tf.placeholders are placeholders, not actual values, intended to be filled with actual values at the session.run stage. And tf.Variables hold, well, the actual weights of the neural network to be optimized. Why not plain NumPy arrays, but something else? It's because TensorFlow automatically adds each tensor and placeholder as an edge to the default computational graph (it's impossible to do the same with NumPy arrays); it also allows one to define an architecture once and then initialize/train it with different inputs, which is good.
So, to do a computation (forward/backward propagation, etc.), one has to set placeholders and variables to some values. To do so, in a simple example, we could do the following:
import tensorflow as tf
with tf.compat.v1.Session() as sess:
    a = tf.compat.v1.placeholder(tf.float32, shape=())
    b = tf.compat.v1.Variable(1.0, dtype=tf.float32)
    init = tf.compat.v1.global_variables_initializer()
    c = a + b
    sess.run(init)
    a_value = 2.0
    result = sess.run([c], feed_dict={a: a_value})
    print("value of [c]:", result)
(I use tf.compat.v1 instead of just tf here because I work in tf2 environment; you could omit it)
Note two things. First, I create an init operation, because in tf1.x it is not enough to create a variable like tf.Variable(1.0); the user has to kinda 'notify' the framework by creating and running an init operation.
Then I do a computation: I initialize an a_value variable and map it to the placeholder a in the sess.run method. Session.run requires a list of tensors to be calculated as its first argument, and a mapping from the placeholders needed to compute the target tensors to their actual values.
Back to your example: state_in is a placeholder and state_init contains values to be fed into this placeholder somewhere in the code.
It would look like this: sess.run(..., feed_dict={state_in: state_init, ...}).
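The split between "values" and "slots for values" can also be sketched without TensorFlow. The following NumPy toy is an assumption-laden illustration, not the code from the gist: lstm_step, RNN_SIZE and INPUT_SIZE are made-up names, and the gate layout is one common convention. state_init plays the role of the concrete zero arrays, and the state argument of lstm_step plays the role of state_in.

```python
import numpy as np

RNN_SIZE = 4      # hypothetical state size
INPUT_SIZE = 3    # hypothetical feature size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, state, W, b):
    """One LSTM step; state is a (c, h) pair, each of shape [1, RNN_SIZE]."""
    c, h = state
    z = np.concatenate([x, h], axis=1) @ W + b        # [1, 4 * RNN_SIZE]
    i, j, f, o = np.split(z, 4, axis=1)               # input, candidate, forget, output
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(j)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, (c_new, h_new)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(INPUT_SIZE + RNN_SIZE, 4 * RNN_SIZE))
b = np.zeros(4 * RNN_SIZE)

# Concrete zero arrays, created once: the role of state_init.
state_init = (np.zeros((1, RNN_SIZE), np.float32),
              np.zeros((1, RNN_SIZE), np.float32))

# Start from zeros, then feed each step's output state back in as the next input state.
rnn_state = state_init
for t in range(5):
    x_t = rng.normal(size=(1, INPUT_SIZE))
    out, rnn_state = lstm_step(x_t, rnn_state, W, b)

print(rnn_state[0].shape, rnn_state[1].shape)
```

In the TF1 code the same loop happens across session.run calls: the first call feeds the zeros, and later calls feed whatever state the previous call returned.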
Why do we use LSTMStateTuple?
Addressing the second part of the question: it looks like TensorFlow developers implemented it for some performance optimization. From the source code:
logging.warning(
    "%s: Using a concatenated state is slower and will soon be"
    "deprecated. Use state_is_tuple=True.", self)
and if state_is_tuple=True, state should be an LSTMStateTuple. But I'm not 100% sure about it - I don't remember how I used it. After all, LSTMStateTuple is just a collections.namedtuple with two named attributes, c and h.
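To see what "just a namedtuple" buys you, here is a tiny sketch with a hypothetical StateTuple that has the same two fields (it is not the TensorFlow class itself):

```python
import collections
import numpy as np

# Hypothetical stand-in with the same two fields, c and h.
StateTuple = collections.namedtuple("StateTuple", ("c", "h"))

state = StateTuple(c=np.zeros((1, 4)), h=np.ones((1, 4)))
c, h = state                        # unpacks like a plain tuple
print(state.c.shape, state[1].shape)
```

The fields can be read by name or by position, which is why both lstm_c, lstm_h = lstm_state and state_in[0] work in the code above.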

Getting keras LSTM layer to accept two inputs?

I'm working with padded sequences of maximum length 50. I have two types of sequence data:
1) A sequence, seq1, of integers (1-100) that correspond to event types (e.g. [3,6,3,1,45,45....3])
2) A sequence, seq2, of integers representing time, in minutes, from the last event in seq1. So the last element is zero, by definition. So for example [100, 96, 96, 45, 44, 12,... 0]. seq1 and seq2 are the same length, 50.
I'm trying to run the LSTM primarily on the event/seq1 data, but have the time/seq2 strongly influence the forget gate within the LSTM. The reason for this is I want the LSTM to tend to really penalize older events and be more likely to forget them. I was thinking about multiplying the forget weight by the inverse of the current value of the time/seq2 sequence. Or maybe (1/seq2_element + 1), to handle cases where it's zero minutes.
I see in the keras code (LSTMCell class) where the change would have to be:
f = self.recurrent_activation(x_f + K.dot(h_tm1_f,self.recurrent_kernel_f))
So I need to modify keras' LSTM code to accept multiple inputs. As an initial test, within the LSTMCell class, I changed the call function to look like this:
def call(self, inputs, states, training=None):
    time_input = inputs[1]
    inputs = inputs[0]
So that it can handle two inputs given as a list.
When I try running the model with the Functional API:
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
## Input 2: time vectors
auxiliary_input = Input(shape=(max_seq_length,1), dtype='float32', name='aux_input')
m = Masking(mask_value = 99999999.0)(auxiliary_input)
lstm_out = LSTM(32)(x, time_vector = m)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An arbitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input, auxiliary_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1, train_X2], [train_Y, train_Y], epochs=1, batch_size=200)
However, I get the following error:
An `initial_state` was passed that is not compatible with `cell.state_size`. Received `state_spec`=[InputSpec(shape=(None, 50, 1), ndim=3)]; however `cell.state_size` is (32, 32)
Any advice?
You can't pass a list of inputs to the default recurrent layers in Keras. The input_spec is fixed and the recurrent code is implemented for a single input tensor, as also pointed out in the documentation, i.e. it doesn't magically iterate over 2 inputs with the same timesteps and pass them to the cell. This is partly because of how the iterations are optimised and the assumptions made if the network is unrolled, etc.
If you want 2 inputs, you can pass constants (doc) to the cell, which will pass the tensor as is. This is mainly to implement attention models in the future. So 1 input will iterate over timesteps while the other will not. If you really want 2 inputs to be iterated like a zip() in Python, you will have to implement a custom layer.
I would like to throw in a couple of different ideas here. They don't require you to modify the Keras code.
After the embedding layer of the event types, stack the embeddings with the elapsed time. The Keras layer is keras.layers.Concatenate(axis=-1). Imagine this: a single event type is mapped to an n-dimensional vector by the embedding layer. You just add the elapsed time as one more dimension after the embedding so that it becomes an (n+1)-dimensional vector.
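The shape bookkeeping for this stacking can be sketched in NumPy. The sizes and the random embedding table below are assumptions for illustration only; in the real model the embedding is learned and the concatenation is done by the Keras layer.

```python
import numpy as np

batch, max_seq_length, n = 2, 50, 8      # hypothetical sizes
num_event_types = 101                    # ids 1..100, with 0 left for padding

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(num_event_types, n))
seq1 = rng.integers(1, 101, size=(batch, max_seq_length))                    # event ids
seq2 = rng.integers(0, 200, size=(batch, max_seq_length)).astype(np.float64)  # minutes

embedded = embedding_table[seq1]         # (batch, 50, n)
elapsed = seq2[..., np.newaxis]          # (batch, 50, 1)
lstm_input = np.concatenate([embedded, elapsed], axis=-1)
print(lstm_input.shape)                  # (2, 50, 9)
```

Each timestep now carries the event embedding plus the elapsed time as one extra feature, so a standard LSTM layer can consume it as a single input tensor.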
Another idea, sort of related to your problem/question and which may help here, is 1D convolution. The convolution can happen right after the concatenated embeddings. The intuition for applying convolution to event types and elapsed time is that it is actually a 1x1 convolution, so that you linearly combine the two together and the parameters are trained. Note that in terms of convolution, the dimensions of the vectors are called channels. Of course, you can also convolve more than 1 event at a step. Just try it. It may or may not help.
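A 1x1 convolution over a sequence amounts to applying the same linear map to the channel vector at every timestep. A small NumPy sketch of that equivalence, with made-up sizes and random weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 50, 9))       # (batch, timesteps, channels), hypothetical sizes
kernel = rng.normal(size=(9, 16))     # kernel of width 1: (in_channels, out_channels)
bias = np.zeros(16)

# The same linear map is applied independently at every timestep.
y = np.einsum("btc,co->bto", x, kernel) + bias
print(y.shape)                        # (2, 50, 16)
```

With a kernel wider than 1, each output step would instead mix the channels of several neighbouring events.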

autocorrelation of the input in tensorflow/keras

I have a 1D input signal. I want to compute autocorrelation as the part of the neural net for further use inside the network.
I need to perform convolution of input with input itself.
To perform convolution in a Keras custom layer / TensorFlow, we need the following parameters:
data shape is "[batch, in_height, in_width, in_channels]",
filter shape is "[filter_height, filter_width, in_channels, out_channels]"
The filter shape has no batch dimension, but in my case the filter needs to be the input itself.
TensorFlow now has an auto_correlation function. It should be in release 1.6. If you build from source you can use it right now (see e.g. the github code).
Here is a possible solution.
By self convolution, I understood a regular convolution where the filter is exactly the same as the input (if it's not that, sorry for my misunderstanding).
We need a custom function for that, and a Lambda layer.
At first I used padding = 'same' which brings outputs with the same length as the inputs. I'm not sure about what output length you want exactly, but if you want more, you should add padding yourself before doing the convolution. (In the example with length 7, for a complete convolution from one end to another, this manual padding would include 6 zeros before and 6 zeros after the input length, and use padding = 'valid'. Find the backend functions here)
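The manual-padding arithmetic can be checked in NumPy on a single 1D signal. This sketch uses np.correlate, which, like the TensorFlow convolution ops, slides the kernel without flipping it; the signal values are arbitrary.

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0, 2.0])   # length 7
n = len(x)

# Pad n - 1 = 6 zeros on each side, then slide the signal over itself
# with no further padding ('valid').
padded = np.pad(x, n - 1, mode="constant")
valid = np.correlate(padded, x, mode="valid")         # length 19 - 7 + 1 = 13

# NumPy's 'full' mode does the same padding internally.
full = np.correlate(x, x, mode="full")
print(valid.shape, np.allclose(valid, full))
```

The middle element (index n - 1) is the zero-lag term, i.e. the sum of squares of the signal, and the 13 outputs cover lags -6 to +6.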
Working example - Input (5,7,2)
from keras.models import Model
from keras.layers import *
import keras.backend as K
import numpy as np
batch_size = 5
length = 7
channels = 2
channels_batch = batch_size*channels
def selfConv1D(x):
    #this function unfortunately needs to know previously the shapes
    #mainly because of the for loop, for other lines, there are workarounds
    #but these workarounds are not necessary since we'll have this limitation anyway
    #original x: (batch_size, length, channels)
    #bring channels to the batch position:
    x = K.permute_dimensions(x,[2,0,1]) #(channels, batch_size, length)
    #suppose channels are just individual samples (since we don't mix channels)
    x = K.reshape(x,(channels_batch,length,1))
    #here, we get a copy of x reshaped to match filter shapes:
    filters = K.permute_dimensions(x,[1,2,0]) #(length, 1, channels_batch)
    #now, in the lack of a suitable available conv function, we make a loop
    allChannels = []
    for i in range (channels_batch):
        f = filters[:,:,i:i+1]
        allChannels.append(
            K.conv1d(
                x[i:i+1],
                f,
                padding='same',
                data_format='channels_last'))
        #although channels_last is my default config, I found this bug:
        #https://github.com/fchollet/keras/issues/8183
        #convolution output: (1, length, 1)
    #concatenate all results as samples
    x = K.concatenate(allChannels, axis=0) #(channels_batch,length,1)
    #restore the original form (passing channels to the end)
    x = K.reshape(x,(channels,batch_size,length))
    return K.permute_dimensions(x,[1,2,0]) #(batch_size, length, channels)
#input data for the test:
x = np.array(range(70)).reshape((5,7,2))
#little model that just performs the convolution
inp= Input((7,2))
out = Lambda(selfConv1D)(inp)
model = Model(inp,out)
#checking results
p = model.predict(x)
for i in range(5):
    print("x",x[i])
    print("p",p[i])
You can just use tf.nn.conv3d by treating the "batch size" as "depth":
# treat the batch size as depth.
data = tf.reshape(input_data, [1, batch, in_height, in_width, in_channels])
kernel = [filter_depth, filter_height, filter_width, in_channels, out_channels]
out = tf.nn.conv3d(data, kernel, [1,1,1,1,1], padding='SAME')

Does K.function method of Keras with Tensorflow backend work with network layers?

I recently have started using Keras to build neural networks. I built a simple CNN to classify MNIST dataset. Before learning the model I used K.set_image_dim_ordering('th') in order to plot a convolutional layer weights. Right now I am trying to visualize convolutional layer output with K.function method, but I keep getting error.
Here is what I want to do for now:
input_image = X_train[2:3,:,:,:]
output_layer = model.layers[1].output
input_layer = model.layers[0].input
output_fn = K.function(input_layer, output_layer)
output_image = output_fn.predict(input_image)
print(output_image.shape)
output_image = np.rollaxis(np.rollaxis(output_image, 3, 1), 3, 1)
print(output_image.shape)
fig = plt.figure()
for i in range(32):
    ax = fig.add_subplot(4,8,i+1)
    im = ax.imshow(output_image[0,:,:,i], cmap="Greys")
    plt.xticks(np.array([]))
    plt.yticks(np.array([]))
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([1, 0.1, 0.05 ,0.8])
fig.colorbar(im, cax = cbar_ax)
plt.tight_layout()
plt.show()
And this is what I get:
File "/home/kinshiryuu/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1621, in function
return Function(inputs, outputs, updates=updates)
File "/home/kinshiryuu/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1569, in __init__
raise TypeError('`inputs` to a TensorFlow backend function '
TypeError: `inputs` to a TensorFlow backend function should be a list or tuple.
You should make the following changes:
output_fn = K.function([input_layer], [output_layer])
output_image = output_fn([input_image])
K.function takes the input and output tensors as lists so that you can create a function from many inputs to many outputs. In your case it is one input to one output, but you need to pass them as lists nonetheless.
Next, K.function returns a tensor function and not a model object on which you can call predict(). The correct way to use it is just to call it as a function.
I think you can also use K.function to get gradients.
self.action_gradients = K.gradients(Q_values, actions)
self.get_action_gradients = K.function(inputs=[*self.model.input, K.learning_phase()], outputs=self.action_gradients)
which basically runs the graph to obtain the Q-value to calculate the gradient of the Q-value w.r.t. action vector in DDPG. Source code here (lines 64 to 70): https://github.com/nyck33/autonomous_quadcopter/blob/master/criticSolution.py#L65
In light of the accepted answer and this usage here (originally from project 5 autonomous quadcopter in the Udacity Deep Learning nanodegree), a question remains in my mind, ie. is K.function() something that can be used fairly flexibly to run the graph and to designate as outputs of K.function() for example outputs of a particular layer, gradients or even weights themselves?
Lines 64 to 67 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py
It is being used as a custom training function for the actor network in DDPG:
#caller
self.actor_local.train_fn([states, action_gradients, 1])
#called
self.train_fn = K.function(inputs=[self.model.input, action_gradients, K.learning_phase()], \
outputs=[], updates=updates_op)
outputs is given a value of an empty list because we merely want to train the actor network with the action_gradients from the critic network.

Weights of Seq2Seq Models

I went through the code and I'm afraid I don't grasp an important point.
I can't seem to find the weight matrices of the model for the encoder and decoder, nor where they are updated. I found the target_weights but they seem to be reinitialized at every get_batch() call, so I don't really understand what they stand for either.
My actual goal is to concatenate two hidden states of two source encoders for one decoder by applying a linear transformation with a weight matrix that I'll have to train along with the model (I'm building a many-to-one model), but I have no idea where to start because of my problem mentioned above.
This might help you start. There are a couple of models implemented in tensorflow.python.ops.seq2seq.py (with/without buckets, attention, etc.) but take a look at the definition for embedding_attention_seq2seq (which is the one called in their example model seq2seq_model.py that you seem to be referencing):
def embedding_attention_seq2seq(encoder_inputs, decoder_inputs, cell,
                                num_encoder_symbols, num_decoder_symbols,
                                num_heads=1, output_projection=None,
                                feed_previous=False, dtype=dtypes.float32,
                                scope=None, initial_state_attention=False):
    with variable_scope.variable_scope(scope or "embedding_attention_seq2seq"):
        # Encoder.
        encoder_cell = rnn_cell.EmbeddingWrapper(cell, num_encoder_symbols)
        encoder_outputs, encoder_state = rnn.rnn(
            encoder_cell, encoder_inputs, dtype=dtype)
        # First calculate a concatenation of encoder outputs to put attention on.
        top_states = [array_ops.reshape(e, [-1, 1, cell.output_size])
                      for e in encoder_outputs]
        attention_states = array_ops.concat(1, top_states)
        ....
You can see where it picks out the top layer of encoder outputs as top_states before handing them off to the decoder.
So you could implement a similar function with two encoders and concatenate those states before handing off to the decoder.
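As a rough picture of the "two encoders, one decoder" idea, here is a NumPy sketch of just the combining step. Everything in it is an assumption for illustration: the sizes, the random arrays standing in for the two encoders' final states, and W and b, which in a real model would be trainable variables updated together with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, enc_size, dec_size = 4, 6, 5            # hypothetical sizes

h_enc1 = rng.normal(size=(batch, enc_size))    # final state of the first encoder
h_enc2 = rng.normal(size=(batch, enc_size))    # final state of the second encoder

# Linear transformation from the concatenated states to the decoder state size.
W = rng.normal(scale=0.1, size=(2 * enc_size, dec_size))
b = np.zeros(dec_size)

combined = np.concatenate([h_enc1, h_enc2], axis=1)   # (batch, 2 * enc_size)
decoder_init = combined @ W + b                       # (batch, dec_size)
print(combined.shape, decoder_init.shape)
```

The result would then be used as the decoder's initial state in place of the single encoder_state.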
The value created in the get_batch function is only used for the first iteration. Even though the weights are passed every time into the function, their value gets updated as a global variable in the Seq2Seq model class in the init function.
with tf.name_scope('Optimizer'):
    # Gradients and SGD update operation for training the model.
    params = tf.trainable_variables()
    if not forward_only:
        self.gradient_norms = []
        self.updates = []
        opt = tf.train.GradientDescentOptimizer(self.learning_rate)
        for b in range(len(buckets)):
            gradients = tf.gradients(self.losses[b], params)
            clipped_gradients, norm = tf.clip_by_global_norm(gradients,
                                                             max_gradient_norm)
            self.gradient_norms.append(norm)
            self.updates.append(opt.apply_gradients(
                zip(clipped_gradients, params), global_step=self.global_step))
self.saver = tf.train.Saver(tf.global_variables())
The weights are fed separately as a placeholder because they are computed in the get_batch function, which sets the weights to zero for the PAD inputs.
# Batch decoder inputs are re-indexed decoder_inputs, we create weights.
for length_idx in range(decoder_size):
    batch_decoder_inputs.append(
        np.array([decoder_inputs[batch_idx][length_idx]
                  for batch_idx in range(self.batch_size)], dtype=np.int32))
    # Create target_weights to be 0 for targets that are padding.
    batch_weight = np.ones(self.batch_size, dtype=np.float32)
    for batch_idx in range(self.batch_size):
        # We set weight to 0 if the corresponding target is a PAD symbol.
        # The corresponding target is decoder_input shifted by 1 forward.
        if length_idx < decoder_size - 1:
            target = decoder_inputs[batch_idx][length_idx + 1]
        if length_idx == decoder_size - 1 or target == data_utils.PAD_ID:
            batch_weight[batch_idx] = 0.0
    batch_weights.append(batch_weight)
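So target_weights are a loss mask, not model parameters. A vectorised NumPy sketch of the same rule is shown below; the toy decoder_inputs, the PAD_ID value and the constant per-step loss are assumptions, and the arrays are kept batch-major here, whereas get_batch builds a time-major list.

```python
import numpy as np

PAD_ID = 0    # assumed padding id for this toy example
# decoder_inputs[batch_idx][length_idx]
decoder_inputs = np.array([[1, 5, 7, 2, 0, 0],
                           [1, 9, 2, 0, 0, 0]])
batch_size, decoder_size = decoder_inputs.shape

# The target at step t is the decoder input at step t + 1; the last step has no target.
targets = np.concatenate(
    [decoder_inputs[:, 1:], np.full((batch_size, 1), PAD_ID)], axis=1)
target_weights = (targets != PAD_ID).astype(np.float32)
target_weights[:, -1] = 0.0

# Pretend every step has loss 1.0; PAD positions drop out of the average.
per_step_loss = np.ones((batch_size, decoder_size), dtype=np.float32)
weighted_loss = (per_step_loss * target_weights).sum() / target_weights.sum()
print(target_weights)
print(weighted_loss)
```

Because the mask depends only on where the padding falls in each batch, it has to be rebuilt for every batch, which is why it looks like it is "reinitialized" on each get_batch() call.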