Dimension mismatch while building LSTM RNN in TensorFlow

I am trying to build multilayer, multiclass, multilabel LSTM in Tensorflow. I have been trying to bend this tutorial to my data.
However, I am getting an error that says I have dimension mismatch when building RNN.
ValueError: Dimensions must be equal, but are 1000 and 923 for 'rnn/while/rnn/multi_rnn_cell/cell_0/lstm_cell/MatMul_1' (op: 'MatMul') with input shapes: [?,1000], [923,2000].
I cannot pinpoint which variable is incorrect in building architecture:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.0, shape=shape)
    return tf.Variable(initial)

def lstm(x, weight, bias, n_steps, n_classes):
    cell = rnn_cell.LSTMCell(cfg.n_hidden_cells_in_layer, state_is_tuple=True)
    multi_layer_cell = tf.nn.rnn_cell.MultiRNNCell([cell] * 2)
    # FIXME : ERROR binding x to LSTM as it is
    output, state = tf.nn.dynamic_rnn(multi_layer_cell, x, dtype=tf.float32)
    # FIXME : ERROR
    output_flattened = tf.reshape(output, [-1, cfg.n_hidden_cells_in_layer])
    output_logits = tf.add(tf.matmul(output_flattened, weight), bias)
    output_all = tf.nn.sigmoid(output_logits)
    output_reshaped = tf.reshape(output_all, [-1, n_steps, n_classes])
    # ??? switch batch size with sequence size. ???
    # then gather last time step values
    output_last = tf.gather(tf.transpose(output_reshaped, [1, 0, 2]), n_steps - 1)
    return output_last, output_all
These are my placeholders, loss function and all that jazz:
x_test, y_test = load_multiple_vector_files(test_filepaths)
x_valid, y_valid = load_multiple_vector_files(valid_filepaths)
n_input, n_steps, n_classes = get_input_target_lengths(check_print=False)
# FIXME n_input should be the problem
x = tf.placeholder("float", [None, n_steps, n_input])
y = tf.placeholder("float", [None, n_classes])
y_steps = tf.placeholder("float", [None, n_classes])
weight = weight_variable([cfg.n_hidden_layers, n_classes])
bias = bias_variable([n_classes])
y_last, y_all = lstm(x, weight, bias, n_steps, n_classes)
#all_steps_cost=tf.reduce_mean(-tf.reduce_mean((y_steps * tf.log(y_all))+(1 - y_steps) * tf.log(1 - y_all),reduction_indices=1))
all_steps_cost = -tf.reduce_mean((y_steps * tf.log(y_all)) + (1 - y_steps) * tf.log(1 - y_all))
last_step_cost = -tf.reduce_mean((y * tf.log(y_last)) + ((1 - y) * tf.log(1 - y_last)))
loss_function = (cfg.alpha * all_steps_cost) + ((1 - cfg.alpha) * last_step_cost)
optimizer = tf.train.AdamOptimizer(learning_rate=cfg.learning_rate).minimize(loss_function)
I am pretty sure it is my x placeholder that is causing the problem, so that the layers and their matrix dimensions do not match. It is also hard to tell what the constant used in the linked example actually stands for.
Can anyone help me out here? :)
UPDATE:
I have made an "educated guess" about the mismatching dimensions.
One dimension is 2 * hidden_width: a hidden layer receiving its new input plus its own previous recurrent state. The mismatching dimension, however, is input_width + hidden_width, as if the recurrence sized for the hidden layer were being applied to the input layer.
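The shapes in the error message can be checked against how a standard LSTM cell sizes its fused gate kernel: the layer input is concatenated with the previous hidden state and mapped to four gates, giving a matrix of shape [input_width + hidden_width, 4 * hidden_width]. The sketch below is arithmetic only. The widths 423 and 500 are assumptions inferred from the error message (923 = 423 + 500 and 2000 = 4 * 500); they are not stated in the post.

```python
def lstm_kernel_shape(input_width, hidden_width):
    """Shape of the fused gate kernel of a standard LSTM cell."""
    return (input_width + hidden_width, 4 * hidden_width)

# Assumed widths, inferred from the error message (not given in the post).
input_width = 423
hidden_width = 500

# The first layer sees the raw input x.
layer0 = lstm_kernel_shape(input_width, hidden_width)
# The second layer sees the output of the first layer.
layer1 = lstm_kernel_shape(hidden_width, hidden_width)

print("layer 0 kernel:", layer0)
print("layer 1 kernel:", layer1)
```

Under these assumptions, a [?, 1000] operand meeting a [923, 2000] kernel would be consistent with a second-layer input being multiplied by a kernel sized for the first layer.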

I figured out that I was setting the weight variable incorrectly: I used the constant n_hidden_layers (the number of hidden layers) instead of n_hidden_cells_in_layer (the number of cells in each layer).
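A minimal sketch of why the output projection weight has to be sized by the number of cells per layer. The helper matmul_shape and all numeric sizes are hypothetical, chosen for illustration; only the variable names come from the post.

```python
def matmul_shape(a_shape, b_shape):
    """Result shape of a 2-D matrix product; raise if the inner dims differ."""
    if a_shape[1] != b_shape[0]:
        raise ValueError(
            "Dimensions must be equal, but are %d and %d" % (a_shape[1], b_shape[0])
        )
    return (a_shape[0], b_shape[1])

# Hypothetical sizes for illustration only.
n_hidden_layers = 2
n_hidden_cells_in_layer = 500
n_classes = 10
rows = 32 * 7  # batch_size * n_steps after reshaping to [-1, n_hidden_cells_in_layer]

flattened = (rows, n_hidden_cells_in_layer)

# Correct: the projection weight has one row per hidden cell.
good = matmul_shape(flattened, (n_hidden_cells_in_layer, n_classes))

# Incorrect: a weight sized by the number of layers cannot match.
try:
    matmul_shape(flattened, (n_hidden_layers, n_classes))
    bad_raises = False
except ValueError:
    bad_raises = True

print(good, bad_raises)
```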

Related

Difference equation in LSTM network on Tensorflow

I'd like to use a LSTM network on Tensorflow to implement a difference equation. I searched on internet but I didn't find anything about this topic.
The equation is:
formula
in which b=[1, 2, 1] and a=[1, -1.6641, 0.8387].
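Since the formula itself did not survive, here is a hedged sketch of the recursion that these coefficients describe in the usual direct form, a[0]*w(k) = b[0]*x(k) + b[1]*x(k-1) + b[2]*x(k-2) - a[1]*w(k-1) - a[2]*w(k-2), which is the convention scipy.signal.lfilter uses. The gain K = 0.0436 is taken from the code further down in the post; the function name difference_equation is an assumption made for this example.

```python
def difference_equation(x, b, a, gain=1.0):
    """Direct-form recursion:
    a[0]*w[k] = sum_i b[i]*x[k-i] - sum_{j>=1} a[j]*w[k-j], output gain*w."""
    w = []
    for k in range(len(x)):
        acc = 0.0
        for i, b_i in enumerate(b):
            if k - i >= 0:
                acc += b_i * x[k - i]
        for j in range(1, len(a)):
            if k - j >= 0:
                acc -= a[j] * w[k - j]
        w.append(acc / a[0])
    return [gain * value for value in w]

b = [1.0, 2.0, 1.0]
a = [1.0, -1.6641, 0.8387]
K = 0.0436

impulse = difference_equation([1.0, 0.0, 0.0], b, a)
step = difference_equation([1.0] * 500, b, a, gain=K)

print(impulse)
print(step[-1])
```

The step response settles near K * sum(b) / sum(a), which is close to 1, so the output at instant k depends on both past inputs and past outputs but stays on the same scale as the input.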
My aim is to use a neural network to find the relationship between input and output. Because computing the output at instant k requires the previous inputs and outputs as well, my idea is to implement an LSTM network (many-to-one structure).
If we suppose we have an input vector of 500 samples and use a window size of 5, the input to the LSTM network has shape (500,5,1) while the output has shape (500,1,1).
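The windowing described above can be sketched in plain Python. This is an illustration under assumptions: the helper name sliding_windows and the zero padding on the left are choices made for this example, and the samples are stand-in values.

```python
def sliding_windows(samples, window_size):
    """One window per sample, zero-padded on the left. Each value is wrapped
    in a 1-element list, so the result has shape
    (len(samples), window_size, 1)."""
    padded = [0.0] * (window_size - 1) + list(samples)
    return [
        [[value] for value in padded[k:k + window_size]]
        for k in range(len(samples))
    ]

samples = [float(k) for k in range(1, 501)]  # stand-in for 500 input samples
windows = sliding_windows(samples, 5)
shape = (len(windows), len(windows[0]), len(windows[0][0]))

print(shape)
print(windows[0])  # first window: four padding zeros, then the first sample
print(windows[4])  # first window that contains no padding
```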
The input and output of the first iteration are:
[0; x(k-4), x(k-3), x(k-2), x(k-1), x(k); 1] -> [1; y(k); 1]
in the second iteration:
[0; x(k-3), x(k-2), x(k-1), x(k), x(k+1); 1] -> [1; y(k+1); 1]
So I used an LSTM network with stateful set to True to allow the network to remember past states, but it doesn't converge.
The idea seems correct to me, but I cannot see where I am going wrong. Could someone help me find the problem? The code is pasted below; the network is built with TensorFlow.
# Difference equation
K = 0.0436
b = np.array([1,2,1])
a = np.array([1, -1.6641, 0.8387])
x = np.random.uniform(0, 1, 100)
y = K*(signal.lfilter(b,a,x))
# Generate Dataset
X_train = np.random.uniform(0, 1, 100)
y_train = K*(signal.lfilter(b,a,X_train))
X_val = np.ones(100)
y_val = K*(signal.lfilter(b,a,X_val))
X_test = np.random.uniform(0.5, 0.8, 100)
y_test = K*(signal.lfilter(b,a,X_test))
def get_x_split(data, windows_size):
    """ Return sliding window dataset. """
    x_temp = np.zeros([1,windows_size-1])
    x = np.array([])
    for i in range(0,len(data)):
        x_temp = np.append(x_temp[-windows_size+1:], data[i]).T
        x = np.append(x, x_temp, axis=0)
    x = np.reshape(x, (int(len(x)/windows_size), windows_size))
    return x
windows_size = 10
X_train = get_x_split(X_train, windows_size)
X_val = get_x_split(X_val, windows_size)
X_test = get_x_split(X_test, windows_size)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_val = np.reshape(X_val, (X_val.shape[0], X_val.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
# Model Definition
activation_function = 'tanh'
def build_model():
    input_layer = Input(shape=(X_train.shape[1],1), batch_size=1)
    HL_1 = LSTM(1, activation=activation_function, return_sequences=True, stateful = True)(input_layer)
    HL_2 = LSTM(1, activation=activation_function, return_sequences=False, stateful = True)(HL_1)
    output_layer = Dense(1, activation='relu',name='Output')(HL_2)
    model = Model(inputs=input_layer, outputs=output_layer)
    return model

model = build_model()
model.compile(optimizer=RMSprop(),
              loss={'Output': 'mse'}, #mse
              metrics={'Output': tf.keras.metrics.RootMeanSquaredError()})
# Training
history = model.fit(x=X_train,
                    y=y_train,
                    batch_size=1,
                    validation_data=(X_val, y_val),
                    epochs=5000,
                    verbose=1,
                    shuffle=False)
# Test
y_pred = model.predict(X_test)
pred_samples = 400
plt.figure(dpi=1200)
plt.plot(y_test[300:pred_samples,3,0], label='true', linewidth=0.8, alpha=0.5)
plt.plot(y_pred[300:pred_samples,3,0], label='pred')
plt.legend()
plt.grid()
plt.title("Test")
plt.show()

Input to attention in TensorFlow 2.0 tutorial on "Neural machine translation with attention"

I have a question about the example "Neural machine translation with attention".
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # output shape == (batch_size, vocab)
        x = self.fc(output)
        return x, state, attention_weights
Why is the attention weight calculated from encoder_output and encoder_hidden, and why is the context vector concatenated with decoder_embedding? In my opinion, the attention weight should be calculated from encoder_output and each hidden state of decoder_output, and the context vector should be concatenated with decoder_output.
Maybe I have not completely understood seq2seq with attention?
The attention is called in every step of the decoder. The inputs to the decoder step are:
the previously decoded token, x (or the ground-truth token during training)
the previous hidden state of the decoder, hidden
the hidden states of the encoder, enc_output
As you correctly say, the attention takes a single decoder hidden state and all encoder hidden states as input, which gives you the context vector.
context_vector, attention_weights = self.attention(hidden, enc_output)
The context vector gets concatenated with the embedding only after calling the attention mechanism when it is used as the input of the GRU cell.
x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
output, state = self.gru(x)
The variable state (which for this single-step GRU holds the same values as output) will become hidden in the next step of the decoder.
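One decoder step as described above can be sketched in plain Python. This is an illustration under assumptions: the tutorial uses Bahdanau (additive) scoring with learned weights, whereas this sketch swaps in a plain dot product to stay short; the function attend and all numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden, enc_output):
    """Return (context_vector, attention_weights) for one decoder step.

    hidden:     previous decoder state, a list of length hidden_size
    enc_output: encoder states, a list of max_length such lists
    Scoring is a dot product here (an assumption made for brevity).
    """
    scores = [sum(h * e for h, e in zip(hidden, enc_state))
              for enc_state in enc_output]
    weights = softmax(scores)
    context = [
        sum(w * enc_state[d] for w, enc_state in zip(weights, enc_output))
        for d in range(len(hidden))
    ]
    return context, weights

enc_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states
hidden = [2.0, 0.0]                                # toy previous decoder state
embedding = [0.5, -0.5, 0.25]                      # toy embedding of token x

context, weights = attend(hidden, enc_output)
# Concatenate context and embedding, as tf.concat([...], axis=-1) does.
gru_input = context + embedding

print(weights)
print(len(gru_input))
```

The attention runs first, using only the previous decoder state and the encoder states; the concatenated vector is what the recurrent cell then consumes.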

TensorFlow: Why do parameters not update when GradientDescentOptimizer train step is run?

When I run the following code, it prints a constant loss at every training step; I also tried printing the parameters, which also do not change.
I can't seem to figure out why train_step, which uses a GradientDescentOptimizer, doesn't change the weights in W_fc1, b_fc1, W_fc2, and b_fc2.
I'm a beginner in machine learning, so I might be missing something obvious.
(An answer to a similar question was that weights should not be initialized to zero, but the weights here are initialized with a truncated normal, so that can't be the problem.)
import tensorflow as tf
import numpy as np
import csv
import random
with open('wine_data.csv', 'rb') as csvfile:
    input_arr = list(csv.reader(csvfile, delimiter=','))
for i in range(len(input_arr)):
    input_arr[i][0] = int(input_arr[i][0]) - 1 # 0 index for one hot
    for j in range(1, len(input_arr[i])):
        input_arr[i][j] = float(input_arr[i][j])
random.shuffle(input_arr)
training_data = np.array(input_arr[:2*len(input_arr)/3]) # train on first two thirds of data
testing_data = np.array(input_arr[2*len(input_arr)/3:]) # test on last third of data
x_train = training_data[0:, 1:]
y_train = training_data[0:, 0]
x_test = testing_data[0:, 1:]
y_test = testing_data[0:, 0]
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
x = tf.placeholder(tf.float32, shape=[None, 13], name='x')
y_ = tf.placeholder(tf.float32, shape=[None], name='y_')
y_one_hot = tf.one_hot(tf.cast(y_, tf.int32), 3) # actual y values
W_fc1 = weight_variable([13, 128])
b_fc1 = bias_variable([128])
fc1 = tf.matmul(x, W_fc1)+b_fc1
W_fc2 = weight_variable([128, 3])
b_fc2 = bias_variable([3])
y = tf.nn.softmax(tf.matmul(fc1, W_fc2)+b_fc2)
cross_entropy = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_one_hot, logits=y))
train_step = tf.train.GradientDescentOptimizer(1e-17).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_one_hot,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for _ in range(1000):
    train_step.run(feed_dict={x: x_train, y_: y_train})
    if _%10 == 0:
        loss = cross_entropy.eval(feed_dict={x: x_train, y_: y_train})
        print('step', _, 'loss', loss)
Thanks in advance.
From the official tensorflow documentation:
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
Remove the softmax on y before feeding it into tf.nn.softmax_cross_entropy_with_logits.
Also set your learning rate to something much higher than 1e-17 (like 3e-4).
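A small plain-Python sketch of why applying softmax twice produces incorrect results. The helper names and the toy logits are assumptions made for this example; it mimics the arithmetic only, not the TensorFlow op itself.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, label):
    """Cross-entropy that applies softmax internally, like the TF op."""
    return -math.log(softmax(logits)[label])

logits = [10.0, 0.0, 0.0]  # a confident, correct prediction for class 0
label = 0

# Correct usage: pass the raw logits.
correct_loss = cross_entropy_from_logits(logits, label)

# Incorrect usage: pass probabilities, so softmax is applied twice.
double_loss = cross_entropy_from_logits(softmax(logits), label)

# With 3 classes, inputs squeezed into [0, 1] can never push the loss
# below -log(e / (e + 2)), however good the prediction is.
floor = -math.log(math.e / (math.e + 2.0))

print(correct_loss, double_loss, floor)
```

With the second softmax in place, even a perfect prediction leaves the loss above roughly 0.55 for three classes, and the gradients are flattened accordingly.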

Why deep NN can't approximate simple ln(x) function?

I have created an ANN with two ReLU hidden layers plus a linear output layer, and I am trying to approximate the simple ln(x) function, but I can't get it to work well. I am confused because ln(x) on the range x in [0.0, 1.0] should be approximable without problems (I am using a learning rate of 0.01 and basic gradient descent optimization).
import tensorflow as tf
import numpy as np
def GetTargetResult(x):
    curY = np.log(x)
    return curY

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
# Parameters
learning_rate = 0.01
training_epochs = 10000
batch_size = 50
display_step = 500
# Network Parameters
n_hidden_1 = 50 # 1st layer number of features
n_hidden_2 = 10 # 2nd layer number of features
n_input = 1
# Store layers weight & bias
weights = {
'h1': tf.Variable(tf.random_uniform([n_input, n_hidden_1])),
'h2': tf.Variable(tf.random_uniform([n_hidden_1, n_hidden_2])),
'out': tf.Variable(tf.random_uniform([n_hidden_2, 1]))
}
biases = {
'b1': tf.Variable(tf.random_uniform([n_hidden_1])),
'b2': tf.Variable(tf.random_uniform([n_hidden_2])),
'out': tf.Variable(tf.random_uniform([1]))
}
x_data = tf.placeholder(tf.float32, [None, 1])
y_data = tf.placeholder(tf.float32, [None, 1])
# Construct model
pred = multilayer_perceptron(x_data, weights, biases)
# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(pred - y_data))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train = optimizer.minimize(loss)
# Before starting, initialize the variables. We will 'run' this first.
init = tf.initialize_all_variables ()
# Launch the graph.
sess = tf.Session()
sess.run(init)
for step in range(training_epochs):
    x_in = np.random.rand(batch_size, 1).astype(np.float32)
    y_in = GetTargetResult(x_in)
    sess.run(train, feed_dict = {x_data: x_in, y_data: y_in})
    if(step % display_step == 0):
        curX = np.random.rand(1, 1).astype(np.float32)
        curY = GetTargetResult(curX)
        curPrediction = sess.run(pred, feed_dict={x_data: curX})
        curLoss = sess.run(loss, feed_dict={x_data: curX, y_data: curY})
        print("For x = {0} and target y = {1} prediction was y = {2} and squared loss was = {3}".format(curX, curY,curPrediction, curLoss))
With the configuration above, the NN just learns to guess y = -1.00. I have tried different learning rates, a couple of optimizers, and different configurations with no success: learning does not converge in any case. I did something similar with the logarithm in another deep learning framework in the past without problems. Could this be a TF-specific issue? What am I doing wrong?
What your network has to predict
Source: WolframAlpha
What your architecture is
ReLU(ReLU(x * W_1 + b_1) * W_2 + b_2)*W_out + b_out
Thoughts
My first thought was that ReLU is the problem. However, you don't apply relu to the output, so that should not cause the problem.
Changing the initialization (from uniform to normal) and the Optimizer (from SGD to ADAM) seems to fix the problem:
#!/usr/bin/env python
import tensorflow as tf
import numpy as np
def get_target_result(x):
    return np.log(x)

def multilayer_perceptron(x, weights, biases):
    """Create model."""
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
# Parameters
learning_rate = 0.01
training_epochs = 10**6
batch_size = 500
display_step = 500
# Network Parameters
n_hidden_1 = 50 # 1st layer number of features
n_hidden_2 = 10 # 2nd layer number of features
n_input = 1
# Store layers weight & bias
weights = {
'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1)),
'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1)),
'out': tf.Variable(tf.truncated_normal([n_hidden_2, 1], stddev=0.1))
}
biases = {
'b1': tf.Variable(tf.constant(0.1, shape=[n_hidden_1])),
'b2': tf.Variable(tf.constant(0.1, shape=[n_hidden_2])),
'out': tf.Variable(tf.constant(0.1, shape=[1]))
}
x_data = tf.placeholder(tf.float32, [None, 1])
y_data = tf.placeholder(tf.float32, [None, 1])
# Construct model
pred = multilayer_perceptron(x_data, weights, biases)
# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(pred - y_data))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# train = optimizer.minimize(loss)
train = tf.train.AdamOptimizer(1e-4).minimize(loss)
# Before starting, initialize the variables. We will 'run' this first.
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
sess.run(init)
for step in range(training_epochs):
    x_in = np.random.rand(batch_size, 1).astype(np.float32)
    y_in = get_target_result(x_in)
    sess.run(train, feed_dict={x_data: x_in, y_data: y_in})
    if(step % display_step == 0):
        curX = np.random.rand(1, 1).astype(np.float32)
        curY = get_target_result(curX)
        curPrediction = sess.run(pred, feed_dict={x_data: curX})
        curLoss = sess.run(loss, feed_dict={x_data: curX, y_data: curY})
        print(("For x = {0} and target y = {1} prediction was y = {2} and "
               "squared loss was = {3}").format(curX, curY,
                                                curPrediction, curLoss))
Training this for 1 minute gave me:
For x = [[ 0.19118255]] and target y = [[-1.65452647]] prediction was y = [[-1.65021849]] and squared loss was = 1.85587377928e-05
For x = [[ 0.17362741]] and target y = [[-1.75084364]] prediction was y = [[-1.74087048]] and squared loss was = 9.94640868157e-05
For x = [[ 0.60853624]] and target y = [[-0.4966988]] prediction was y = [[-0.49964082]] and squared loss was = 8.65551464813e-06
For x = [[ 0.33864763]] and target y = [[-1.08279514]] prediction was y = [[-1.08586168]] and squared loss was = 9.4036658993e-06
For x = [[ 0.79126364]] and target y = [[-0.23412406]] prediction was y = [[-0.24541236]] and squared loss was = 0.000127425722894
For x = [[ 0.09994856]] and target y = [[-2.30309963]] prediction was y = [[-2.29796076]] and squared loss was = 2.6408026315e-05
For x = [[ 0.31053194]] and target y = [[-1.16946852]] prediction was y = [[-1.17038012]] and squared loss was = 8.31002580526e-07
For x = [[ 0.0512077]] and target y = [[-2.97186542]] prediction was y = [[-2.96796203]] and squared loss was = 1.52364455062e-05
For x = [[ 0.120253]] and target y = [[-2.11815739]] prediction was y = [[-2.12729549]] and squared loss was = 8.35050013848e-05
So the answer might be that your optimizer is not good / the optimization problem starts at a bad point. See
Xavier Glorot, Yoshua Bengio: Understanding the difficulty of training deep feedforward neural networks
Visualizing Optimization Algos
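One hedged way to see why the original uniform initialization is a bad starting point: tf.random_uniform draws from [0, 1) by default, so every weight and bias starts non-negative, and with inputs in [0, 1) no ReLU ever clips. At initialization the whole network is therefore an affine function of x. The plain-Python sketch below reproduces only that arithmetic; the helper names are assumptions, and training can of course move the weights away from this state.

```python
import random

def relu(v):
    return max(0.0, v)

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer on plain lists: outputs[j] = act(b[j] + sum_i x[i]*w[i][j])."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j] + sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        outputs.append(activation(total) if activation else total)
    return outputs

def uniform_matrix(rng, rows, cols):
    """Matrix with entries drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
w1, b1 = uniform_matrix(rng, 1, 50), [rng.random() for _ in range(50)]
w2, b2 = uniform_matrix(rng, 50, 10), [rng.random() for _ in range(10)]
w3, b3 = uniform_matrix(rng, 10, 1), [rng.random() for _ in range(1)]

def network(x):
    h1 = dense([x], w1, b1, relu)
    h2 = dense(h1, w2, b2, relu)
    return dense(h2, w3, b3)[0]

# For an affine function the midpoint value equals the average of the endpoints.
midpoint_gap = abs(network(0.5) - 0.5 * (network(0.0) + network(1.0)))
print(midpoint_gap)
```

A gap of essentially zero means the freshly initialized network can only draw a straight line through ln(x), which fits with the observation that a different initialization helps.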
The following image is from Alec Radford's nice GIFs. It does not contain ADAM, but you get a feeling for how much better one can do than SGD:
Two ideas for how this might be improved:
try dropout
try not to use x values close to 0. I would rather sample values in [0.01, 1].
However, my experience with regression problems is quite limited.
First of all, your input data is in the range [0, 1), which is not a good input for a neural network. Subtract the mean from x after computing y to normalize it (ideally also divide by the standard deviation).
However, in your particular case it was not enough to make it work.
I played with it and found two ways to make it work (both require data normalization as described above):
Either completely remove the second layer
or
Make the number of neurons in the second layer 50.
My guess would be that 10 neurons do not have sufficient representation power to pass enough information to the last layer (obviously, a perfectly smart NN would learn to ignore the second layer in this case passing the answer in one of the neurons, but the theoretical possibility doesn't mean that gradient descent will learn to do so).
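The normalization step described in this answer can be sketched as follows. The targets are computed from the raw inputs first, and only then are the inputs shifted and scaled. Sampling from [0.01, 1) rather than [0, 1) is borrowed from the other answer's suggestion and is an assumption of this sketch, as are the variable names.

```python
import math
import random
import statistics

rng = random.Random(0)

# Raw inputs; values near 0 are avoided so that log(x) stays moderate.
x_raw = [rng.uniform(0.01, 1.0) for _ in range(1000)]

# Targets are computed from the raw inputs, before any normalization.
y = [math.log(v) for v in x_raw]

# Normalize the inputs to zero mean and unit standard deviation.
mean = statistics.mean(x_raw)
std = statistics.pstdev(x_raw)
x_norm = [(v - mean) / std for v in x_raw]

print(statistics.mean(x_norm), statistics.pstdev(x_norm))
```

The same mean and std would have to be reused when normalizing inputs at prediction time.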
I have not looked at the code, but this is the theory. If you use an activation function like tanh, then for small weights the activation function operates in its linear region, and for large weights its output is close to either -1 or +1. If you are in the linear region across all layers, you cannot approximate complex functions (you have a sandwich of linear layers, so the best you can do is a linear approximation), but if you have bigger weights, the nonlinearity allows you to approximate a wide range of functions. There is no free lunch: the weights need to be at the right values to avoid over-fitting and under-fitting. Techniques that constrain the weights for this purpose are called regularization.
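The two regimes of tanh mentioned above can be checked numerically. The weights and the input below are arbitrary values chosen for illustration.

```python
import math

x = 0.5
small_w = 0.01
large_w = 50.0

# Small weight: tanh behaves almost exactly like the identity.
linear_output = math.tanh(small_w * x)

# Two stacked small-weight layers are still approximately linear in x.
stacked_output = math.tanh(small_w * math.tanh(small_w * x))

# Large weight: tanh saturates close to +1.
saturated_output = math.tanh(large_w * x)

print(linear_output, small_w * x)
print(stacked_output, small_w * small_w * x)
print(saturated_output)
```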

Error in Dimension for LSTM in tflearn

I am training PTB dataset for predicting characters (i.e. character-level LSTM).
The dimension for training batches is [len(dataset) x vocabulary_size]. Here, vocabulary_size = 27 (26+1[for unk tokens and spaces or fullstops.]).
This is the code for converting to one_hot for both batches input(arrX) and labels(arrY).
arrX = np.zeros((len(train_data), vocabulary_size), dtype=np.float32)
arrY = np.zeros((len(train_data)-1, vocabulary_size), dtype=np.float32)
for i, x in enumerate(train_data):
    arrX[i, x] = 1  # one-hot encode character i
arrY = arrX[1:, :]  # labels are the inputs shifted by one character
I create placeholders for the input (X) and labels (Y) in the graph to pass them to the tflearn LSTM. The following is the code for the graph and session.
batch_size = 256
with tf.Graph().as_default():
    X = tf.placeholder(shape=(None, vocabulary_size), dtype=tf.float32)
    Y = tf.placeholder(shape=(None, vocabulary_size), dtype=tf.float32)
    print (utils.get_incoming_shape(tf.concat(0, Y)))
    print (utils.get_incoming_shape(X))
    net = tflearn.lstm(X, 512, return_seq=True)
    print (utils.get_incoming_shape(net))
    net = tflearn.dropout(net, 0.5)
    print (utils.get_incoming_shape(net))
    net = tflearn.lstm(net, 256)
    net = tflearn.fully_connected(net, vocabulary_size, activation='softmax')
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(net, Y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)
    init = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init)
        offset=0
        avg_cost = 0
        total_batch = (train_length-1) / 256
        print ("No. of batches:", '%d' %total_batch)
        for i in range(total_batch) :
            batch_xs, batch_ys = trainX[offset : batch_size + offset], trainY[offset : batch_size + offset]
            sess.run(optimizer, feed_dict={X: batch_xs, Y: batch_ys})
            cost = sess.run(loss, feed_dict={X: batch_xs, Y: batch_ys})
            avg_cost += cost/total_batch
            if i % 20 == 0:
                print("Step:", '%03d' % i, "Loss:", str(cost))
            offset += batch_size
So I get the following error:
assert ndim >= 3, "Input dim should be at least 3."
AssertionError: Input dim should be at least 3.
How can I resolve this error? Is there an alternative solution?
Should I write a separate LSTM definition?
I'm not used to this kind of dataset, but have you tried using tflearn.input_data(shape) with the tflearn.embedding layer? If you use the embedding, I suppose you won't have to reshape your data into 3 dimensions.
The lstm layer takes a 3-D tensor of shape [samples, timesteps, input dim] as input, so you can reshape your input data to 3-D. In your problem the shape of trainX is [len(dataset), vocabulary_size]. Using trainX = trainX.reshape(trainX.shape + (1,)) changes the shape to [len(dataset), vocabulary_size, 1]. This data can be passed to lstm with a simple change to the input placeholder X: add one more dimension with X = tf.placeholder(shape=(None, vocabulary_size, 1), dtype=tf.float32).
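To make the shape change concrete, here is a plain-Python sketch of what adding a trailing axis does to 2-D one-hot data. It mirrors the effect of trainX.reshape(trainX.shape + (1,)) on nested lists; the helpers add_feature_axis and shape_of, and the toy data, are assumptions made for this example.

```python
def add_feature_axis(rows):
    """Turn [samples][features] into [samples][features][1]."""
    return [[[value] for value in row] for row in rows]

def shape_of(nested):
    """Shape of a regular nested list, read from its first elements."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

vocabulary_size = 27
char_ids = [3, 0, 26, 7]  # toy character indices

# One-hot rows, shape (len(char_ids), vocabulary_size).
train_x = [[1.0 if j == c else 0.0 for j in range(vocabulary_size)]
           for c in char_ids]

train_x_3d = add_feature_axis(train_x)

print(shape_of(train_x))
print(shape_of(train_x_3d))
```

After this change each one-hot position is treated as a timestep carrying one feature, which satisfies the 3-D requirement of the lstm layer.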