Understanding TensorBoard (weight) histograms - tensorflow

It is really straightforward to see and understand the scalar values in TensorBoard. However, it's not clear how to understand histogram graphs.
For example, these are the histograms of my network weights.
(After fixing a bug thanks to sunside)
What is the best way to interpret these? Layer 1 weights look mostly flat, what does this mean?
I added the network construction code here.
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 6, 10, 1])
tf.summary.image('input', x_image, 4)
# First layer of weights
with tf.name_scope("layer1"):
    W1 = tf.get_variable("W1", shape=[input_size, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.matmul(X, W1)
    layer1_act = tf.nn.tanh(layer1)
    tf.summary.histogram("weights", W1)
    tf.summary.histogram("layer", layer1)
    tf.summary.histogram("activations", layer1_act)
# Second layer of weights
with tf.name_scope("layer2"):
    W2 = tf.get_variable("W2", shape=[hidden_layer_neurons, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer2 = tf.matmul(layer1_act, W2)
    layer2_act = tf.nn.tanh(layer2)
    tf.summary.histogram("weights", W2)
    tf.summary.histogram("layer", layer2)
    tf.summary.histogram("activations", layer2_act)
# Third layer of weights
with tf.name_scope("layer3"):
    W3 = tf.get_variable("W3", shape=[hidden_layer_neurons, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer3 = tf.matmul(layer2_act, W3)
    layer3_act = tf.nn.tanh(layer3)
    tf.summary.histogram("weights", W3)
    tf.summary.histogram("layer", layer3)
    tf.summary.histogram("activations", layer3_act)
# Fourth layer of weights
with tf.name_scope("layer4"):
    W4 = tf.get_variable("W4", shape=[hidden_layer_neurons, output_size],
                         initializer=tf.contrib.layers.xavier_initializer())
    Qpred = tf.nn.softmax(tf.matmul(layer3_act, W4)) # Bug fixed: Qpred = tf.nn.softmax(tf.matmul(layer3, W4))
    tf.summary.histogram("weights", W4)
    tf.summary.histogram("Qpred", Qpred)
# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(tf.float32, [None, output_size], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")
# Loss function
# Sum (Ai*logp(yi|xi))
log_lik = -Y * tf.log(Qpred)
loss = tf.reduce_mean(tf.reduce_sum(log_lik * advantages, axis=1))
tf.summary.scalar("Q", tf.reduce_mean(Qpred))
tf.summary.scalar("Y", tf.reduce_mean(Y))
tf.summary.scalar("log_likelihood", tf.reduce_mean(log_lik))
tf.summary.scalar("loss", loss)
# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

It appears that the network hasn't learned anything in layers one to three. The last layer does change, which means that either something is wrong with the gradients (if you're tampering with them manually), you're constraining learning to the last layer by optimizing only its weights, or the last layer really 'eats up' all the error. It could also be that only biases are learned. The network does appear to learn something, but it might not be using its full potential. More context would be needed here, but playing around with the learning rate (e.g. using a smaller one) might be worth a shot.
In general, histograms display how often each value occurs relative to the other values. Simply speaking, if the possible values are in a range of 0..9 and you see a spike of height 10 at the value 0, this means that 10 inputs assume the value 0; in contrast, if the histogram shows a plateau of 1 for all values of 0..9, it means that for 10 inputs, each possible value 0..9 occurs exactly once.
You can also use histograms to visualize probability distributions when you normalize all histogram values by their total sum; if you do that, you'll intuitively obtain the likelihood with which a certain value (on the x axis) will appear (compared to other inputs).
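To make the two cases above concrete, here is a minimal sketch in plain Python. The helper names `histogram` and `normalize` are made up for this illustration and are not part of TensorBoard or TensorFlow.

```python
from collections import Counter

def histogram(values):
    """Count how many times each value occurs."""
    return Counter(values)

def normalize(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

spike = histogram([0] * 10)      # ten inputs, all equal to 0
plateau = histogram(range(10))   # ten inputs, each of 0..9 exactly once

print(spike[0])                  # count for the value 0 in the "spike" case
print(plateau[0], plateau[9])    # every value has the same count in the "plateau" case
print(normalize(plateau)[3])     # relative frequency of the value 3
```

Dividing by the total is what turns the counts into something that can be read as a probability for each value.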
Now for layer1/weights, the plateau means that:
most of the weights are in the range of -0.15 to 0.15
it is (mostly) equally likely for a weight to have any of these values, i.e. they are (almost) uniformly distributed
Said differently, almost the same number of weights have the values -0.15, 0.0, 0.15 and everything in between. There are some weights having slightly smaller or higher values.
So in short, this simply looks like the weights have been initialized using a uniform distribution with zero mean and value range -0.15..0.15 ... give or take. If you do indeed use uniform initialization, then this is typical when the network has not been trained yet.
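As a rough cross-check, the range of a Glorot/Xavier uniform initializer can be computed by hand, assuming the usual formula limit = sqrt(6 / (fan_in + fan_out)). The input size of 60 follows from the reshape to 6x10 in the question; the hidden size of 200 is purely hypothetical, since the question doesn't state it.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the uniform range used by Glorot/Xavier uniform init."""
    return math.sqrt(6.0 / (fan_in + fan_out))

input_size = 60               # 6 * 10, from the reshape in the question
hidden_layer_neurons = 200    # hypothetical; not given in the question

limit = glorot_uniform_limit(input_size, hidden_layer_neurons)
print(f"weights drawn uniformly from [-{limit:.3f}, {limit:.3f}]")
```

With layer sizes in this ballpark the range comes out near the -0.15..0.15 plateau seen in the plot, which fits the reading that layer 1 still looks like its initialization.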
In comparison, layer1/activations forms a bell curve (gaussian)-like shape: The values are centered around a specific value, in this case 0, but they may also be greater or smaller than that (equally likely so, since it's symmetric). Most values appear close around the mean of 0, but values do range from -0.8 to 0.8.
I assume that the layer1/activations is taken as the distribution over all layer outputs in a batch. You can see that the values do change over time.
The layer 4 histogram doesn't tell me anything specific. From the shape, it just shows that weight values around -0.1, 0.05 and 0.25 tend to occur with a higher probability; a reason could be that different parts of each neuron there actually pick up the same information and are basically redundant. This can mean that you could use a smaller network, or that your network has the potential to learn more distinguishing features in order to prevent overfitting. These are just assumptions, though.
Also, as already stated in the comments below, do add bias units. By leaving them out, you are forcefully constraining your network to a possibly invalid solution.

Here I will explain the plot indirectly by giving a minimal example. The following code produces a simple histogram plot in TensorBoard.
from datetime import datetime
import tensorflow as tf
filename = datetime.now().strftime("%Y%m%d-%H%M%S")
fw = tf.summary.create_file_writer(f'logs/fit/{filename}')
with fw.as_default():
    for i in range(10):
        # 2x2 tensor of uniform random values in [0, 1000)
        t = tf.random.uniform((2, 2), maxval=1000)
        tf.summary.histogram(
            "train/hist",
            t,
            step=i
        )
        print(t)
We see that generating a 2x2 matrix with a maximum of 1000 produces values in the range 0-1000. To show what this tensor might look like, here is the log for a few of them.
tf.Tensor(
[[398.65747 939.9828 ]
[942.4269 59.790222]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[869.5309 980.9699 ]
[149.97845 454.524 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[967.5063 100.77594 ]
[ 47.620544 482.77008 ]], shape=(2, 2), dtype=float32)
We logged to TensorBoard 10 times. To the right of the plot, a timeline indicates the timesteps. The depth of the histogram indicates how recent the values are: the lighter slices at the front are newer, and the darker slices at the back are older.
Values are gathered into buckets, which are drawn as those triangle-like shapes. The x-axis indicates the range of values in which each bucket lies.
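The bucketing idea can be sketched with NumPy on the first tensor from the log above. The four equal-width buckets over 0..1000 are a hypothetical choice for illustration only; TensorBoard picks its own bucket boundaries.

```python
import numpy as np

# First tensor from the log above, flattened
values = np.array([398.65747, 939.9828, 942.4269, 59.790222])

# Four equal-width buckets over 0..1000 (a hypothetical choice for illustration)
counts, edges = np.histogram(values, bins=4, range=(0, 1000))

# Print each bucket's value range together with how many values fell into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:6.1f}, {right:6.1f}): {count}")
```

Each logged step contributes one such set of counts, and TensorBoard stacks those per-step slices along the depth axis.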

Related

Does Keras masking impact weight updates and loss calculations?

I'm working with time series, and understand that keras.layers.Masking and keras.layers.Embedding are useful to create a mask value in the network which indicates timesteps to 'skip'. The mask value is propagated throughout the network to be used by any layers that support it.
The Keras documentation doesn't specify any further impacts of the mask value. My expectation is that the mask would be applied through all functions in model training and evaluation, but I don't see any evidence in support of this.
Does the mask value impact back-propagation?
Does the mask value impact the loss function or the metrics?
Would it be wise or foolish to use the sample_weight parameter in model.fit() to tell Keras to 'ignore' the masked timesteps in the loss function?
I've performed some experiments to answer these questions.
Here's my sample code:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
# Fix the random seed for repeatable results
np.random.seed(5)
tf.random.set_seed(5)
x = np.array([[[3, 0], [1, 4], [3, 2], [4, 0], [4, 5]],
[[1, 2], [3, 1], [1, 3], [5, 1], [3, 5]]], dtype='float64')
# Choose some values to be masked out
mask = np.array([[False, False, True, True, True],
[ True, True, False, False, True]]) # True:keep. False:ignore
samples, timesteps, features_in = x.shape
features_out = 1
y_true = np.random.rand(samples, timesteps, features_out)
# y_true[~mask] = 1e6 # TEST MODIFICATION
# Apply the mask to x
mask_value = 0 # Set to any value
x[~mask] = [mask_value] * features_in
input_tensor = keras.Input(shape=(timesteps, features_in))
this_layer = input_tensor
this_layer = keras.layers.Masking(mask_value=mask_value)(this_layer)
this_layer = keras.layers.Dense(10)(this_layer)
this_layer = keras.layers.Dense(features_out)(this_layer)
model = keras.Model(input_tensor, this_layer)
model.compile(loss='mae', optimizer='adam')
model.fit(x=x, y=y_true, epochs=100, verbose=0)
y_pred = model.predict(x)
print("y_pred = ")
print(y_pred)
print("model weights = ")
print(model.get_weights()[1])
print(f"{'model.evaluate':>14s} = {model.evaluate(x, y_true, verbose=0):.5f}")
# See if the loss computed by model.evaluate() is equal to the masked loss
error = y_true - y_pred
masked_loss = np.abs(error[mask]).mean()
unmasked_loss = np.abs(error).mean()
print(f"{'masked loss':>14s} = {masked_loss:.5f}")
print(f"{'unmasked loss':>14s} = {unmasked_loss:.5f}")
Which outputs
y_pred =
[[[-0.28896046]
[-0.28896046]
[ 0.1546848 ]
[-1.1596009 ]
[ 1.5819632 ]]
[[ 0.59000516]
[-0.39362794]
[-0.28896046]
[-0.28896046]
[ 1.7996234 ]]]
model weights =
[-0.06686568 0.06484845 -0.06918766 0.06470951 0.06396528 0.06470013
0.06247645 -0.06492618 -0.06262784 -0.06445726]
model.evaluate = 0.60170
masked loss = 1.00283
unmasked loss = 0.90808
mask and loss calculation
Surprisingly, the 'mae' (mean absolute error) loss calculation does NOT exclude the masked timesteps from the calculation. Instead, it assumes that these timesteps have zero loss - a perfect prediction. Therefore, every masked timestep actually reduces the calculated loss!
To explain in more detail: the above sample code input x has 10 timesteps. 4 of them are removed by the mask, so 6 valid timesteps remain. The 'mean absolute error' loss calculation sums the losses for the 6 valid timesteps, then divides by 10 instead of dividing by 6. This looks like a bug to me.
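The arithmetic can be checked directly against the numbers printed above. This is only a sketch that re-derives the reported value from the masked loss and the mask; it does not call Keras.

```python
import numpy as np

# Same mask as in the sample code (True: keep, False: ignore)
mask = np.array([[False, False, True, True, True],
                 [True, True, False, False, True]])

masked_loss = 1.00283   # mean over the valid timesteps, from the output above

n_valid = int(mask.sum())   # number of timesteps kept by the mask
n_total = mask.size         # number of timesteps overall

# Sum of errors over the valid timesteps, divided by ALL timesteps
loss_over_all = masked_loss * n_valid / n_total
print(f"{loss_over_all:.5f}")
```

The result agrees with the 0.60170 reported by model.evaluate, which is consistent with the sum being divided by 10 rather than by 6.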
output values are masked
Output values of masked timesteps do not impact the model training or evaluation (as it should be).
This can be easily tested by setting:
y_true[~mask] = 1e6
The model weights, predictions and losses remain exactly the same.
input values are masked
Input values of masked timesteps do not impact the model training or evaluation (as it should be).
Similarly, I can change mask_value from 0 to any other number, and the resulting model weights, predictions, and losses remain exactly the same.
In summary:
Q1: Effectively yes - the mask impacts the loss function, which is used through backpropagation to update the weights.
Q2: Yes, but the mask impacts the loss in an unexpected way.
Q3: Initially foolish - the mask should already be applied to the loss calculation. However, perhaps sample_weights could be valuable to correct the unexpected method of the loss calculation...
Note that I'm using Tensorflow 2.7.0.
I have been struggling through a related issue, namely applying a mask to a multi-output model where some samples are missing labels for different outputs. Here, I construct features, labels, and sample_weights from a dataset; labels and sample_weights are dictionaries with equivalent keys. The weights are 0 or 1 for each sample, indicating whether it should contribute to the calculation of the relevant loss.
I had hoped that sample_weights would contribute to the loss as they do when I pass the metric equivalents of the losses via weighted_metrics in model.compile.
I've found that sample_weight does not seem to address this problem. I can tell from the training metrics that the task_loss values are different from task_metric values when sample weights are used.
I've given up on this and decided to go ahead and use masking. The masked loss values are low in your case (and in mine) because TensorFlow treats the model output at masked timesteps as perfect. I hope this means it sees no gradient for these points, so parameters aren't tuned in response.

How to map an array of values for y_true to a single value in order to compare to y_pred in a Tensorflow loss function (Tensorflow/Tensorflow Quantum)

I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
    data_qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    ...
    return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]
model_circuit, model_readout = circuit()
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(), dtype=tf.string),
# The PQC layer returns the expected value of the readout gate, range [-1,1].
tfq.layers.PQC(model_circuit, model_readout),
])
# compile model
model.compile(
loss = loss_mse,
optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
metrics=[])
In loss_mse (Mean Square Error), I receive a (32, 4) tensor for y_pred. One row could look like
[-0.2, 0.33, 0.6, 0.3]
This would have to be first mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Now, a table lookup needs to happen, which tells if this combination is 0 or 1. Finally, the regular (y_true-y_pred)^2 can be performed by that row, followed by a np.sum on all rows. I tried to implement this:
def get_label(measurement):
    if measurement == [0,0,0,0]: return 0
    ...
    elif measurement == [1,1,1,1]: return 0
    else: return -1
def py_call(y_true, y_pred):
    # cast tensor to numpy
    y_pred_np = np.asarray(y_pred)
    loss = np.zeros((len(y_pred))) # could be a single variable with += within the loop
    # evaluate all 32 samples
    for pred in range(len(y_pred_np)):
        # map, binarize and lookup
        y_labelled = get_label([0 if y<0 else 1 for y in y_pred_np[pred]])
        # regular loss comparison
        loss[pred] = (y_labelled - y_true[pred])**2
    # reduce
    loss = np.sum(loss)/len(y_true)
    return loss
#tf.function
def loss_mse(y_true, y_pred):
    external_list = []
    loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
    return loss
However, the system appears to still expect a (32,4) tensor. I would have thought I could simply provide a single loss values (float). My question: how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
So it looks like there are a couple of things going on here. To answer your question
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function ?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. These functions apply a reduction operation across a given tensor axis, allowing you to convert a tensor of shape (32, 4) to a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
#tf.function
def my_loss(y_true, y_pred):
    # y_true is shape (32, 4)
    # y_pred is shape (32, 4)
    # Scale from [-1, 1] to [0, 1]
    y_true += 1
    y_true /= 2
    y_pred += 1
    y_pred /= 2
    # These are now both (32,) with the reduction of taking the mean applied along
    # the second axis.
    reduced_true = tf.reduce_mean(y_true, axis=1)
    reduced_pred = tf.reduce_mean(y_pred, axis=1)
    # Now a scalar loss.
    loss = tf.reduce_mean((reduced_true - reduced_pred) ** 2)
    return loss
Now the above isn't exactly what you want, since it's not super clear to me at least what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
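The effect of the axis argument on shapes can be tried out without TensorFlow. The sketch below uses NumPy's mean as a stand-in for tf.reduce_mean, with random values in [-1, 1] as a placeholder for the PQC expectations.

```python
import numpy as np

rng = np.random.default_rng(0)
y_pred = rng.uniform(-1.0, 1.0, size=(32, 4))   # stand-in for PQC expectations

per_sample = y_pred.mean(axis=1)   # reduce over the 4 readouts: one value per sample
per_qubit = y_pred.mean(axis=0)    # reduce over the batch: one value per readout operator
scalar = y_pred.mean()             # reduce over everything: a single number

print(per_sample.shape, per_qubit.shape, scalar.shape)
```

A Keras loss function is expected to end in either the per-sample shape or the scalar, so the reduction over the readout axis has to happen inside the loss.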
Another thing I will also mention is that if you want JUST the sum of these Pauli Operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])] and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                   cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_operator)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatible as a readout operation in the PQC layer. Lastly, I would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
Which give a pretty good description of how the input and output signatures of the functions look as well as the shapes you can expect from them.

Increase dimension of RNN LSTM cell in Keras

I want to increase the number of recurrent weights in an RNN or LSTM cell.
If you look at the code below, you will see that the LSTM cell's input shape is (2,1), which means 2 timesteps and 1 feature.
%tensorflow_version 2.x
import tensorflow as tf
m = tf.keras.models.Sequential()
lstm = tf.keras.layers.LSTM(1, use_bias=False)
input = tf.keras.Input(shape=(2,1))
m.add(input)
m.add(lstm)
lstm.get_weights()
The output is
[array([[ 0.878217 , 0.89324415, 0.404307 , -1.0542995 ]], dtype=float32),
array([[-0.24181306, -0.341401 , 0.65207034, 0.63227856]], dtype=float32)]
4 weights for each feature, and 4 weights for previous outputs
Now if I change Input shape like this
input = tf.keras.Input(shape=(2,1))
then the output of get_weights function will be like this:
[array([[-0.9725287 , -0.90078545, 0.97881985, -0.9623983 ],
[-0.9644511 , 0.90705967, 0.05965471, 0.32613564]], dtype=float32),
array([[-0.24867296, -0.22346373, -0.6410606 , 0.69084513]], dtype=float32)]
Now my question is: how do I increase the number of weights in the second array, which keeps its (1,4) shape?
The idea is that I want the RNN or LSTM to take not only the previous output (moment t-1) but also earlier values (moments t-2, t-3, t-4).
Is there a way to do this in Keras with the TF backend?
I can't understand the change, I think you had a typo in your question, but:
Length - Time steps:
The number of time steps will never change the number of weights. The layer is "recurrent", meaning it will "loop" the time steps. It's not supposed to have different weights for each step.
The whole purpose of the layer is to apply the same operations over and over and over for each time step.
Input features:
Input features are the last dimension of the input. They define one dimension of the weights.
Units = Output features:
Output features, also the last dimension of the output, are another dimension of the weights.
Two types of kernels
The LSTM layers have two groups of kernels:
What they call simply kernels - with shape=(input_dim, self.units * 4)
What they call recurrent kernels - with shape=(self.units, self.units * 4)
The first group acts on the input data, they have shape considering the input features and the output features.
The second group acts on inner states and have shapes considering only the output features (units).
From the source code:
self.kernel = self.add_weight(shape=(input_dim, self.units * 4),
name='kernel',
initializer=self.kernel_initializer,
regularizer=self.kernel_regularizer,
constraint=self.kernel_constraint)
self.recurrent_kernel = self.add_weight(
shape=(self.units, self.units * 4),
name='recurrent_kernel',
initializer=self.recurrent_initializer,
regularizer=self.recurrent_regularizer,
constraint=self.recurrent_constraint)
The last array in the list:
The last array in the list of weights are the 4 recurrent kernels with shape (1, 1) grouped into one.
So:
You can increase the kernels with more input features. Transform Input((anything, 1)) into Input((anything, more)) for instance.
You can increase the kernels and the recurrent_kernels (and biases, when considered) with bigger output features. Transform LSTM(1, ...) into LSTM(more, ...)
Weights are independent of the length. It's even possible to have Input((None, 1)), meaning a variable length.
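The shape rules above can be written down as a small helper. The function name `lstm_weight_shapes` is made up for this illustration; it only restates the shapes from the source code quoted above and does not build a Keras layer.

```python
def lstm_weight_shapes(input_dim, units, use_bias=False):
    """Shapes of the arrays returned by get_weights() for a Keras LSTM layer."""
    shapes = [(input_dim, units * 4),   # kernel: acts on the input features
              (units, units * 4)]       # recurrent kernel: acts on the inner state
    if use_bias:
        shapes.append((units * 4,))     # one bias per gate and unit
    return shapes

print(lstm_weight_shapes(input_dim=1, units=1))   # first example in the question
print(lstm_weight_shapes(input_dim=2, units=1))   # two input features
print(lstm_weight_shapes(input_dim=1, units=3))   # more units
```

Note that the number of time steps does not appear as an argument at all: only the feature count and the number of units change the weights.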
Using more than just the last step
This should be automatic. LSTM layers are designed to have memory. The memory is an inner state that participates in all time steps. There are gates (the kernels) that decide how a new step will participate in this memory. Since all steps participate in the same memory, LSTM layer theoretically considers "all" time steps from the beginning.
So, you shouldn't really worry about this.
But if you do want this, there are maybe two ways. Don't know if they will bring any improvement, though.
One is to concatenate shifted inputs as features:
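def pad_and_shift(x):
    steps = 3
    # Zero-pad the start of the time axis so every step has steps-1 predecessors
    paddings = tf.constant([[0,0], [steps-1, 0], [0, 0]])
    x = tf.pad(x, paddings)
    # Shifted copies of the sequence, oldest first
    to_concat = [ x[:,i:i - steps + 1] for i in range(steps-1) ]
    # append, not +=: += on a list would iterate over the tensor
    to_concat.append(x[:, steps-1:])
    return tf.concat(to_concat, axis=-1)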
given_inputs = ....
out = Lambda(pad_and_shift)(given_inputs)
out = LSTM(units, ...)(out)
The other involves editing the source code of the LSTM, which would be very complicated and probably not worthwhile.
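To see what the first option (concatenating shifted inputs) does to the data, here is a NumPy sketch of the same idea on a toy sequence. `pad_and_shift_np` is a hypothetical stand-in written for this illustration, not the Lambda layer itself.

```python
import numpy as np

def pad_and_shift_np(x, steps=3):
    """Stack each timestep with its (steps - 1) predecessors as extra features."""
    # Zero-pad the start of the time axis so early steps have "previous" values
    padded = np.pad(x, [(0, 0), (steps - 1, 0), (0, 0)])
    length = x.shape[1]
    # Slice i holds the values from (steps - 1 - i) steps back
    shifted = [padded[:, i:i + length] for i in range(steps)]
    return np.concatenate(shifted, axis=-1)

x = np.array([1.0, 2.0, 3.0, 4.0]).reshape(1, 4, 1)   # (batch, time, features)
out = pad_and_shift_np(x)
print(out.shape)   # time length is unchanged, features are multiplied by steps
print(out[0])      # each row is [x(t-2), x(t-1), x(t)]
```

The LSTM then sees three features per step instead of one, which enlarges the kernel (input side) but leaves the recurrent kernel at its original shape.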

How can I improve my LSTM accuracy in Tensorflow

I'm trying to figure out how to decrease the error in my LSTM. It's an odd use-case because rather than classifying, we are taking in short lists (up to 32 elements long) and outputting a series of real numbers, ranging from -1 to 1 - representing angles. Essentially, we want to reconstruct short protein loops from amino acid inputs.
In the past we had redundant data in our datasets, so the accuracy reported was incorrect. Since removing the redundant data our validation accuracy has gotten much worse, which suggests our network had learned to memorise the most frequent examples.
Our dataset is 10,000 items, split 70/20/10 between train, validation and test. We use a bi-directional LSTM as follows:
x = tf.cast(tf_train_dataset, dtype=tf.float32)
output_size = FLAGS.max_cdr_length * 4
dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
keep_prob = tf.placeholder(tf.float32, name="keepprob")
sizes = [FLAGS.lstm_size,int(math.floor(FLAGS.lstm_size/2)),int(math.floor(FLAGS.lstm_size/ 4))]
single_rnn_cell_fw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_fw" + str(i)) for i in range(len(sizes))])
single_rnn_cell_bw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_bw" + str(i)) for i in range(len(sizes))])
length = create_length(x)
initial_state = single_rnn_cell_fw.zero_state(FLAGS.batch_size, dtype=tf.float32)
initial_state = single_rnn_cell_bw.zero_state(FLAGS.batch_size, dtype=tf.float32)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=single_rnn_cell_fw, cell_bw=single_rnn_cell_bw, inputs=x, dtype=tf.float32, sequence_length = length)
output_fw, output_bw = outputs
states_fw, states_bw = states
output_fw = last_relevant(FLAGS, output_fw, length, "last_fw")
output_bw = last_relevant(FLAGS, output_bw, length, "last_bw")
output = tf.concat((output_fw, output_bw), axis=1, name='bidirectional_concat_outputs')
test = tf.placeholder(tf.float32, [None, output_size], name="train_test")
W_o = weight_variable([sizes[-1]*2, output_size], "weight_output")
b_o = bias_variable([output_size],"bias_output")
y_conv = tf.tanh( ( tf.matmul(output, W_o)) * dmask, name="output")
Essentially, we use 3 layers of LSTM, with 256, 128 and 64 units each. We take the last step of both the Forward and Backward passes and concatenate them together. These feed into a final, fully connected layer that presents the data in the way we need it. We use a mask to zero out the steps we don't need.
Our cost function uses a mask again, and takes the mean of the squared difference. We build the mask from the test data. Values to ignore are set to -3.0.
def cost(goutput, gtest, gweights, FLAGS):
    mask = tf.sign(tf.add(gtest,3.0))
    basic_error = tf.square(gtest-goutput) * mask
    basic_error = tf.reduce_sum(basic_error)
    basic_error /= tf.reduce_sum(mask)
    return basic_error
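Since the masking is one of the things to double check, here is a NumPy sketch of the same masked mean squared error that can be tested on small hand-made arrays. `masked_mse` and the toy values are made up for this illustration; it assumes, as stated above, that valid targets lie in [-1, 1] and ignored ones are exactly -3.0.

```python
import numpy as np

def masked_mse(output, target, ignore_value=-3.0):
    """Mean squared error over the entries whose target is not ignore_value."""
    # sign(target + 3) is 0 where target == -3 and 1 for targets in [-1, 1]
    mask = np.sign(target - ignore_value)
    squared_error = np.square(target - output) * mask
    # Divide by the number of valid entries, not by the total size
    return squared_error.sum() / mask.sum()

target = np.array([[0.5, -0.5, -3.0, -3.0]])   # last two entries are padding
output = np.array([[0.0, -0.5, 0.9, -0.2]])    # predictions at padded steps are arbitrary
print(masked_mse(output, target))
```

Changing the predictions at the padded positions should leave the result untouched; if it doesn't, the mask is not doing its job.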
To train the net I've used a variety of optimizers. The lowest scores have been obtained with the AdamOptimizer. The others, such as Adagrad, Adadelta, RMSProp tend to flatline around 0.3/0.4 error which is not particularly great.
Our learning rate is 0.004, batch size of 200. We use a 0.5 probability dropout layer.
I've tried adding more layers, changing learning rates, batch sizes, even the representation of the data. I've attempted batch regularisation, L1 and L2 weight regularisation (though perhaps incorrectly) and I've even considered switching to a convnet approach instead.
Nothing seems to make any difference. What has seemed to work is changing the optimizer. Adam seems noisier as it improves, but it does get closer than the other optimizers.
We need to get down to a value much closer to 0.05 or 0.01. Sometimes the training error touches 0.09 but the validation doesn't follow. I've run this network for about 500 epochs so far (about 8 hours) and it tends to settle around 0.2 validation error.
I'm not quite sure what to attempt next. Decayed learning rate might help but I suspect there is something more fundamental I need to do. It could be something as simple as a bug in the code - I need to double check the masking.

Not understanding code used in TensorFlow MNIST guide

I'm reading through the MNIST TensorFlow guide, and trying to get a good understanding of what's going on.
The first set of steps, with added comments, looks like this:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
# Download the data set.
# Comprises thousands of images, each with a label.
# Our images are 28x28, so we have 784 pixels in total.
# one_hot means our labels are treated as a vector with a
# length of 10. e.g. for the number 4, it'd be
# [0, 0, 0, 0, 1, 0, 0, 0, 0, 0].
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# x isn't a specific value. It's a placeholder, a value that
# we'll input when we ask TensorFlow to run a computation.
# We want to input any number of MNIST images, each flattened
# into a 784-dimensional vector (e.g. an array made up of a
# double for each pixel, representing pixel brightness).
# Takes the form of [Image, Pixel].
x = tf.placeholder(tf.float32, [None, 784])
# Variables are modifiable tensors, which live in TensorFlow's
# graph of interacting operations. It can be used and modified
# by the computation. Model parameters are usually set as Variables.
# Weights
# Takes the form of [Pixel, Digit]
W = tf.Variable(tf.zeros([784, 10]))
# Biases
# Takes the form of [Digit]
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
So now I'm trying to break down this last line to figure out what's going on.
They provide this diagram:
Ignoring the softmax step, and ignoring the adding of biases, so just looking at that top line:
(W1,1 * x1) + (W1,2 * x2) + (W1,3 * x3).
Since x is now 1-dimensional, I'll assume it's specific to a particular image, and so the x value is each pixel within that image. We thus have:
(Weight of 1st pixel for 1st digit * value of 1st pixel) + (Weight of 1st pixel for 2nd digit * value of 2nd pixel) + (Weight of 1st pixel for 3rd digit * value of 3rd pixel)
This doesn't seem right. The weight tensor's first dimension representing pixels, where the x tensor's second dimension represents pixels, means we're multiplying the values of different pixels... this doesn't make any sense to me.
Am I misunderstanding something here?
This model is very simple and probably isn't worth an in-depth discussion, but your conclusion isn't correct. Pixel values are never multiplied with each other. This is a linear model:
tf.matmul(x, W) + b
... which naively assumes an image is a bunch of independent pixels. Each pixel gets multiplied by different weights corresponding to 10 classes. In other words, this linear layer assigns a weight to each (pixel, class) pair. This directly corresponds to its shape: [784, 10] (I'm ignoring the bias term for simplicity).
As a result of this multiplication, a final 10-length vector contains the scores for each class. Each score takes into account each pixel, more precisely it's a weighted sum of all pixel values. The score then goes to the loss function to compare the output with the ground truth, so that in the next iteration we could tweak those weights in the right direction.
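A toy version of the same computation makes the weighted sum explicit. The sizes here (3 "pixels", 2 classes) and all numbers are made up for illustration; the real model uses 784 pixels and 10 classes.

```python
import numpy as np

x = np.array([[0.0, 0.5, 1.0]])   # one image with 3 "pixels": shape [Image, Pixel]
W = np.array([[1.0, -1.0],
              [2.0,  0.0],
              [3.0,  4.0]])        # shape [Pixel, Class], 2 classes
b = np.array([0.1, 0.2])           # one bias per class

scores = x @ W + b                 # same form as tf.matmul(x, W) + b

# The same scores written out as explicit weighted sums over pixels:
# pixel i is only ever multiplied by its own weights W[i, j]
manual = [sum(x[0, i] * W[i, j] for i in range(3)) + b[j] for j in range(2)]

print(scores[0], manual)
```

Each pixel value appears once per class, multiplied by the weight for that (pixel, class) pair, and no pixel is ever multiplied by another pixel.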
Though it's very simple, it is still a reasonable approach.