I'm reading through the MNIST TensorFlow guide, and trying to get a good understanding of what's going on.
The first set of steps, with added comments, looks like this:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
# Download the data set.
# Comprises thousands of images, each with a label.
# Our images are 28x28, so we have 784 pixels in total.
# one_hot means our labels are treated as a vector with a
# length of 10. e.g. for the number 4, it'd be
# [0, 0, 0, 0, 1, 0, 0, 0, 0, 0].
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# x isn't a specific value. It's a placeholder, a value that
# we'll input when we ask TensorFlow to run a computation.
# We want to input any number of MNIST images, each flattened
# into a 784-dimensional vector (e.g. an array made up of a
# double for each pixel, representing pixel brightness).
# Takes the form of [Image, Pixel].
x = tf.placeholder(tf.float32, [None, 784])
# Variables are modifiable tensors, which live in TensorFlow's
# graph of interacting operations. It can be used and modified
# by the computation. Model parameters are usually set as Variables.
# Weights
# Takes the form of [Pixel, Digit]
W = tf.Variable(tf.zeros([784, 10]))
# Biases
# Takes the form of [Digit]
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
So now I'm trying to breakdown this last line to figure out what's going on.
They provide this diagram:
Ignoring the softmax step, and ignoring the adding of biases, so just looking at that top line:
(W1,1 * x1) + (W1,2 * x2) + (W1,3 * x3).
Since x is now 1-dimensional, I'll assume it's specific to a particular image, and so the x value is each pixel within that image. We thus have:
(Weight of 1st pixel for 1st digit * value of 1st pixel) + (Weight of 1st pixel for 2nd digit * value of 2nd pixel) + (Weight of 1st pixel for 3rd digit * value of 3rd pixel)
This doesn't seem right. The weight tensor's first dimension representing pixels, where the x tensor's second dimension represents pixels, means we're multiplying the values of different pixels... this doesn't make any sense to me.
Am I misunderstanding something here?
This model is very simple and probably isn't worth of in-depth discussion, but your conclusion isn't correct. Pixel values are never multiplied. This is a linear model:
tf.matmul(x, W) + b
... which naively assumes an image is a bunch of independent pixels. Each pixel gets multiplied by different weights corresponding to 10 classes. In other words, this linear layer assigns a weight to each (pixel, class) pair. This directly corresponds to its shape: [784, 10] (I'm ignoring the bias term for simplicity).
As a result of this multiplication, a final 10-length vector contains the scores for each class. Each score takes into account each pixel, more precisely it's a weighted sum of all pixel values. The score then goes to the loss function to compare the output with the ground truth, so that in the next iteration we could tweak those weights in the right direction.
Though it's very simple, it is still a reasonable approach.
Related
I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
data_qubits = cirq.GridQubit.rect(4, 1)
circuit = cirq.Circuit()
...
return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]
model_circuit, model_readout = circuit()
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(), dtype=tf.string),
# The PQC layer returns the expected value of the readout gate, range [-1,1].
tfq.layers.PQC(model_circuit, model_readout),
])
# compile model
model.compile(
loss = loss_mse,
optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
metrics=[])
in loss_mse (Mean Square Error), I receive a (32, 4) tensor for y_pred. One row could look like
[-0.2, 0.33, 0.6, 0.3]
This would have to be first mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Now, a table lookup needs to happen, which tells if this combination is 0 or 1. Finally, the regular (y_true-y_pred)^2 can be performed by that row, followed by a np.sum on all rows. I tried to implement this:
def get_label(measurement):
if measurement == [0,0,0,0]: return 0
...
elif measurement == [1,1,1,1]: return 0
else: return -1
def py_call(y_true, y_pred):
# cast tensor to numpy
y_pred_np = np.asarray(y_pred)
loss = np.zeros((len(y_pred))) # could be a single variable with += within the loop
# evalaute all 32 samples
for pred in range(len(y_pred_np)):
# map, binarize and lookup
y_labelled = get_label([0 if y<0 else 1 for y in y_pred_np[pred]])
# regular loss comparison
loss[pred] = (y_labelled - y_true[pred])**2
# reduce
loss = np.sum(loss)/len(y_true)
return loss
#tf.function
def loss_mse(y_true, y_pred):
external_list = []
loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
return loss
However, the system appears to still expect a (32,4) tensor. I would have thought I could simply provide a single loss values (float). My question: how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
So it looks like there are a couple of things going on here. To answer your question
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function ?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. This function will allow you to apply this reduction operation accross a given tensor axis allowing you to convert a tensor of shape (32, 4) to a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
#tf.function
def my_loss(y_true, y_pred):
# y_true is shape (32, 4)
# y_pred is shape (32, 4)
# Scale from [-1, 1] to [0, 1]
y_true += 1
y_true /= 2
y_pred += 1
y_pred /= 2
# These are now both (32,) with the reduction of taking the mean applied along
# the second axis.
reduced_true = tf.reduce_mean(y_true, axis=1)
reduced_pred = tf.reduce_mean(y_pred, axis=1)
# Now a scalar loss.
loss = tf.reduce_mean((reduce_true - reduced_pred) ** 2)
return loss
Now the above isn't exactly what you want, since it's not super clear to me at least what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
Another thing I will also mention is that if you want JUST the sum of these Pauli Operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])] and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_op)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatable as a readout operation in the PQC layer. Lastly if would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
Which give a pretty good description of how the input and output signatures of the functions look as well as the shapes you can expect from them.
I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix tensorflow and keras in you customer loss function. Once you can access to all Tensorflow function, things become very easy. I just give you a example of how this function could be imlement.
import tensorflow as tf
def custom_loss(target, softmax):
max_indices = tf.argmax(softmax, -1)
# Get the embedding matrix. In Tensorflow, this can be directly done
# with tf.nn.embedding_lookup
embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
# Do anything you want with normal keras loss function
loss = some_keras_loss_function(target, embedding_vectors)
loss = tf.reduce_mean(loss)
return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-derivable operations. Note such operations are acceptable for the real value (a loss function takes a real value and a predicted value, non-derivable operations are only fine for the real value).
To be fair, that was what I was asking in the first place. It is not possible to do what I wanted, but we can get a similar and derivable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4 [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2400]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.24. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, 1)
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally we want the vector from the dictionary. Lookup is forbidden, but we can take the average vector using matrix multiplication. Because our soft_normalized is similar to one-hot encoding this average will be similar to the vector associated to the highest argument (original intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" (from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape. Then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
My understanding of noise contrastive estimation is that we sample some vectors from our word embeddings (the negative sample), and then calculate the log-likelihood of each. Then we want to maximize the difference between the probability of the target word and the log-likelihood of each of the negative sample words (So if I'm correct about this, we want to optimize the loss function so that it gets as close to 1 as possible).
My question is this:
What is the purpose of the num_classes parameters to the nce_loss function? My best guess is that the number of classes is passed in so that Tensorflow knows the size of the distribution from which the negative samples our drawn, but this might not make sense, since we could just infer the size of the distribution from the variable itself. Otherwise, I can't think of a reason for why we would need to know the total possible number of classes, especially if the language model is only outputting k + 1 predictions (negative sample size + 1 for the target word).
Your guess is correct. The num_classes argument is used to sample negative labels from the log-uniform (Zipfian) distribution.
Here's the link to the source code:
# Sample the negative labels.
# sampled shape: [num_sampled] tensor
# true_expected_count shape = [batch_size, 1] tensor
# sampled_expected_count shape = [num_sampled] tensor
if sampled_values is None:
sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
true_classes=labels,
num_true=num_true,
num_sampled=num_sampled,
unique=True,
range_max=num_classes)
The range_max=num_classes argument basically defines the shape of this distribution and also the range of the sampled values - [0, range_max). Note that this range can't be accurately inferred from the labels, because a particular mini-batch can have only small word ids, which would skew the distribution significantly.
It is really straightforward to see and understand the scalar values in TensorBoard. However, it's not clear how to understand histogram graphs.
For example, they are the histograms of my network weights.
(After fixing a bug thanks to sunside)
What is the best way to interpret these? Layer 1 weights look mostly flat, what does this mean?
I added the network construction code here.
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 6, 10, 1])
tf.summary.image('input', x_image, 4)
# First layer of weights
with tf.name_scope("layer1"):
W1 = tf.get_variable("W1", shape=[input_size, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.matmul(X, W1)
layer1_act = tf.nn.tanh(layer1)
tf.summary.histogram("weights", W1)
tf.summary.histogram("layer", layer1)
tf.summary.histogram("activations", layer1_act)
# Second layer of weights
with tf.name_scope("layer2"):
W2 = tf.get_variable("W2", shape=[hidden_layer_neurons, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer2 = tf.matmul(layer1_act, W2)
layer2_act = tf.nn.tanh(layer2)
tf.summary.histogram("weights", W2)
tf.summary.histogram("layer", layer2)
tf.summary.histogram("activations", layer2_act)
# Third layer of weights
with tf.name_scope("layer3"):
W3 = tf.get_variable("W3", shape=[hidden_layer_neurons, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer3 = tf.matmul(layer2_act, W3)
layer3_act = tf.nn.tanh(layer3)
tf.summary.histogram("weights", W3)
tf.summary.histogram("layer", layer3)
tf.summary.histogram("activations", layer3_act)
# Fourth layer of weights
with tf.name_scope("layer4"):
W4 = tf.get_variable("W4", shape=[hidden_layer_neurons, output_size],
initializer=tf.contrib.layers.xavier_initializer())
Qpred = tf.nn.softmax(tf.matmul(layer3_act, W4)) # Bug fixed: Qpred = tf.nn.softmax(tf.matmul(layer3, W4))
tf.summary.histogram("weights", W4)
tf.summary.histogram("Qpred", Qpred)
# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(tf.float32, [None, output_size], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")
# Loss function
# Sum (Ai*logp(yi|xi))
log_lik = -Y * tf.log(Qpred)
loss = tf.reduce_mean(tf.reduce_sum(log_lik * advantages, axis=1))
tf.summary.scalar("Q", tf.reduce_mean(Qpred))
tf.summary.scalar("Y", tf.reduce_mean(Y))
tf.summary.scalar("log_likelihood", tf.reduce_mean(log_lik))
tf.summary.scalar("loss", loss)
# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
It appears that the network hasn't learned anything in the layers one to three. The last layer does change, so that means that there either may be something wrong with the gradients (if you're tampering with them manually), you're constraining learning to the last layer by optimizing only its weights or the last layer really 'eats up' all error. It could also be that only biases are learned. The network appears to learn something though, but it might not be using its full potential. More context would be needed here, but playing around with the learning rate (e.g. using a smaller one) might be worth a shot.
In general, histograms display the number of occurrences of a value relative to each other values. Simply speaking, if the possible values are in a range of 0..9 and you see a spike of amount 10 on the value 0, this means that 10 inputs assume the value 0; in contrast, if the histogram shows a plateau of 1 for all values of 0..9, it means that for 10 inputs, each possible value 0..9 occurs exactly once.
You can also use histograms to visualize probability distributions when you normalize all histogram values by their total sum; if you do that, you'll intuitively obtain the likelihood with which a certain value (on the x axis) will appear (compared to other inputs).
Now for layer1/weights, the plateau means that:
most of the weights are in the range of -0.15 to 0.15
it is (mostly) equally likely for a weight to have any of these values, i.e. they are (almost) uniformly distributed
Said differently, almost the same number of weights have the values -0.15, 0.0, 0.15 and everything in between. There are some weights having slightly smaller or higher values.
So in short, this simply looks like the weights have been initialized using a uniform distribution with zero mean and value range -0.15..0.15 ... give or take. If you do indeed use uniform initialization, then this is typical when the network has not been trained yet.
In comparison, layer1/activations forms a bell curve (gaussian)-like shape: The values are centered around a specific value, in this case 0, but they may also be greater or smaller than that (equally likely so, since it's symmetric). Most values appear close around the mean of 0, but values do range from -0.8 to 0.8.
I assume that the layer1/activations is taken as the distribution over all layer outputs in a batch. You can see that the values do change over time.
The layer 4 histogram doesn't tell me anything specific. From the shape, it's just showing that some weight values around -0.1, 0.05 and 0.25 tend to be occur with a higher probability; a reason could be, that different parts of each neuron there actually pick up the same information and are basically redundant. This can mean that you could actually use a smaller network or that your network has the potential to learn more distinguishing features in order to prevent overfitting. These are just assumptions though.
Also, as already stated in the comments below, do add bias units. By leaving them out, you are forcefully constraining your network to a possibly invalid solution.
Here I would indirectly explain the plot by giving a minimal example. The following code produce a simple histogram plot in tensorboard.
from datetime import datetime
import tensorflow as tf
filename = datetime.now().strftime("%Y%m%d-%H%M%S")
fw = tf.summary.create_file_writer(f'logs/fit/{filename}')
with fw.as_default():
for i in range(10):
t = tf.random.uniform((2, 2), 1000)
tf.summary.histogram(
"train/hist",
t,
step=i
)
print(t)
We see that generating a 2x2 matrix with a maximum range 1000 will produce values from 0-1000. To how this tensor might look, i am putting log of a few of them here.
tf.Tensor(
[[398.65747 939.9828 ]
[942.4269 59.790222]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[869.5309 980.9699 ]
[149.97845 454.524 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[967.5063 100.77594 ]
[ 47.620544 482.77008 ]], shape=(2, 2), dtype=float32)
We logged into tensorboard 10 times. The to right of the plot, a timeline is generated to indicate timesteps. The depth of histogram indicate which values are new. The lighter/front values are newer and darker/far values are older.
Values are gathered into buckets which are indicated by those triangle structures. x-axis indicate the range of values where the bunch lies.
Due to the project work of my master study I am implementing a neural network using the tensorflow library form Google. At that I would like to determine (at the output layer of my feed forward neural network) several labels in parallel. And as activation function of the output layer I want to use the softmax function.
So what I want to have specifically is a output is a Vector that looks like this:
vec = [0.1, 0.8, 0.1, 0.3, 0.2, 0.5]
Here the first three numbers are the probabilities of the three classes of the first classification and the other three numbers are the probabilities of the three classes of the second classification. So in this case I would say that the labels are:
[ class2 , class3 ]
In a first attempt I tried to implement this by first reshapeing the (1x6) vector to a (2x3) Matrix with tf.reshape(), then apply the softmax-function on the matrix tf.nn.softmax() and finally reshape the matrix back to a vector. Unfortunately, due to the reshaping, the Gradient-Descent-Optimizer gets problems with calculating the gradient, so I tried something different.
What I do now is, I take the (1x6) vector and multiply it my a matrix that has a (3x3) identity-matrix in the upper part and a (3x3) zero-matrix in the lower part. Whit this I extract the first three entries of the vector. Then I can apply the softmax function and bring it back into the old form of (1x6) by another matrix multiplication. This has to be repeated for the other three vector entries as well.
outputSoftmax = tf.nn.softmax( vec * [[1,0,0],[0,1,0],[0,0,1],[0,0,0],[0,0,0],[0,0,0]] ) * tf.transpose( [[1,0,0],[0,1,0],[0,0,1],[0,0,0],[0,0,0],[0,0,0]] )
+ tf.nn.softmax( vec * [[0,0,0],[0,0,0],[0,0,0],[1,0,0],[0,1,0],[0,0,1]] ) * tf.transpose( [[0,0,0],[0,0,0],[0,0,0],[1,0,0],[0,1,0],[0,0,1]] )
It works so far, but I don't like this solution.
Because in my real problem, I not only have to determine two labels at a time but 91, I would have to repeat the procedure form above 91-times.
Does anyone have an solution, how I can obtain the desired vector, where the softmax function is applied on only three entries at a time, without writing the "same" code 91-times?
You could apply the tf.split function to obtain 91 tensors (one for each class), then apply softmax to each of them.
classes_split = tf.split(0, 91, all_in_one)
for c in classes_split:
softmax_class = tf.nn.softmax(c)
# use softmax_class to compute some loss, add it to overall loss
or instead of computing the loss directly, you could also concatenate them together again:
classes_split = tf.split(0, 91, all_in_one)
# softmax each split individually
classes_split_softmaxed = [tf.nn.softmax(c) for c in classes_split]
# Concatenate again
all_in_one_softmaxed = tf.concat(0, classes_split_softmaxed)