I have a list of tensors and list representing their probability mass function. How can I each session run tell tensorflow to randomly pick one tensor according to probability mass function.
I see few possible ways to do that:
One is packing list of tensors in rank one higher, and select one with slice & squeeze based on tensorflow variable I'm going to assign correct index. What would be performance penalty for this approach? Would tensorflow evaluate other, non-needed tensors?
Another is using tf.case in similar fashion as before with me picking one tensor out of many. Same question -> What's the performance penalty since I plan on having quite a few(~100s) conditional statements per one graph run.
Is there any better way of doing this?
I think you should use tf.multinomial(logits, num_samples).
Say you have:
a batch of tensors of shape [batch_size, num_features]
a probability distribution of shape [batch_size]
You want to output:
1 example from the batch of tensors, of shape [1, num_features]
batch_tensors = tf.constant([[0., 1., 2.], [3., 4., 5.]]) # shape [batch_size, num_features]
probabilities = tf.constant([0.7, 0.3]) # shape [batch_size]
# we need to convert probabilities to log_probabilities and reshape it to [1, batch_size]
rescaled_probas = tf.expand_dims(tf.log(probabilities), 0) # shape [1, batch_size]
# We can now draw one example from the distribution (we could draw more)
indice = tf.multinomial(rescaled_probas, num_samples=1)
output = tf.gather(batch_tensors, tf.squeeze(indice, [0]))
What's the performance penalty since I plan on having quite a few(~100s) conditional statements per one graph run?
If you want to do multiple draws, you should do it in one run by increasing the parameter num_samples. You can then gather these num_samples examples in one run with tf.gather.
Related
I am struggling to conceptualize the correct way to prepare a timeseries dataset for LSTM training. My main concern is how do I train the network to 'remember' N previous steps. I have two possible ways in my mind but I am not sure which one is the correct.
I am really confused with this, I have tried both approaches (for 1 variable however) and they both seem to provide some plausible results.
1.) The dataset should be in a tensor format like this
X1=[[var1_t_1, var2_t_1, var3_t_1],
[var1_t_2, var2_t_2, var3_t_2],
...]
X.shape = [N, 3]
y=[ [target_t_1],
[target_t_2],
...]
y.shape = [N, 1]
During training the LSTM gets N inputs, one for each timestep, and returns back N predictions that are used to compute the loss and update weights.
The network on its own "creates memmory" about previous time step values through its cell states. But for how many previous steps can it create memmory, is there any way to define this memmory (if possible answer with pytorch example).
2.) The dataset should already contain the previous timestep values as features, so a 3rd dimension is neccessary eg.
X = [ [var1_t_1, var1_t_2,..., var1_t_10], [var2_t_1,..., var2_t_10], [var3_t_1,..., var3_t_10],
[var1_t_2, var1_t_3,..., var1_t_11], [var2_t_2,..., var2_t_11], [var3_t_2,..., var3_t_11],
...]
X.shape = [N-10, 10, 3]
y = [ [target_t_11],
[target_t_12],
... ]
y.shape = [N-10, 1]
In this way we define the number of previous steps the LSTM should try to remember. For the example above we "ask" the LSTM to remember at least 10 previous prices in order to make predictions.
Any help to clarify the concept is greatly appreciated. Pytorch code would be extremely welcome as well.
I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
data_qubits = cirq.GridQubit.rect(4, 1)
circuit = cirq.Circuit()
...
return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]
model_circuit, model_readout = circuit()
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(), dtype=tf.string),
# The PQC layer returns the expected value of the readout gate, range [-1,1].
tfq.layers.PQC(model_circuit, model_readout),
])
# compile model
model.compile(
loss = loss_mse,
optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
metrics=[])
in loss_mse (Mean Square Error), I receive a (32, 4) tensor for y_pred. One row could look like
[-0.2, 0.33, 0.6, 0.3]
This would have to be first mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Now, a table lookup needs to happen, which tells if this combination is 0 or 1. Finally, the regular (y_true-y_pred)^2 can be performed by that row, followed by a np.sum on all rows. I tried to implement this:
def get_label(measurement):
if measurement == [0,0,0,0]: return 0
...
elif measurement == [1,1,1,1]: return 0
else: return -1
def py_call(y_true, y_pred):
# cast tensor to numpy
y_pred_np = np.asarray(y_pred)
loss = np.zeros((len(y_pred))) # could be a single variable with += within the loop
# evalaute all 32 samples
for pred in range(len(y_pred_np)):
# map, binarize and lookup
y_labelled = get_label([0 if y<0 else 1 for y in y_pred_np[pred]])
# regular loss comparison
loss[pred] = (y_labelled - y_true[pred])**2
# reduce
loss = np.sum(loss)/len(y_true)
return loss
#tf.function
def loss_mse(y_true, y_pred):
external_list = []
loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
return loss
However, the system appears to still expect a (32,4) tensor. I would have thought I could simply provide a single loss values (float). My question: how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
So it looks like there are a couple of things going on here. To answer your question
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function ?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. This function will allow you to apply this reduction operation accross a given tensor axis allowing you to convert a tensor of shape (32, 4) to a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
#tf.function
def my_loss(y_true, y_pred):
# y_true is shape (32, 4)
# y_pred is shape (32, 4)
# Scale from [-1, 1] to [0, 1]
y_true += 1
y_true /= 2
y_pred += 1
y_pred /= 2
# These are now both (32,) with the reduction of taking the mean applied along
# the second axis.
reduced_true = tf.reduce_mean(y_true, axis=1)
reduced_pred = tf.reduce_mean(y_pred, axis=1)
# Now a scalar loss.
loss = tf.reduce_mean((reduce_true - reduced_pred) ** 2)
return loss
Now the above isn't exactly what you want, since it's not super clear to me at least what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
Another thing I will also mention is that if you want JUST the sum of these Pauli Operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])] and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_op)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatable as a readout operation in the PQC layer. Lastly if would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
Which give a pretty good description of how the input and output signatures of the functions look as well as the shapes you can expect from them.
I want to increase amount of recurrent weights in rnn or lstm cell.
If you look at the code below you will see, that lsrm cell inputs shape is (2,1), which means 2 timesteps and 1 feature.
%tensorflow_version 2.x
import tensorflow as tf
m = tf.keras.models.Sequential()
lstm = tf.keras.layers.LSTM(1, use_bias=False)
input = tf.keras.Input(shape=(2,1))
m.add(input)
m.add(lstm)
lstm.get_weights()
The output is
[array([[ 0.878217 , 0.89324415, 0.404307 , -1.0542995 ]], dtype=float32),
array([[-0.24181306, -0.341401 , 0.65207034, 0.63227856]], dtype=float32)]
4 weights for each feature, and 4 weights for previous outputs
Now if I change Input shape like this
input = tf.keras.Input(shape=(2,1))
then the output of get_weights function will be like this:
[array([[-0.9725287 , -0.90078545, 0.97881985, -0.9623983 ],
[-0.9644511 , 0.90705967, 0.05965471, 0.32613564]], dtype=float32),
array([[-0.24867296, -0.22346373, -0.6410606 , 0.69084513]], dtype=float32)]
Now my question is: how do I increase amount of weights in the second array whick keeps the (4,1) shape?
The idea is that I want RNN or STRM take not only the previous output (t-1 moment) but more prevois values like (t-2, t-3, t-4) moments.
Is there way to do it in keras with tf backend?
I can't understand the change, I think you had a typo in your question, but:
Length - Time steps:
The number of time steps will never change the number of weights. The layer is "recurrent", meaning it will "loop" the time steps. It's not supposed to have different weights for each step.
The whole purpose of the layer is to apply the same operations over and over and over for each time step.
Input features:
Input features are the last dimension of the input. They define one dimension of the weights.
Units = Output features:
Output features, also the last dimension of the output, are another dimension of the weights.
Two types of kernels
The LSTM layers have two groups of kernels:
What they call simply kernels - with shape=(input_dim, self.units * 4)
What they call recurrent kernels - with shape=(self.units, self.units * 4)
The first group acts on the input data, they have shape considering the input features and the output features.
The second group acts on inner states and have shapes considering only the output features (units).
From the source code:
self.kernel = self.add_weight(shape=(input_dim, self.units * 4),
name='kernel',
initializer=self.kernel_initializer,
regularizer=self.kernel_regularizer,
constraint=self.kernel_constraint)
self.recurrent_kernel = self.add_weight(
shape=(self.units, self.units * 4),
name='recurrent_kernel',
initializer=self.recurrent_initializer,
regularizer=self.recurrent_regularizer,
constraint=self.recurrent_constraint)
The last array in the list:
The last array in the list of weights are the 4 recurrent kernels with shape (1, 1) grouped into one.
So:
You can increase the kernels with more input features. Transform Input((anything, 1)) into Input((anything, more)) for instance.
You can increase the kernels and the recurrent_kernels (and biases, when considered) with bigger output features. Transform LSTM(1, ...) into LSTM(more, ...)
Weights are independent of the lenght. It's even possible to have Input((None, 1)), meaning a variable length.
Using more than just the last step
This should be automatic. LSTM layers are designed to have memory. The memory is an inner state that participates in all time steps. There are gates (the kernels) that decide how a new step will participate in this memory. Since all steps participate in the same memory, LSTM layer theoretically considers "all" time steps from the beginning.
So, you shouldn't really worry with this.
But if you do want this, there are maybe two ways. Don't know if they will bring any improvement, though.
One is to concatenate shifted inputs as features:
def pad_and_shift(x):
steps = 3
paddings = tf.constant([[0,0], [steps-1, 0], [0, 0]])
x = tf.pad(x, paddings)
to_concat = [ x[:,i:i - steps + 1] for i in range(steps-1) ]
to_concat += x[:, steps-1:]
return tf.concat(to_concat, axis=-1)
given_inputs = ....
out = Lambda(pad_and_shift)(given_inputs)
out = LSTM(units, ...)(out)
The other involves editing the source code of the LSTM, which would be very complicated and probably not very worthy.
When I use the
tf.losses.mean_pairwise_squared_error(labels, predictions, weights=1.0, scope=None, loss_collection=tf.GraphKeys.LOSSES)
function, I am sure the data is right. However, the loss on the tensorboard is always zero. I try hard to find it out, but do not know why? Following is the part of my code. Am I using the wrong shape?
score_a=tf.reshape(score,[-1])#shape: [1,39]
ys_a=tf.reshape(ys,[-1])#shape: [1,39]
with tf.name_scope('loss'):
loss=tf.losses.mean_pairwise_squared_error(score_a,ys_a)
To use tf.losses.mean_pairwise_squared_error(), labels and predictions should be of rank at least 2, because the first dimension will be used as batch_size. It means that you do not need to reshape score_a and ys_a. (I assume that score_a and ys_a have 39 entries and one batch.)
If labels and predictions are of rank 1, it means that all data entries are 0-tensor (scalar) so that the result of tf.losses.mean_pairwise_squared_error() becomes always zero.
One more thing. In my opinion, the current implementation (2018-01-03) of tf.losses.mean_pairwise_squared_error() looks imperfect. For example, as showin in the API document of the function, put the following data as labels and predictions:
labels = tf.constant([[0., 0.5, 1.]])
predictions = tf.constant([[1., 1., 1.]])
tf.losses.mean_pairwise_squared_error(labels, predictions)
In this case, the result should be [(0-0.5)^2+(0-1)^2+(0.5-1)^2]/3=0.5 which is different from the result 0.3333333134651184 by tensorflow.
[Update] The bug (mentioned above) in tf.losses.mean_pairwise_squared_error() was fixed and applied from tensorflow 1.6.0.
If you are doing classification, you may have your label tensor with categorical value in one dimension. If that is the case, you will need to perform one hot transformation to match the shape of the predictions before calling tf.losses.mean_pairwise_squared_error(). You do not need to reshape the predictions.
labels_a = tf.one_hot(labels, 2)
It is really straightforward to see and understand the scalar values in TensorBoard. However, it's not clear how to understand histogram graphs.
For example, they are the histograms of my network weights.
(After fixing a bug thanks to sunside)
What is the best way to interpret these? Layer 1 weights look mostly flat, what does this mean?
I added the network construction code here.
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 6, 10, 1])
tf.summary.image('input', x_image, 4)
# First layer of weights
with tf.name_scope("layer1"):
W1 = tf.get_variable("W1", shape=[input_size, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.matmul(X, W1)
layer1_act = tf.nn.tanh(layer1)
tf.summary.histogram("weights", W1)
tf.summary.histogram("layer", layer1)
tf.summary.histogram("activations", layer1_act)
# Second layer of weights
with tf.name_scope("layer2"):
W2 = tf.get_variable("W2", shape=[hidden_layer_neurons, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer2 = tf.matmul(layer1_act, W2)
layer2_act = tf.nn.tanh(layer2)
tf.summary.histogram("weights", W2)
tf.summary.histogram("layer", layer2)
tf.summary.histogram("activations", layer2_act)
# Third layer of weights
with tf.name_scope("layer3"):
W3 = tf.get_variable("W3", shape=[hidden_layer_neurons, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer3 = tf.matmul(layer2_act, W3)
layer3_act = tf.nn.tanh(layer3)
tf.summary.histogram("weights", W3)
tf.summary.histogram("layer", layer3)
tf.summary.histogram("activations", layer3_act)
# Fourth layer of weights
with tf.name_scope("layer4"):
W4 = tf.get_variable("W4", shape=[hidden_layer_neurons, output_size],
initializer=tf.contrib.layers.xavier_initializer())
Qpred = tf.nn.softmax(tf.matmul(layer3_act, W4)) # Bug fixed: Qpred = tf.nn.softmax(tf.matmul(layer3, W4))
tf.summary.histogram("weights", W4)
tf.summary.histogram("Qpred", Qpred)
# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(tf.float32, [None, output_size], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")
# Loss function
# Sum (Ai*logp(yi|xi))
log_lik = -Y * tf.log(Qpred)
loss = tf.reduce_mean(tf.reduce_sum(log_lik * advantages, axis=1))
tf.summary.scalar("Q", tf.reduce_mean(Qpred))
tf.summary.scalar("Y", tf.reduce_mean(Y))
tf.summary.scalar("log_likelihood", tf.reduce_mean(log_lik))
tf.summary.scalar("loss", loss)
# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
It appears that the network hasn't learned anything in the layers one to three. The last layer does change, so that means that there either may be something wrong with the gradients (if you're tampering with them manually), you're constraining learning to the last layer by optimizing only its weights or the last layer really 'eats up' all error. It could also be that only biases are learned. The network appears to learn something though, but it might not be using its full potential. More context would be needed here, but playing around with the learning rate (e.g. using a smaller one) might be worth a shot.
In general, histograms display the number of occurrences of a value relative to each other values. Simply speaking, if the possible values are in a range of 0..9 and you see a spike of amount 10 on the value 0, this means that 10 inputs assume the value 0; in contrast, if the histogram shows a plateau of 1 for all values of 0..9, it means that for 10 inputs, each possible value 0..9 occurs exactly once.
You can also use histograms to visualize probability distributions when you normalize all histogram values by their total sum; if you do that, you'll intuitively obtain the likelihood with which a certain value (on the x axis) will appear (compared to other inputs).
Now for layer1/weights, the plateau means that:
most of the weights are in the range of -0.15 to 0.15
it is (mostly) equally likely for a weight to have any of these values, i.e. they are (almost) uniformly distributed
Said differently, almost the same number of weights have the values -0.15, 0.0, 0.15 and everything in between. There are some weights having slightly smaller or higher values.
So in short, this simply looks like the weights have been initialized using a uniform distribution with zero mean and value range -0.15..0.15 ... give or take. If you do indeed use uniform initialization, then this is typical when the network has not been trained yet.
In comparison, layer1/activations forms a bell curve (gaussian)-like shape: The values are centered around a specific value, in this case 0, but they may also be greater or smaller than that (equally likely so, since it's symmetric). Most values appear close around the mean of 0, but values do range from -0.8 to 0.8.
I assume that the layer1/activations is taken as the distribution over all layer outputs in a batch. You can see that the values do change over time.
The layer 4 histogram doesn't tell me anything specific. From the shape, it's just showing that some weight values around -0.1, 0.05 and 0.25 tend to be occur with a higher probability; a reason could be, that different parts of each neuron there actually pick up the same information and are basically redundant. This can mean that you could actually use a smaller network or that your network has the potential to learn more distinguishing features in order to prevent overfitting. These are just assumptions though.
Also, as already stated in the comments below, do add bias units. By leaving them out, you are forcefully constraining your network to a possibly invalid solution.
Here I would indirectly explain the plot by giving a minimal example. The following code produce a simple histogram plot in tensorboard.
from datetime import datetime
import tensorflow as tf
filename = datetime.now().strftime("%Y%m%d-%H%M%S")
fw = tf.summary.create_file_writer(f'logs/fit/{filename}')
with fw.as_default():
for i in range(10):
t = tf.random.uniform((2, 2), 1000)
tf.summary.histogram(
"train/hist",
t,
step=i
)
print(t)
We see that generating a 2x2 matrix with a maximum range 1000 will produce values from 0-1000. To how this tensor might look, i am putting log of a few of them here.
tf.Tensor(
[[398.65747 939.9828 ]
[942.4269 59.790222]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[869.5309 980.9699 ]
[149.97845 454.524 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[967.5063 100.77594 ]
[ 47.620544 482.77008 ]], shape=(2, 2), dtype=float32)
We logged into tensorboard 10 times. The to right of the plot, a timeline is generated to indicate timesteps. The depth of histogram indicate which values are new. The lighter/front values are newer and darker/far values are older.
Values are gathered into buckets which are indicated by those triangle structures. x-axis indicate the range of values where the bunch lies.