difference between tf.matmul and torch.matmul when tolerance >= 1e-5 - tensorflow

gat_key_t = np.random.normal(size = (8, 16, 64, 20)).astype(np.float32)
gat_query_t = np.random.normal(size = (8, 16, 30, 64)).astype(np.float32)
tf_key = tf.convert_to_tensor(gat_key_t)
tf_query = tf.convert_to_tensor(gat_query_t)
pt_key = torch.from_numpy(gat_key_t)
pt_query = torch.from_numpy(gat_query_t)
tf_output = tf.matmul(tf_query, tf_key)
pt_output = torch.matmul(pt_query, pt_key)
# False
np.allclose(tf_output.numpy(), pt_output.numpy(), rtol = 1e-5, atol = 1e-5, equal_nan = False)
# True
np.allclose(tf_output.numpy(), pt_output.numpy(), rtol = 1e-4, atol = 1e-4, equal_nan = False)
When I multiply two tensors, the outputs of torch and tensorflow are different when tolerance is smaller than 1e-5.
As above, two values are the same until 1e-4, but they become different as tolerance becomes smaller.
How can I make two output be the same in the tolerance of 1e-5?

Had encountered this issue recently when trying to port a transformer model from pytorch to TF. Only their CPU version of TF seems to be closer to both pytorch matmul and numpy's matmul. Casting the params to tf.float64 also improves the precision. Their GPU implementation of matmul (which uses cublas) seems to suffer from precision issues.
The closer i got to improve the precision is by native way of implementing matmul:
tf_output = tf.reduce_sum(tf.expand_dims(tf.transpose(tf_query, (0, 1, 3, 2)), -1)*tf.expand_dims(tf_key,3), 2)

Related

Why tf.keras.Model training flag significantly alters the prediction result?

I was recently going through the tensorflow pix2pix tutorial and after playing with it a bit I unexpectedly realized that there is a major difference between the predictions of a tf.keras.Model (In this case the Generator() from the tutorial) where one of the prediction use the training flag to true and the other to false.
Here is the code to demonstrate the issue:
# ...Tutorial steps where I load the model instead of creating a new one...
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
for example_input, example_target in test_dataset.take(1):
train_res1 = generate_images(generator, example_input, example_target) # Function definition as per tutorial expect that I return the 'Predicted Image'
train_res2 = generate_images(generator, example_input, example_target) # Now considering training is true (will alter the model), a small RGB difference is expected
notrain_res2 = generate_images2(generator, example_input, example_target) # Identical to 'generate_images' except that 'training=false', should be identical or similar to last one.
r_avg = np.average(train_res1[:,:, 0])
g_avg = np.average(train_res1[:,:, 1])
b_avg = np.average(train_res1[:,:, 2])
print(f"Training flag true iteration#1 = R average: {r_avg}, G average: {g_avg}, B average: {b_avg}")
r_avg = np.average(train_res2[:,:, 0])
g_avg = np.average(train_res2[:,:, 1])
b_avg = np.average(train_res2[:,:, 2])
print(f"Training flag true iteration#2 = R average: {r_avg}, G average: {g_avg}, B average: {b_avg}")
r_avg = np.average(notrain_res2[:,:, 0])
g_avg = np.average(notrain_res2[:,:, 1])
b_avg = np.average(notrain_res2[:,:, 2])
print(f"Training flag false = R average: {r_avg}, G average: {g_avg}, B average: {b_avg}")
Just to avoid any confusion, here is the code of generate_images2 which is identical to generate_images from tutorial except that 'training=False' and I return the prediction:
def generate_images2(model, test_input, tar):
prediction = model(test_input, training=False)
plt.figure(figsize=(15, 15))
display_list = [test_input[0], tar[0], prediction[0]]
title = ['Input Image', 'Ground Truth', 'Predicted Image']
for i in range(3):
plt.subplot(1, 3, i+1)
plt.title(title[i], color = "w")
# Getting the pixel values in the [0, 1] range to plot.
plt.imshow(display_list[i] * 0.5 + 0.5)
plt.axis('off')
plt.show()
return display_list[2]
Here you can vizualize my concerns with the training flag.
As expected there are minor differences between the RGB values of iteration#1 and iteration#2 with training flag = True. This is expected due to training model alterations.
However when training flag = False, I would expect the RGB values to be similar or identical to the iteration#2 with training flag = True but if you look at the door in the yellow and red circle s the RGB values are clearly different.
The result is pretty much always better with training=True
Question: Why tf.keras.Model training flag significantly alters the prediction result?
Here's an answer from the Tensorflow repo:
https://github.com/tensorflow/tensorflow/issues/36936
There are some things that only happen during training, for example dropout is used. If training=False then dropout layers are ignored (see https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout)

model size too big with my attention model implementation?

I am implementing Minh-Thang Luong's attention model to build a english to chinese translater.And the model i trained has abnormally big size(980 MB).Minh-Thang Luong's original paper
this is model parameters
state size:120
source language vocabulary size:400000
source language word embedding size:400000*50
target language vocabulary size:20000
target language word embedding size:20000*300
This is my model implementation in tensorflow.
import tensorflow as tf
src_vocab_size=400000
src_w2v_dim=50
tgt_vocab_size=20000
tgt_w2v_dim=300
state_size=120
with tf.variable_scope('net_encode'):
ph_src_embedding = tf.placeholder(dtype=tf.float32,shape=[src_vocab_size,src_w2v_dim],name='src_vocab_embedding_placeholder')
#src_word_emb = tf.Variable(initial_value=ph_src_embedding,dtype=tf.float32,trainable=False, name='src_vocab_embedding_variable')
encoder_X_ix = tf.placeholder(shape=(None, None), dtype=tf.int32)
encoder_X_len = tf.placeholder(shape=(None), dtype=tf.int32)
encoder_timestep = tf.shape(encoder_X_ix)[1]
encoder_X = tf.nn.embedding_lookup(ph_src_embedding, encoder_X_ix)
batchsize = tf.shape(encoder_X_ix)[0]
encoder_Y_ix = tf.placeholder(shape=[None, None],dtype=tf.int32)
encoder_Y_onehot = tf.one_hot(encoder_Y_ix, src_vocab_size)
enc_cell = tf.nn.rnn_cell.LSTMCell(state_size)
enc_initstate = enc_cell.zero_state(batchsize,dtype=tf.float32)
enc_outputs, enc_final_states = tf.nn.dynamic_rnn(enc_cell,encoder_X,encoder_X_len,enc_initstate)
enc_pred = tf.layers.dense(enc_outputs, units=src_vocab_size)
encoder_loss = tf.losses.softmax_cross_entropy(encoder_Y_onehot,enc_pred)
encoder_trainop = tf.train.AdamOptimizer(0.001).minimize(encoder_loss)
with tf.variable_scope('net_decode'):
ph_tgt_embedding = tf.placeholder(dtype=tf.float32, shape=[tgt_vocab_size, tgt_w2v_dim],
name='tgt_vocab_embedding_placeholder')
#tgt_word_emb = tf.Variable(initial_value=ph_tgt_embedding, dtype=tf.float32, trainable=False, name='tgt_vocab_embedding_variable')
decoder_X_ix = tf.placeholder(shape=(None, None), dtype=tf.int32)
decoder_timestep = tf.shape(decoder_X_ix)[1]
decoder_X_len = tf.placeholder(shape=(None), dtype=tf.int32)
decoder_X = tf.nn.embedding_lookup(ph_tgt_embedding, decoder_X_ix)
decoder_Y_ix = tf.placeholder(shape=[None, None],dtype=tf.int32)
decoder_Y_onehot = tf.one_hot(decoder_Y_ix, tgt_vocab_size)
dec_cell = tf.nn.rnn_cell.LSTMCell(state_size)
dec_outputs, dec_final_state = tf.nn.dynamic_rnn(dec_cell,decoder_X,decoder_X_len,enc_final_states)
tile_enc = tf.tile(tf.expand_dims(enc_outputs,1),[1,decoder_timestep,1,1]) # [batchsize,decoder_len,encoder_len,state_size]
tile_dec = tf.tile(tf.expand_dims(dec_outputs, 2), [1, 1, encoder_timestep, 1]) # [batchsize,decoder_len,encoder_len,state_size]
enc_dec_cat = tf.concat([tile_enc,tile_dec],-1) # [batchsize,decoder_len,encoder_len,state_size*2]
weights = tf.nn.softmax(tf.layers.dense(enc_dec_cat,units=1),axis=-2) # [batchsize,decoder_len,encoder_len,1]
weighted_enc = tf.tile(weights, [1, 1, 1, state_size])*tf.tile(tf.expand_dims(enc_outputs,1),[1,decoder_timestep,1,1]) # [batchsize,decoder_len,encoder_len,state_size]
attention = tf.reduce_sum(weighted_enc,axis=2,keepdims=False) # [batchsize,decoder_len,state_size]
dec_attention_cat = tf.concat([dec_outputs,attention],axis=-1) # [batchsize,decoder_len,state_size*2]
dec_pred = tf.layers.dense(dec_attention_cat,units=tgt_vocab_size) # [batchsize,decoder_len,tgt_vocab_size]
pred_ix = tf.argmax(dec_pred,axis=-1) # [batchsize,decoder_len]
decoder_loss = tf.losses.softmax_cross_entropy(decoder_Y_onehot,dec_pred)
total_loss = encoder_loss + decoder_loss
decoder_trainop = tf.train.AdamOptimizer(0.001).minimize(total_loss)
_l0 = tf.summary.scalar('decoder_loss',decoder_loss)
_l1 = tf.summary.scalar('encoder_loss',encoder_loss)
log_all = tf.summary.merge_all()
writer = tf.summary.FileWriter(log_path,graph=tf.get_default_graph())
this is a run down of model parameters size that i can think of so far
encoder cell
=(50*120+120*120+120)*4
=(src_lang_embedding_size*statesize+statesize*statesize+statesize)*(forget gate,remember gate,new state,output gate)
=(kernelsize_for_input+kernelsize_for_previous_state+bias)*(forget gate,remember gate,new state,output gate)
=82080 floats
encoder dense layer
=120*400000
=statesize*src_lang_vocabulary_size
=48000000 floats
decoder cell
=(300*120+120*120+120)*4
=(target_lang_embedding_size*statesize+statesize*statesize+statesize)*(forget gate,remember gate,new state,output gate)
=(kernelsize_for_input+kernelsize_for_previous_state+bias)*(forget gate,remember gate,new state,output gate)
=202080 floats
dense layer that compute attention weights
=(120+120)*1
=(encoder_output_size+decoder_output_size)*(1 unit)
=240 floats
decoder dense layer
=(120+120)*20000
=(attention_vector_size+decoder_outputsize)*target_lang_vocabulary_size
=4800000 floats
summing them all gets 212 MB,but the actual model size is 980 MB.So where is wrong?
You are only computing the number of trainable parameters, these are not the only numbers you need to accommodate in the GPU memory.
You are using Adam optimizer so, you need to store gradients for all your parameters and momentums for all the parameters. This means that you need to store each parameter 3 times, this gives you 636 MB.
Then, you need to accommodate all the intermediate states of the network for the forward and the backward pass.
Let's say have a batch size of b and source and the target length of 50, then you have (at least, I might have something forgotten):
b×l×50 source embeddings,
b×l×300 target embeddings,
b×l×5×120 encoder states,
b×l×400000 encoder logits,
b×l×5×300 decoder states,
b×l×120 intermediate attention states,
b×l×20000 output logits.
This is in total 421970×b×l floats that you need to store for your forward and backward pass.
Btw. source vocabulary 400k is a tremendously large number, I don't believe most of them is frequent enough to learn anything meaningful about them. You should use pre-processing (i.e., SentencePiece) that would reduce your vocabulary to a reasonable size.

How to use previous output and hidden states from LSTM for the attention mechanism?

I am currently trying to code the attention mechanism from this paper: "Effective Approaches to Attention-based Neural Machine Translation", Luong, Pham, Manning (2015). (I use global attention with the dot score).
However, I am unsure on how to input the hidden and output states from the lstm decode. The issue is that the input of the lstm decoder at time t depends on quantities that I need to compute using the output and hidden states from t-1.
Here is the relevant part of the code:
with tf.variable_scope('data'):
prob = tf.placeholder_with_default(1.0, shape=())
X_or = tf.placeholder(shape = [batch_size, timesteps_1, num_input], dtype = tf.float32, name = "input")
X = tf.unstack(X_or, timesteps_1, 1)
y = tf.placeholder(shape = [window_size,1], dtype = tf.float32, name = "label_annotation")
logits = tf.zeros((1,1), tf.float32)
with tf.variable_scope('lstm_cell_encoder'):
rnn_layers = [tf.nn.rnn_cell.LSTMCell(size) for size in [hidden_size, hidden_size]]
multi_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(rnn_layers)
lstm_outputs, lstm_state = tf.contrib.rnn.static_rnn(cell=multi_rnn_cell,inputs=X,dtype=tf.float32)
concat_lstm_outputs = tf.stack(tf.squeeze(lstm_outputs))
last_encoder_state = lstm_state[-1]
with tf.variable_scope('lstm_cell_decoder'):
initial_input = tf.unstack(tf.zeros(shape=(1,1,hidden_size2)))
rnn_decoder_cell = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple = True)
# Compute the hidden and output of h_1
for index in range(window_size):
output_decoder, state_decoder = tf.nn.static_rnn(rnn_decoder_cell, initial_input, initial_state=last_encoder_state, dtype=tf.float32)
# Compute the score for source output vector
scores = tf.matmul(concat_lstm_outputs, tf.reshape(output_decoder[-1],(hidden_size,1)))
attention_coef = tf.nn.softmax(scores)
context_vector = tf.reduce_sum(tf.multiply(concat_lstm_outputs, tf.reshape(attention_coef, (window_size, 1))),0)
context_vector = tf.reshape(context_vector, (1,hidden_size))
# compute the tilda hidden state \tilde{h}_t=tanh(W[c_t, h_t]+b_t)
concat_context = tf.concat([context_vector, output_decoder[-1]], axis = 1)
W_tilde = tf.Variable(tf.random_normal(shape = [hidden_size*2, hidden_size2], stddev = 0.1), name = "weights_tilde", trainable = True)
b_tilde = tf.Variable(tf.zeros([1, hidden_size2]), name="bias_tilde", trainable = True)
hidden_tilde = tf.nn.tanh(tf.matmul(concat_context, W_tilde)+b_tilde) # hidden_tilde is [1*64]
# update for next time step
initial_input = tf.unstack(tf.reshape(hidden_tilde, (1,1,hidden_size2)))
last_encoder_state = state_decoder
# predict the target
W_target = tf.Variable(tf.random_normal(shape = [hidden_size2, 1], stddev = 0.1), name = "weights_target", trainable = True)
logit = tf.matmul(hidden_tilde, W_target)
logits = tf.concat([logits, logit], axis = 0)
logits = logits[1:]
The part inside the loop is what I am unsure of. Does tensorflow remember the computational graph when I overwrite the variable "initial_input" and "last_encoder_state"?
I think your model will be much simplified if you use tf.contrib.seq2seq.AttentionWrapper with one of implementations: BahdanauAttention or LuongAttention.
This way it'll be possible to wire the attention vector on a cell level, so that cell output is already after attention applied. Example from the seq2seq tutorial:
cell = LSTMCell(512)
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention_mechanism, attention_size=256)
Note that this way you won't need a loop of window_size, because tf.nn.static_rnn or tf.nn.dynamic_rnn will instantiate the cells wrapped with attention.
Regarding your question: you should distinguish python variables and tensorflow graph nodes: you can assign last_encoder_state to a different tensor, the original graph node won't change because of this. This is flexible, but can be also misleading in the result network - you might think that you connect an LSTM to one tensor, but it's actually the other. In general, you shouldn't do that.

How to use `sparse_softmax_cross_entropy_with_logits`: without getting Incompatible Shapes Error

I would like to use the sparse_softmax_cross_entropy_with_logits
with the julia TensorFlow wrapper.
The operations is defined in the code here.
Basically, as I understand it the first argument should be logits, that would normally be fed to softmax to get them to be category probabilities (~1hot output).
And the second should be the correct labels as label ids.
I have adjusted the example code from the TensorFlow.jl readme
See below:
using Distributions
using TensorFlow
# Generate some synthetic data
x = randn(100, 50)
w = randn(50, 10)
y_prob = exp(x*w)
y_prob ./= sum(y_prob,2)
function draw(probs)
y = zeros(size(probs))
for i in 1:size(probs, 1)
idx = rand(Categorical(probs[i, :]))
y[i, idx] = 1
end
return y
end
y = draw(y_prob)
# Build the model
sess = Session(Graph())
X = placeholder(Float64)
Y_obs = placeholder(Float64)
Y_obs_lbl = indmax(Y_obs, 2)
variable_scope("logisitic_model", initializer=Normal(0, .001)) do
global W = get_variable("weights", [50, 10], Float64)
global B = get_variable("bias", [10], Float64)
end
L = X*W + B
Y=nn.softmax(L)
#costs = log(Y).*Y_obs #Dense (Orginal) way
costs = nn.sparse_softmax_cross_entropy_with_logits(L, Y_obs_lbl+1) #sparse way
Loss = -reduce_sum(costs)
optimizer = train.AdamOptimizer()
minimize_op = train.minimize(optimizer, Loss)
saver = train.Saver()
# Run training
run(sess, initialize_all_variables())
cur_loss, _ = run(sess, [Loss, minimize_op], Dict(X=>x, Y_obs=>y))
When I run it however, I get an error:
Tensorflow error: Status: Incompatible shapes: [1,100] vs. [100,10]
[[Node: gradients/SparseSoftmaxCrossEntropyWithLogits_10_grad/mul = Mul[T=DT_DOUBLE, _class=[], _device="/job:localhost/replica:0/task:0/cpu:0"](gradients/SparseSoftmaxCrossEntropyWithLogits_10_grad/ExpandDims, SparseSoftmaxCrossEntropyWithLogits_10:1)]]
in check_status(::TensorFlow.Status) at /home/ubuntu/.julia/v0.5/TensorFlow/src/core.jl:101
in run(::TensorFlow.Session, ::Array{TensorFlow.Port,1}, ::Array{Any,1}, ::Array{TensorFlow.Port,1}, ::Array{Ptr{Void},1}) at /home/ubuntu/.julia/v0.5/TensorFlow/src/run.jl:96
in run(::TensorFlow.Session, ::Array{TensorFlow.Tensor,1}, ::Dict{TensorFlow.Tensor,Array{Float64,2}}) at /home/ubuntu/.julia/v0.5/TensorFlow/src/run.jl:143
This only happens when I try to train it.
If I don't include an optimise function/output then it works fine.
So I am doing something that screws up the gradient math.

Avoiding optimization pitfalls when modeling an ordinal predicted variable in PyMC3

I am trying to model an ordinal predicted variable using PyMC3 based on the approach in chapter 23 of Doing Bayesian Data Analysis. I would like to determine a good starting value using find_MAP, but am receiving an optimization error.
The model:
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as tt
# Some helper functions
def cdf(x, location=0, scale=1):
epsilon = np.array(1e-32, dtype=theano.config.floatX)
location = tt.cast(location, theano.config.floatX)
scale = tt.cast(scale, theano.config.floatX)
div = tt.sqrt(2 * scale ** 2 + epsilon)
div = tt.cast(div, theano.config.floatX)
erf_arg = (x - location) / div
return .5 * (1 + tt.erf(erf_arg + epsilon))
def percent_to_thresh(idx, vect):
return 5 * tt.sum(vect[:idx + 1]) + 1.5
def full_thresh(thresh):
idxs = tt.arange(thresh.shape[0] - 1)
thresh_mod, updates = theano.scan(fn=percent_to_thresh,
sequences=[idxs],
non_sequences=[thresh])
return tt.concatenate([[-1 * np.inf, 1.5], thresh_mod, [6.5, np.inf]])
def compute_ps(thresh, location, scale):
f_thresh = full_thresh(thresh)
return cdf(f_thresh[1:], location, scale) - cdf(f_thresh[:-1], location, scale)
# Generate data
real_ps = [0.05, 0.05, 0.1, 0.1, 0.2, 0.3, 0.2]
data = np.random.choice(7, size=1000, p=real_ps)
# Run model
with pm.Model() as model:
mu = pm.Normal('mu', mu=4, sd=3)
sigma = pm.Uniform('sigma', lower=0.1, upper=70)
thresh = pm.Dirichlet('thresh', a=np.ones(5))
cat_p = compute_ps(thresh, mu, sigma)
results = pm.Categorical('results', p=cat_p, observed=data)
with model:
start = pm.find_MAP()
trace = pm.sample(2000, start=start)
When running this, I receive the following error:
Applied interval-transform to sigma and added transformed sigma_interval_ to model.
Applied stickbreaking-transform to thresh and added transformed thresh_stickbreaking_ to model.
Traceback (most recent call last):
File "cm_net_log.v1-for_so.py", line 53, in <module>
start = pm.find_MAP()
File "/usr/local/lib/python3.5/site-packages/pymc3/tuning/starting.py", line 133, in find_MAP
specific_errors)
ValueError: Optimization error: max, logp or dlogp at max have non-finite values. Some values may be outside of distribution support. max: {'thresh_stickbreaking_': array([-1.04298465, -0.48661088, -0.84326554, -0.44833646]), 'sigma_interval_': array(-2.220446049250313e-16), 'mu': array(7.68422528308479)} logp: array(-3506.530143064723) dlogp: array([ 1.61013190e-06, nan, -6.73994118e-06,
-6.93873894e-06, 6.03358122e-06, 3.18954680e-06])Check that 1) you don't have hierarchical parameters, these will lead to points with infinite density. 2) your distribution logp's are properly specified. Specific issues:
My questions:
How can I determine why dlogp is nan at certain points?
Is there a different way that I can express this model to avoid dlogp being nan?
Also worth noting:
This model runs fine if I don't find_MAP and use a Metropolis sampler. However, I'd like to have the flexibility of using other samplers as this model becomes more complex.
I have a suspicion that the issue is due to the relationship between the thresholds and the normal distribution, but I don't know how to disentangle them for the optimization.
Regarding question 2: I expressed the model for the ordinal predicted variable (single group) differently; I used the Theano #as_op decorator for a function that calculates probabilities for the outcomes. That also explains why I cannot use find_MAP() or gradient based samplers: Theano cannot calculate a gradient for the custom function. (http://pymc-devs.github.io/pymc3/notebooks/getting_started.html#Arbitrary-deterministics)
# Number of outcomes
nYlevels = df.Y.cat.categories.size
thresh = [k + .5 for k in range(1, nYlevels)]
thresh_obs = np.ma.asarray(thresh)
thresh_obs[1:-1] = np.ma.masked
#as_op(itypes=[tt.dvector, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def outcome_probabilities(theta, mu, sigma):
out = np.empty(nYlevels)
n = norm(loc=mu, scale=sigma)
out[0] = n.cdf(theta[0])
out[1] = np.max([0, n.cdf(theta[1]) - n.cdf(theta[0])])
out[2] = np.max([0, n.cdf(theta[2]) - n.cdf(theta[1])])
out[3] = np.max([0, n.cdf(theta[3]) - n.cdf(theta[2])])
out[4] = np.max([0, n.cdf(theta[4]) - n.cdf(theta[3])])
out[5] = np.max([0, n.cdf(theta[5]) - n.cdf(theta[4])])
out[6] = 1 - n.cdf(theta[5])
return out
with pm.Model() as ordinal_model_single:
theta = pm.Normal('theta', mu=thresh, tau=np.repeat(.5**2, len(thresh)),
shape=len(thresh), observed=thresh_obs, testval=thresh[1:-1])
mu = pm.Normal('mu', mu=nYlevels/2.0, tau=1.0/(nYlevels**2))
sigma = pm.Uniform('sigma', nYlevels/1000.0, nYlevels*10.0)
pr = outcome_probabilities(theta, mu, sigma)
y = pm.Categorical('y', pr, observed=df.Y.cat.codes.as_matrix())
http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%2023.ipynb