I'm running an mLSTM (multiplicative LSTM) transform based on OpenAI's mLSTM (just the transform; the model is already trained), but it takes a really long time to transform more than ~100,000 docs.
I want it to run on multiple GPUs. I have seen some examples, but I have no idea how to apply them to this mLSTM transform code.
The specific part that I want to run on multiple GPUs is:
def transform(xs):
    tstart = time.time()
    xs = [preprocess(x) for x in xs]
    lens = np.asarray([len(x) for x in xs])
    sorted_idxs = np.argsort(lens)
    unsort_idxs = np.argsort(sorted_idxs)
    sorted_xs = [xs[i] for i in sorted_idxs]
    maxlen = np.max(lens)
    offset = 0
    n = len(xs)
    smb = np.zeros((2, n, hps.nhidden), dtype=np.float32)
    for step in range(0, ceil_round_step(maxlen, nsteps), nsteps):
        start = step
        end = step+nsteps
        xsubseq = [x[start:end] for x in sorted_xs]
        ndone = sum([x == b'' for x in xsubseq])
        offset += ndone
        xsubseq = xsubseq[ndone:]
        sorted_xs = sorted_xs[ndone:]
        nsubseq = len(xsubseq)
        xmb, mmb = batch_pad(xsubseq, nsubseq, nsteps)
        for batch in range(0, nsubseq, nbatch):
            start = batch
            end = batch+nbatch
            batch_smb = seq_rep(
                xmb[start:end], mmb[start:end],
                smb[:, offset+start:offset+end, :])
            smb[:, offset+start:offset+end, :] = batch_smb
    features = smb[0, unsort_idxs, :]
    print('%0.3f seconds to transform %d examples' %
          (time.time() - tstart, n))
    return features
This is just a snippet of the full code (I don't think it's OK to copy the entire code here).

The part you're referring to is not the place that splits the computation across GPUs; it only transforms the data (on a CPU!) and runs the session.
The correct place is the code that defines the computational graph, e.g. the mlstm method. There are many ways to split a graph, e.g. placing LSTM cells on different GPUs, so that the input sequence can be processed in parallel:
def mlstm(inputs, c, h, M, ndim, scope='lstm', wn=False):
    for idx, x in enumerate(inputs):
        # Round-robin over the available GPUs, one timestep per device.
        with tf.device('/gpu:' + str(idx % GPU_COUNT)):
            m = tf.matmul(x, wmx) * tf.matmul(h, wmh)
            z = tf.matmul(x, wx) + tf.matmul(m, wh) + b
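The round-robin placement in the loop above can be illustrated without TensorFlow. This is only a minimal sketch: the value of `GPU_COUNT` and the helper `device_for_step` are assumed names for illustration, not part of the original mLSTM code.

```python
# Hypothetical helper: maps a timestep index to a device string,
# mirroring the modulo expression used in the snippet above.
GPU_COUNT = 2  # assumed number of available GPUs


def device_for_step(idx, gpu_count=GPU_COUNT):
    """Return the device string for timestep `idx`, cycling over GPUs."""
    return '/gpu:' + str(idx % gpu_count)


# Timesteps 0..4 alternate between the two devices.
placements = [device_for_step(idx) for idx in range(5)]
print(placements)
```

With two GPUs, even timesteps land on `/gpu:0` and odd timesteps on `/gpu:1`.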
By the way, TensorFlow has a useful config option, log_device_placement, which prints the device each op was placed on. Here's an example:
import tensorflow as tf
# Creates a graph.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='b')
    c = tf.add(a, b)
# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Runs the op.
    print(sess.run(c))
    # Prints the following:
    # Device mapping:
    # /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: <GPU name>, pci bus id: 0000:01:00.0, compute capability: 6.1
    # Add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
    # b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
    # a: (Const): /job:localhost/replica:0/task:0/device:GPU:0


Slow computation on google colab while solving partial differential equation

I'm using Google Colab to solve the homogeneous heat equation. I had written a program earlier with scipy using sparse matrices, which worked up to N = 10 (hyperparameter), but I need to run it for something like N = 4... 1000, and that won't work on my PC. I therefore converted the code to TensorFlow, but here I'm unable to use sparse matrices as I could in scipy, and even the GPU/TPU computation is slow, slower than on my PC. The problems I'm facing in the code and need solutions for:
1) tf.contrib has been removed, so I have to use an older version of TensorFlow for the odeint function. Where is it in 2.0?
2) It would be good if the computation could use sparse matrices, since the matrices are tridiagonal. I know about the sparse_dense_mul() function, but it returns a dense tensor, so it wouldn't do the job. The "func" function applies time-independent boundary conditions and then requires multiplying an (n x n) matrix by an (n x 1) vector, which gives an (n x 1) vector, with multiple matrices.
Also, the program was running faster before I created the class.
It also gives this warning:
WARNING: Logging before flag parsing goes to stderr.
W0829 09:12:24.415445 139855355791232]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W0829 09:12:24.645356 139855355791232] From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/integrate/python/ops/ div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
This happens when I run the code with the loop over range(2, 10): tqdm does not display and the cell keeps running forever. It works fine for range(2, 5), where the tqdm bar does appear.
#find a way to use sparse matrices
class Heat:
def __init__(self, N):
self.N = N
self.H = 1/N
self.A = ts.to_dense(ts.SparseTensor(indices=[[0, 0], [0, 1]] + \
[[i, i+j] for i in range(1, N) for j in [-1, 0, 1]] +[[N, N-1], [N, N]],
values=self.H*np.array([1/3, 1/6] + [1/6, 2/3, 1/6]*(N-1) + [1/6, 1/3], dtype=np.float32),
dense_shape=(N+1, N+1 )))
self.D = ts.to_dense(ts.SparseTensor(indices=[[0, 0], [0, 1]] + [[i, i+j] \
for i in range(1, N) for j in [-1, 0, 1]] +[[N, N-1], [N, N]],
values=N*np.array([1-(1), -1 -(-1)] + [-1, 2, -1]*(N-1) + [-1-(-1), 1-(1)], dtype=np.float32),
dense_shape=(N+1, N+1)))
self.domain = tf.linspace(0.0, 1.0, N+1)
def f(k):
if k == 0:
return (1 + math.pi**2)*(math.pi*self.H - math.sin(math.pi*self.H))/(math.pi**2*self.H)
elif k == N:
return -(1 + math.pi**2)*(-math.pi*self.H + math.sin(math.pi*self.H))/(math.pi**2*self.H)
return -2*(1 + math.pi**2)*(math.cos(math.pi*self.H) - 1)*math.sin(math.pi*self.H*k)/(math.pi**2*self.H)
self.F = tf.constant([f(k) for k in range(N+1)], shape=(N+1,), dtype=tf.float32) #caution! shape changed caution caution 1, N+1(problem) is different from N+1,
self.exact = tm.scalar_mul(scalar=np.exp(1), x=tf.sin(math.pi*self.domain))
def error(self):
return np.linalg.norm(self.exact.numpy() - self.approx, 2)
def func (self, y, t):
y = tf.Variable(y)
y = y[0].assign(0.0)
y = y[self.N].assign(0.0)
if self.N**2> 100:
y_dash = tl.matvec(tf.linalg.inv(self.A), tl.matvec(a=tm.negative(self.D), b=y, a_is_sparse=True) + tm.scalar_mul(scalar=math.exp(t), x=self.F)) #caution! shape changed F is (1, N+1) others too
y_dash = tl.matvec(tf.linalg.inv(self.A), tl.matvec(a=tm.negative(self.D), b=y) + tm.scalar_mul(scalar=math.exp(t), x=self.F)) #caution! shape changed F is (1, N+1) others too
y_dash = tf.Variable(y_dash) #!!y_dash performs Hadamard product like multiplication not matrix-like multiplication;returns 2-D
y_dash = y_dash[0].assign(0.0)
y_dash = y_dash[self.N].assign(0.0)
return y_dash
def algo_1(self):
self.approx = tf.contrib.integrate.odeint(
y0=tf.sin(tm.scalar_mul(scalar=math.pi, x=self.domain)),
t=tf.constant([0.0, 1.0]),
def algo_2(self):
self.approx = tf.contrib.integrate.odeint_fixed(
y0=tf.sin(tm.scalar_mul(scalar=math.pi, x=self.domain)),
t=tf.constant([0.0, 1.0]),
dt=tf.constant([self.H**2], dtype=tf.float32),
df = pd.DataFrame(columns=["NumBasis", "Errors"])
Ns = [2**r for r in range(2, 10)]
l =[]
for i in tqdm_notebook(Ns):
heateqn = Heat(i)
l.append([i, heateqn.error()])
df.append({"NumBasis":i, "Errors":heateqn.error()}, ignore_index=True)

tensorflow giving nans when calculating gradient with sparse tensors

The following snippet is from a fairly large piece of code but hopefully I can give all the information necessary:
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
y1 and y2 are 128x30 and ymask is 30x30. ystar is 128x30. dist is 1x30. When ymask is the identity matrix, everything works fine. But when I set it to be all zeros, apart from a single 1 along the diagonal (so as to set all columns but one in y2 to be zero), I get nans for the gradient of dist with respect to y2, using tf.gradients(dist, [y2]). The specific value of dist is [0,0,7.9,0,...], with all the ystar-y2 values being around the range (-1,1) in the third column and zero elsewhere.
I'm pretty confused as to why a numerical issue would occur here, given there are no logs or divisions. Is this underflow? Am I missing something in the maths?
For context, I'm doing this to try to train individual dimensions of y, one at a time, using the whole network.
A longer version to reproduce:
import tensorflow as tf
import numpy as np
import pandas as pd
batchSize = 128
eta = 0.8
tasks = 30
imageSize = 32**2
groups = 3
tasksPerGroup = 10
trainDatapoints = 10000
w = np.zeros([imageSize, groups * tasksPerGroup])
toyIndex = 0
for toyLoop in range(groups):
    m = np.ones([imageSize]) * np.random.randn(imageSize)
    for taskLoop in range(tasksPerGroup):
        w[:, toyIndex] = m * 0.1 * np.random.randn(1)
        toyIndex += 1
xRand = np.random.normal(0, 0.5, (trainDatapoints, imageSize))
taskLabels = np.matmul(xRand, w) + np.random.normal(0,0.5,(trainDatapoints, groups * tasksPerGroup))
DF = np.concatenate((xRand, taskLabels), axis=1)
trainDF = pd.DataFrame(DF[:trainDatapoints, ])
# define graph variables
x = tf.placeholder(tf.float32, [None, imageSize])
W = tf.Variable(tf.zeros([imageSize, tasks]))
b = tf.Variable(tf.zeros([tasks]))
ystar = tf.placeholder(tf.float32, [None, tasks])
ymask = tf.placeholder(tf.float32, [tasks, tasks])
dataLength = tf.cast(tf.shape(ystar)[0],dtype=tf.float32)
y1 = tf.matmul(x, W) + b
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
mse = tf.reciprocal(dataLength) * tf.reduce_mean(tf.square(dist))
grads = tf.gradients(dist, [y2])
trainStep = tf.train.GradientDescentOptimizer(eta).minimize(mse)
# build graph
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
randTask = np.random.randint(0, 9)
ymaskIn = np.zeros([tasks, tasks])
ymaskIn[randTask, randTask] = 1
batch = trainDF.sample(batchSize)
batch_xs = batch.iloc[:, :imageSize]
batch_ys = np.zeros([batchSize, tasks])
batch_ys[:, randTask] = batch.iloc[:, imageSize + randTask]
gradOut = sess.run(grads, feed_dict={x: batch_xs, ystar: batch_ys, ymask: ymaskIn})
sess.run(trainStep, feed_dict={x: batch_xs, ystar: batch_ys, ymask: ymaskIn})
Here's a very simple reproduction:
import tensorflow as tf

with tf.Graph().as_default():
    y = tf.zeros(shape=[1], dtype=tf.float32)
    dist = tf.norm(y, axis=0)
    (grad,) = tf.gradients(dist, [y])
    with tf.Session():
        print(grad.eval())
Prints:
[ nan]
The issue is that tf.norm computes sum(x**2)**0.5. The gradient is x / sum(x**2)**0.5, so when sum(x**2) is zero we're dividing by zero.
There's not much to be done in terms of a special case: the gradient as x approaches all zeros depends on which direction it's approaching from. For example if x is a single-element vector, the limit as x approaches 0 could either be 1 or -1 depending on which side of zero it's approaching from.
So in terms of solutions, you could just add a small epsilon:
import tensorflow as tf

def safe_norm(x, epsilon=1e-12, axis=None):
    return tf.sqrt(tf.reduce_sum(x ** 2, axis=axis) + epsilon)

with tf.Graph().as_default():
    y = tf.constant([0.])
    dist = safe_norm(y, axis=0)
    (grad,) = tf.gradients(dist, [y])
    with tf.Session():
        print(grad.eval())
Prints:
[ 0.]
Note that this is not actually the Euclidean norm. It's a good approximation as long as the input is much larger than epsilon.
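To see the effect of the epsilon without running TensorFlow, here is a minimal sketch in plain Python that evaluates the two analytic gradients directly. The function names `norm_grad` and `safe_norm_grad` are made up for this illustration, and the NaN is returned explicitly because plain Python raises an exception on 0.0 / 0.0 instead of producing NaN.

```python
import math


def norm_grad(x):
    """Gradient of sqrt(sum(x_i**2)) with respect to x: x / ||x||."""
    norm = math.sqrt(sum(v * v for v in x))
    # Python raises ZeroDivisionError on 0.0 / 0.0, so return NaN explicitly
    # to match the undefined gradient at the origin.
    return [v / norm if norm != 0.0 else float('nan') for v in x]


def safe_norm_grad(x, epsilon=1e-12):
    """Gradient of sqrt(sum(x_i**2) + epsilon) with respect to x."""
    denom = math.sqrt(sum(v * v for v in x) + epsilon)
    return [v / denom for v in x]


zero_grad = norm_grad([0.0])
safe_zero_grad = safe_norm_grad([0.0])
print(zero_grad)       # NaN entry: the gradient is undefined at zero
print(safe_zero_grad)  # [0.0]

# Away from zero the two gradients agree closely.
g_plain = norm_grad([3.0, 4.0])
g_safe = safe_norm_grad([3.0, 4.0])
print(g_plain, g_safe)
```

The epsilon only changes the result noticeably when the squared norm is comparable to epsilon itself.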

Reevaluate dependencies of a while loop

I am trying to understand how while loops work in TensorFlow. In particular, I have a variable, x say, that I update in the while loop, and I have some values that depend on x, but when running the while loop those values do not seem to be updated when x changes.
The following code, in which I have tried to implement a simple gradient descent optimizer, might illustrate what I mean:
import tensorflow as tf
x = tf.Variable(initial_value=4, dtype=tf.float32, trainable=False)
y = tf.multiply(x,x)
grad = tf.gradients(y, x)
def update_g():
    with tf.control_dependencies(grad):
        return tf.identity(grad[0])
iterations = tf.placeholder(tf.int32)
i = tf.constant(0, dtype=tf.int32)
g = tf.Variable(initial_value=grad[0], dtype=tf.float32, trainable=False)
c = lambda i_loop, x_loop, g_loop: i_loop < iterations
b = lambda i_loop, x_loop, g_loop: [i_loop+1, tf.assign(x, x_loop - 10*g_loop), update_g()]
l = tf.while_loop(c, b, [i, x, g], back_prop=False, parallel_iterations=1)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res_g = sess.run(grad)
    res_l = sess.run(l, feed_dict={iterations: 10})
    res_x = sess.run(x)
Running this on tensorflow 1.0 gives this result for me:
[10, -796.0, 8.0]
and the issue is that the value of the gradient is not updated as x changes.
I have tried various variations on the above code, but cannot seem to find a version that works. Basically, my question is whether the above can be made to work, or whether I need to rethink the approach.
(Maybe I should add that I am not interested in writing a gradient descent optimizer; I just built this to have something simple and understandable to work with.)
With some help from the other answer I managed to get this working. Posting the complete code here as a second answer:
import tensorflow as tf
x = tf.constant(4, dtype=tf.float32)
y = tf.multiply(x,x)
grad = tf.gradients(y, x)
def loop_grad(x_loop):
    # Rebuild y from the loop variable so the gradient follows its current value.
    y2 = tf.multiply(x_loop, x_loop)
    return tf.gradients(y2, x_loop)[0]
iterations = tf.placeholder(tf.int32)
i = tf.constant(0, dtype=tf.int32)
c = lambda i_loop, x_loop: i_loop < iterations
b = lambda i_loop, x_loop: [i_loop+1, x_loop - 0.1*loop_grad(x_loop)]
l = tf.while_loop(c, b, [i, x], back_prop=False, parallel_iterations=1)
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.05)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    res_g = sess.run(grad)
    res_l = sess.run(l, feed_dict={iterations: 100000})
    res_x = sess.run(x)
Changing the learning rate from the code in the question and increasing the number of iterations gives the output:
[100000, 5.1315068e-38]
Which seems to be working. It runs reasonably fast even with high iteration count, so there does not seem to be something really horrible going on with updating the graph in each iteration of the while loop, a fear of which probably was one reason why I didn't opt for this approach from the start.
Having tf.Variable objects as loop variables for while loops is not supported, and will behave in weird nondeterministic ways. Always use tf.assign and friends to update the value of a tf.Variable.
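The difference between the two versions can be sketched in plain Python, without TensorFlow. This is an illustration under simplifying assumptions: the gradient of y = x * x is written analytically as 2 * x, and the names `descend_stale` and `descend_fresh` are made up for this example. The stale variant computes the gradient once before the loop, which is effectively what the code in the question does; the fresh variant recomputes it from the loop value on every iteration, as the working answer does.

```python
def grad_of_square(x):
    """Analytic gradient of y = x * x."""
    return 2.0 * x


def descend_stale(x0, lr, iterations):
    """Gradient is computed once, outside the loop, and never refreshed."""
    x = x0
    g = grad_of_square(x0)  # fixed at the initial point
    for _ in range(iterations):
        x = x - lr * g
    return x


def descend_fresh(x0, lr, iterations):
    """Gradient is recomputed from the current loop value each iteration."""
    x = x0
    for _ in range(iterations):
        x = x - lr * grad_of_square(x)
    return x


stale = descend_stale(4.0, 10.0, 10)
fresh = descend_fresh(4.0, 0.1, 10)
print(stale)  # -796.0, the same x value reported in the question
print(fresh)  # shrinks toward 0 by a factor of 0.8 per step
```

With the stale gradient of 8.0 and a step size of 10, each iteration subtracts 80, so ten iterations take x from 4 to -796.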

Tensorflow with Buckets Error

I'm trying to train a sequence to sequence model using tensorflow. I see that in the tutorials, buckets help speed up training. So far I'm able to train using just one bucket, and also using just one gpu and multiple buckets using more or less out of the box code, but when I try to use multiple buckets with multiple gpus, I get an error stating
Invalid argument: You must feed a value for placeholder tensor 'gpu_scope_0/encoder50_gpu0' with dtype int32
From the error, I can tell that I'm not declaring the input_feed correctly, so it is expecting the input to be of the size of the largest bucket every time. I'm confused about why this is the case, though, because in the examples that I'm adapting, it does the same thing when initializing the placeholders for the input_feed. As far as I can tell, the tutorials also initialize up to the largest sized bucket, but this error doesn't happen when I use the tutorials' code.
The following is what I think is the relevant initialization code:
self.encoder_inputs = [[] for _ in xrange(self.num_gpus)]
self.decoder_inputs = [[] for _ in xrange(self.num_gpus)]
self.target_weights = [[] for _ in xrange(self.num_gpus)]
self.scope_prefix = "gpu_scope"
for j in xrange(self.num_gpus):
with tf.device("/gpu:%d" % (self.gpu_offset + j)):
with tf.name_scope('%s_%d' % (self.scope_prefix, j)) as scope:
for i in xrange(buckets[-1][0]): # Last bucket is the biggest one.
self.encoder_inputs[j].append(tf.placeholder(tf.int32, shape=[None],
for i in xrange(buckets[-1][1] + 1):
self.decoder_inputs[j].append(tf.placeholder(tf.int32, shape=[None],
self.target_weights[j].append(tf.placeholder(tf.float32, shape=[None],
# Our targets are decoder inputs shifted by one.
self.losses = []
self.outputs = []
# The following loss computation creates the neural network. The specified
# device hosts the trainable tf parameters.
bucket = buckets[0]
i = 0
with tf.device(param_device):
output, loss = tf.nn.seq2seq.model_with_buckets(self.encoder_inputs[i], self.decoder_inputs[i],
[self.decoder_inputs[i][k + 1] for k in
xrange(len(self.decoder_inputs[i]) - 1)],
self.target_weights[0], buckets,
lambda x, y: seq2seq_f(x, y, True),
bucket = buckets[0]
self.encoder_states = []
with tf.device('/gpu:%d' % self.gpu_offset):
with variable_scope.variable_scope(variable_scope.get_variable_scope(),
self.encoder_outputs, self.encoder_states = get_encoder_outputs(self,
if not forward_only:
self.grads = []
print ("past line 297")
done_once = False
for i in xrange(self.num_gpus):
with tf.device("/gpu:%d" % (self.gpu_offset + i)):
with tf.name_scope("%s_%d" % (self.scope_prefix, i)) as scope:
with variable_scope.variable_scope(variable_scope.get_variable_scope(), reuse=True):
#for j, bucket in enumerate(buckets):
output, loss = tf.nn.seq2seq.model_with_buckets(self.encoder_inputs[i],
[self.decoder_inputs[i][k + 1] for k in
xrange(len(self.decoder_inputs[i]) - 1)],
self.target_weights[i], buckets,
lambda x, y: seq2seq_f(x, y, True),
# Training outputs and losses.
if forward_only:
self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
self.encoder_inputs, self.decoder_inputs,
[self.decoder_inputs[0][k + 1] for k in xrange(buckets[0][1])],
self.target_weights, buckets, lambda x, y: seq2seq_f(x, y, True),
# If we use output projection, we need to project outputs for decoding.
if self.output_projection is not None:
for b in xrange(len(buckets)):
self.outputs[b] = [
tf.matmul(output, self.output_projection[0]) + self.output_projection[1]
for output in self.outputs[b]
self.bucket_grads = []
self.gradient_norms = []
params = tf.trainable_variables()
opt = tf.train.GradientDescentOptimizer(self.learning_rate)
self.updates = []
with tf.device(aggregation_device):
for g in xrange(self.num_gpus):
for b in xrange(len(buckets)):
gradients = tf.gradients(self.losses[g][b], params)
clipped_grads, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
opt.apply_gradients(zip(clipped_grads, params), global_step=self.global_step))
and the following is the relevant code when feeding in data:
input_feed = {}
for i in xrange(self.num_gpus):
for l in xrange(encoder_size):
input_feed[self.encoder_inputs[i][l].name] = encoder_inputs[i][l]
for l in xrange(decoder_size):
input_feed[self.decoder_inputs[i][l].name] = decoder_inputs[i][l]
input_feed[self.target_weights[i][l].name] = target_weights[i][l]
# Since our targets are decoder inputs shifted by one, we need one more.
last_target = self.decoder_inputs[i][decoder_size].name
input_feed[last_target] = np.zeros([self.batch_size], dtype=np.int32)
last_weight = self.target_weights[i][decoder_size].name
input_feed[last_weight] = np.zeros([self.batch_size], dtype=np.float32)
# Output feed: depends on whether we do a backward step or not.
if not forward_only:
output_feed = [self.updates[bucket_id], self.gradient_norms[bucket_id], self.losses[bucket_id]]
output_feed = [self.losses[bucket_id]] # Loss for this batch.
for l in xrange(decoder_size): # Output logits.
Right now I'm considering just padding every input up to the bucket size, but I expect that this would lose some of the advantages of bucketing.
It turns out the issue was not in the feeding of the placeholders, but later on in my code, where I referred to placeholders that weren't initialized. As far as I can tell, this error stopped once I fixed those later issues.

Interpreting Tensorflow/Tensorboard "subtraction" operation

The following is code adapted from a simple learning example, that I have bent out of shape to understand the Tensorboard graph visualizations:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
# Create 10 phony x, y data points in NumPy, y = x * 0.1 + 0.3
x_data = np.random.rand(10).astype("float32")
y_data = x_data * 0.1 + 0.3
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0, name = "internal_W"), name = "external_W")
b = tf.Variable(2*tf.zeros([1], name = "internal_b"), name = "doubled_b")
y = (W * x_data + b)
l1 = (y - y_data)
l2 = (y_data - y )
writer = tf.train.SummaryWriter("/tmp/test1", sess.graph_def)
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
A sample output of the print statements is:
[ 0.84253538 0.31011301 0.11627766 0.35491142 0.65550905 0.1798114
0.13632762 0.02010157 0.42960873 0.04218956]
[ 0.39195824 0.33384719 0.31269109 0.33873668 0.37154531 0.31962547
0.31487945 0.302194 0.3468895 0.30460477]
[ 0.45057714 -0.02373418 -0.19641343 0.01617473 0.28396374 -0.13981406
-0.17855182 -0.28209242 0.08271924 -0.2624152 ]
[-0.45057714 0.02373418 0.19641343 -0.01617473 -0.28396374 0.13981406
0.17855182 0.28209242 -0.08271924 0.2624152 ]
Clearly, the subtractions are working properly-- the inputs to the subtraction are in different order, and yield different outputs. However, the graph visualization is:
Notice the "Sub" operators, which appear not to reverse the order of the operands as the code does. (Highlighting either operator yields no additional insight.) Am I missing something obvious, or do the node visualizations completely obscure order of operands?
After futzing around with this, my considered answer to my own question is, "Yes, this is working as intended." The inputs to the nodes show only what the inputs are, not any particular relationships to the operation or the node or themselves; indeed, if one added a variable to itself in an operation node, the input variable would show up only once.
This is not a design choice I would have made, but that does seem to be the intent.
I still encourage others who may have more insight to comment or fully answer.