I have a piece of tensorflow code which use control_flow_ops.cond to select which result to use:
import tensorflow as tf
import numpy as np
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.client import timeline
import time
with tf.device('/cpu:0'):
a_arr = []
b = tf.Variable(tf.random_normal([1400, 5600]))
c_arr = []
d = tf.Variable(tf.zeros([1, 5600]))
e_arr = []
x = tf.placeholder(tf.int32, [250])
y = tf.placeholder(tf.int32, [250])
tf.scalar_summary('max/x', tf.reduce_max(x))
for i in range(0, 250):
a_arr.append(tf.Variable(tf.random_normal([1, 1400])))
#c = tf.matmul(a_arr[i], b)
**c = control_flow_ops.cond(x[i] < y[i], lambda: tf.matmul(a_arr[i], b), lambda:d)**
e_arr.append(c)
summary = tf.merge_all_summaries()
e_arr.append(summary)
init = tf.initialize_all_variables()
with tf.Session() as sess:
train_writer = tf.train.SummaryWriter('tensor_summary/train',
sess.graph)
sess.run(init)
xi = [ 1 for i in range(0, 250) ]
yi = [ 0 for i in range(0, 250) ]
print(np.sum(xi < yi))
for i in range(1000):
time_s = time.time()
out_arr = sess.run(e_arr, feed_dict={x:xi, y:yi})
train_writer.add_summary(out_arr[-1], 1)
time_e = time.time()
print('duration = %f' %(time_e - time_s))
Here tf.MatMul should not be executed, but it is actually executed, I run it on tensorflow 0.10.0, and on 32 core CPU, which use more than 900 CPU, and the execution time is 13ms, saving timeline data shows tf.MatMul is also executed.
This is a test case to test tensorflow control_flow_ops.cond, which is also used in bidirectional_rnn.
How could avoid executing tf.MatMul in this case while still make use of control_flow_ops.cond to dynamically select one out of the two results?
Is there any settings?
The MatMuls are not really executed, even though the timeline includes them. It is a bit confusing and we will consider to remove them. If you time both with and without cond, you should be able to see the difference.
Related
I try to generate independent random variable with tensorflow distributed code on different GPU.
I use the split method on the random generator to generate n subgenerators for my n GPU.
Then i'd like to run some code distributed with each GPU using its own subgenerator.
The code below is an example of what i'd like to do. I fails on GPU.
import numpy as np
import tensorflow as tf
import time
import sys, os
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--n_gpus', type=int, default=1)
args = parser.parse_args()
n_gpus = args.n_gpus
device_type = "GPU"
devices = tf.config.experimental.list_physical_devices(
device_type)
devices_names = [d.name.split('e:')[1] for d in devices]
strategy = tf.distribute.MirroredStrategy( devices=devices_names[:n_gpus])
with strategy.scope():
optimizerControl= tf.keras.optimizers.Adam(learning_rate = 1e-3)
modelControl = tf.keras.Sequential([tf.keras.layers.Dense(8, activation = tf.nn.relu),
tf.keras.layers.Dense(1 )])
#tf.function
def cal( locGen, nbSimul, modelControl):
x= locGen.normal( [nbSimul])
return tf.reduce_sum(tf.square(modelControl(tf.expand_dims(x, axis=-1))[:,0]-tf.square(x)))
def train_step(newGen, nbSimul, modelControl, optimizerControl):
i = tf.distribute.get_replica_context().replica_id_in_sync_group
print("Devic run", i)
with tf.GradientTape() as tape:
loss = cal( newGen[i], nbSimul, modelControl)
gradients = tape.gradient(loss, modelControl.trainable_variables)
optimizerControl.apply_gradients(zip(gradients, modelControl.trainable_variables))
return loss
def distributed_train_step(newGen,nbSimul, modelControl, optimizerControl):
per_replica_losses = strategy.run(train_step, args=(newGen,int(nbSimul/n_gpus), modelControl, optimizerControl,))
return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
axis=None)/nbSimul
gen = tf.random.Generator.from_seed(1)
newGen = gen.split(n_gpus)
batchSize=10
for epoch in range(10):
valTest = distributed_train_step(newGen,batchSize,modelControl,optimizerControl)
If someone has any idea..
Thank you
I need to train a GPR model in multiple batches per epoch using a custom loss function. I would like to do this using GPflow and I would like to compile my training using tf.function to increase the efficiency. However, gpflow.GPR must be re-instantiated each time you supply new data, so tf.function will have to re-trace each time. This makes the code slower rather than faster.
This is the initial setup:
import numpy as np
from itertools import islice
import tensorflow as tf
import tensorflow_probability as tfp
tfb = tfp.bijectors
from sklearn.model_selection import train_test_split
import gpflow
from gpflow.kernels import SquaredExponential
import time
data_size = 1000
train_fract = 0.8
batch_size = 250
n_epochs = 3
iterations_per_epoch = int(train_fract * data_size/batch_size)
tf.random.set_seed(3)
# Generate dummy data
x = np.arange(data_size)
y = np.arange(data_size) + np.random.rand(data_size)
# Slice into train and validate sets
x_train, x_validate, y_train, y_validate = train_test_split(x, y, random_state = 1, test_size = 1-train_fract )
# Convert data into tensorflow constants
x_train = tf.constant(x_train[:, np.newaxis], dtype=np.float64)
x_validate = tf.constant(x_validate[:, np.newaxis], dtype=np.float64)
y_train = tf.constant(y_train[:, np.newaxis], dtype=np.float64)
y_validate = tf.constant(y_validate[:, np.newaxis], dtype=np.float64)
# Batch data
batched_dataset = (
tf.data.Dataset.from_tensor_slices((x_train, y_train))
.shuffle(buffer_size=len(x_train), seed=1)
.repeat(count=None)
.batch(batch_size)
)
# Create kernel
constrain_positive = tfb.Shift(np.finfo(np.float64).tiny)(tfb.Exp())
amplitude = tfp.util.TransformedVariable(initial_value=1, bijector=constrain_positive, dtype=np.float64, name="amplitude")
len_scale = tfp.util.TransformedVariable(initial_value=10, bijector=constrain_positive, dtype=np.float64, name="len_scale")
kernel = SquaredExponential(variance=amplitude, lengthscales=len_scale, name="squared_exponential_kernel")
obs_noise = tfp.util.TransformedVariable(initial_value=1e-3, bijector=constrain_positive, dtype=np.float64, name="observation_noise")
# Define custom loss function
#tf.function(autograph=False, experimental_compile=False)
def my_custom_loss(y_predict, y_true):
return tf.math.reduce_mean(tf.math.squared_difference(y_predict, y_true))
#optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
This is how I train without a tf.function:
gpr_model_j_i = gpflow.models.GPR(data=(x_train, y_train), kernel=kernel, noise_variance=obs_noise)
# Start training loop
for j in range(n_epochs):
for i, (x_train_j_i, y_train_j_i) in enumerate(islice(batched_dataset, iterations_per_epoch)):
with tf.GradientTape() as tape:
gpr_model_j_i = gpflow.models.GPR(data=(x_train_j_i, y_train_j_i), kernel=kernel, noise_variance=gpr_model_j_i.likelihood.variance)
y_predict_j_i = gpr_model_j_i.predict_f(x_validate)[0]
loss_j_i = my_custom_loss(y_predict_j_i, y_validate)
grads_j_i = tape.gradient(loss_j_i, gpr_model_j_i.trainable_variables)
optimizer.apply_gradients(zip(grads_j_i, gpr_model_j_i.trainable_variables))
This is how I train with a tf.function:
#tf.function(autograph=False, experimental_compile=False)
def tf_function_attempt_3(model): #, optimizer):
with tf.GradientTape() as tape:
y_predict_j_i = model.predict_f(x_validate)[0]
loss_j_i = my_custom_loss(y_predict_j_i, y_validate)
grads_j_i = tape.gradient(loss_j_i, model.trainable_variables)
optimizer.apply_gradients(zip(grads_j_i, model.trainable_variables))
print("TRACING...", end="")
for j in range(n_epochs):
for i, (x_train_j_i, y_train_j_i) in enumerate(islice(batched_dataset, iterations_per_epoch)):
gpr_model_j_i = gpflow.models.GPR(data=(x_train_j_i, y_train_j_i), kernel=kernel, noise_variance=gpr_model_j_i.likelihood.variance)
tf_function_attempt_3(gpr_model_j_i)#, optimizer)
The tf.function retraces for each batch and is significantly slower than the normal training.
Is there a way to speed up the batched training of my GPR model with tf.function while using a custom loss function and GPflow? If not, I am open to suggestions for an alternative approach.
You don't have to re-instantiate GPR each time. You can construct tf.Variable holders with unconstrained shape and then .assign to them:
import gpflow
import numpy as np
import tensorflow as tf
input_dim = 1
initial_x, initial_y = np.zeros((0, input_dim)), np.zeros((0, 1)) # or your first batch
x_var = tf.Variable(initial_x, shape=(None, input_dim), dtype=tf.float64)
y_var = tf.Variable(initial_y, shape=(None,1), dtype=tf.float64)
# in principle you could also set shape=(None, None)...
m = gpflow.models.GPR((x_var, y_var), gpflow.kernels.SquaredExponential())
loss = m.training_loss_closure() # compile=True default wraps in tf.function()
N1 = 3
x1, y1 = np.random.randn(N1, input_dim), np.random.randn(N1, 1)
m.data[0].assign(x1)
m.data[1].assign(y1)
loss() # traces the first time
N2 = 7
x2, y2 = np.random.randn(N2, input_dim), np.random.randn(N2, 1)
m.data[0].assign(x2)
m.data[1].assign(y2)
loss() # does not trace again
I'm trying to construct a little educational example for multivariate linear regresssion, but the LOSS is increasing until it explodes rather than getting smaller, any idea?
import tensorflow as tf
tf.__version__
import numpy as np
data = np.array(
[
[100,35,35,12,0.32],
[101,46,35,21,0.34],
[130,56,46,3412,12.42],
[131,58,48,3542,13.43]
]
)
x = data[:,1:-1]
y_target = data[:,-1]
def loss_function(y, pred):
return tf.reduce_mean(tf.square(y - pred))
def train(b, w, x, y, lr=0.012):
with tf.GradientTape() as t:
current_loss = loss_function(y, linear_model(x))
lr_weight, lr_bias = t.gradient(current_loss, [w, b])
w.assign_sub(lr * lr_weight)
b.assign_sub(lr * lr_bias)
epochs = 80
for epoch_count in range(epochs):
real_loss = loss_function(y_target, linear_model(x))
train(b, w, x, y_target, lr=0.12)
print(f"Epoch count {epoch_count}: Loss value: {real_loss.numpy()}")
This even happens if I initialize the weights with the "correct" values (found out via a scikit-learn regressor)
w = tf.Variable([-1.76770250e-04,3.46688912e-01,2.43827475e-03],dtype=tf.float64)
b = tf.Variable(-11.837184241807234,dtype=tf.float64)
Here's how you might use a TF2 optimizer for a toy example (as per the comment). I know this is not the answer but I didn't want to post this in the comments section, as it will mess up the indentation and all that.
tf_x = tf.Variable(tf.constant(2.0,dtype=tf.float32),name='x')
optimizer = tf.optimizers.SGD(learning_rate=0.1)
# Optimizing tf_x using gradient tape
x_series, y_series = [],[]
for step in range(5):
x_series.append(tf_x.numpy().item())
with tf.GradientTape() as tape:
tf_y = tf_x**2
gradients = tape.gradient(tf_y, tf_x)
optimizer.apply_gradients(zip([gradients], [tf_x]))
Based on #thushv89's input, I'm providing here an intermediate solution using a TF2 Optimizer which is working, although this is not 100% answering my question
import tensorflow as tf
tf.__version__
import numpy as np
data = np.array(
[
[100,35,35,12,0.32],
[101,46,35,21,0.34],
[130,56,46,3412,12.42],
[131,58,48,3542,13.43]
]
)
x = data[:,1:-1]
y_target = data[:,-1]
w = tf.Variable([1,1,1],dtype=tf.float64)
b = tf.Variable(1,dtype=tf.float64)
def linear_model(x):
return b + tf.tensordot(x,w,axes=1)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.MeanSquaredLogarithmicError()
def train_step(x, y):
with tf.GradientTape() as tape:
predicted = linear_model(x)
loss_value = loss_object(y, predicted)
print(f"Loss Value:{loss_value}")
grads = tape.gradient(loss_value, [b,w])
optimizer.apply_gradients(zip(grads, [b,w]))
def train(epochs):
for epoch in range(epochs):
train_step(x, y_target)
print ('Epoch {} finished'.format(epoch))
train(epochs = 1000)
I have run into an issue where batch learning in tensorflow fails to converge to the correct solution for a simple convex optimization problem, whereas SGD converges. A small example is found below, in the Julia and python programming languages, I have verified that the same exact behaviour results from using tensorflow from both Julia and python.
I'm trying to fit the linear model y = s*W + B with parameters W and B
The cost function is quadratic, so the problem is convex and should be easily solved using a small enough step size. If I feed all data at once, the end result is just a prediction of the mean of y. If, however, I feed one datapoint at the time (commented code in julia version), the optimization converges to the correct parameters very fast.
I have also verified that the gradients computed by tensorflow differs between the batch example and summing up the gradients for each datapoint individually.
Any ideas on where I have failed?
using TensorFlow
s = linspace(1,10,10)
s = [s reverse(s)]
y = s*[1,4] + 2
session = Session(Graph())
s_ = placeholder(Float32, shape=[-1,2])
y_ = placeholder(Float32, shape=[-1,1])
W = Variable(0.01randn(Float32, 2,1), name="weights1")
B = Variable(Float32(1), name="bias3")
q = s_*W + B
loss = reduce_mean((y_ - q).^2)
train_step = train.minimize(train.AdamOptimizer(0.01), loss)
function train_critic(s,targets)
for i = 1:1000
# for i = 1:length(y)
# run(session, train_step, Dict(s_ => s[i,:]', y_ => targets[i]))
# end
ts = run(session, [loss,train_step], Dict(s_ => s, y_ => targets))[1]
println(ts)
end
v = run(session, q, Dict(s_ => s, y_ => targets))
plot(s[:,1],v, lab="v (Predicted value)")
plot!(s[:,1],y, lab="y (Correct value)")
gui();
end
run(session, initialize_all_variables())
train_critic(s,y)
Same code in python (I'm not a python user so this might be ugly)
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import tensorflow as tf
from tensorflow.python.framework.ops import reset_default_graph
s = np.linspace(1,10,50).reshape((50,1))
s = np.concatenate((s,s[::-1]),axis=1).astype('float32')
y = np.add(np.matmul(s,[1,4]), 2).astype('float32')
reset_default_graph()
rng = np.random
s_ = tf.placeholder(tf.float32, [None, 2])
y_ = tf.placeholder(tf.float32, [None])
weight_initializer = tf.truncated_normal_initializer(stddev=0.1)
with tf.variable_scope('model'):
W = tf.get_variable('W', [2, 1],
initializer=weight_initializer)
B = tf.get_variable('B', [1],
initializer=tf.constant_initializer(0.0))
q = tf.matmul(s_, W) + B
loss = tf.reduce_mean(tf.square(tf.sub(y_ , q)))
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(loss)
num_epochs = 200
train_cost= []
with tf.Session() as sess:
init = tf.initialize_all_variables()
sess.run(init)
for e in range(num_epochs):
feed_dict_train = {s_: s, y_: y}
fetches_train = [train_op, loss]
res = sess.run(fetches=fetches_train, feed_dict=feed_dict_train)
train_cost = [res[1]]
print train_cost
The answer turned out to be that when I fed in the targets, I fed a vector and not an Nx1 matrix. The operation y_-q then turned into a broadcast operation and instead of returning the elementwise difference, it returned an NxN matrix with the desired difference along the diagonal. In Julia, I solved this by modifying the line
train_critic(s,y)
to
train_critic(s,reshape(y, length(y),1))
to ensure y being a matrix.
A subtle error that took me a very long time to find! Part of the confusion was that TensorFlow seems to treat vectors as row vectors and not as column vectors like Julia, hence the broadcast operation in y_-q
Here is a sample:
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import rnn, rnn_cell
if __name__ == '__main__':
embs = tf.Variable(np.random.random((40,5)),dtype=tf.float32)
X = np.array(np.array(range(1,25)).reshape(4, 6))
x0 = tf.placeholder(tf.int32, [None, None])
x1 = tf.nn.embedding_lookup(embs, x0)
lstm = tf.nn.rnn_cell.BasicLSTMCell(5,state_is_tuple=True)
outputs, states = tf.nn.dynamic_rnn(lstm, x1, dtype=tf.float32,time_major = True)
cost = tf.reduce_mean(outputs[:,-1,:])
optimizer = tf.train.AdagradOptimizer(learning_rate=0.12).minimize(cost)
with tf.Session() as sess:
sess.run(tf.initialize_all_variables())
result3, opt = sess.run([outputs, optimizer],{x0:X})
I use just one slice of outputs which is outputs[:,-1,:] to get a cost function. When I run the code, I got the result
F ./tensorflow/core/framework/tensor.h:581] Check failed: new_num_elements == NumElements() (0 vs. 20)
How to fix this? It's just a sample. I met this problem when I implement a hierarchical LSTM in which the representations of sentences computed by a LSTM is feed into another LSTM.
I confirmed that this is a bug in TensorFlow 0.10. Upgrading to TensorFlow 0.11 will fix the problem.