I am trying to increase my GPU utilization in TensorFlow but I find that subgraph executions are not parallelized.
Here is a working example (TensorFlow version r0.12):
import tensorflow as tf
import numpy as np
from tensorflow.python.client import timeline
#initialize graph
tf.reset_default_graph()
sess = tf.Session()
# some parameters
input_dim = 10000
output_dim = 100
num_hidden = 10000
batch_size = 256
First we create two networks:
#specify two networks with random inputs as data
with tf.device('/gpu:0'):
    # first network
    with tf.variable_scope('net1'):
        tf_data1 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data1, w1), b1)
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result1 = tf.matmul(l1, w2)

    # second network
    with tf.variable_scope('net2'):
        tf_data2 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data2, w1), b1)
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result2 = tf.matmul(l1, w2)
This is what we are interested in:
# the result that we are interested in
out = tf.add(result1, result2)
Now we initialize and run the session:
sess.run(tf.global_variables_initializer()) #initialize variables
# run out operation with trace
run_metadata = tf.RunMetadata()
sess.run(out,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)
# write trace to file
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('trace.ctf.json', 'w') as trace_file:
    trace_file.write(trace.generate_chrome_trace_format())
In the trace we can see the following:
The first MatMul is for net1, and the second MatMul is for net2.
Questions:
1 - Since result1 does not depend on result2, why are these operations not processed in parallel when calling the parent operation 'out'?
2 - Am I doing something wrong when defining the graph? From the documentation I understand that TensorFlow handles concurrency automatically.
3 - Is there any way I can achieve concurrency at this level?
Thanks
Re (1): TensorFlow uses a single GPU compute stream by default. If you run your code on the CPU, you'll see parallelism. To get better GPU utilization, it's best to increase your batch size / kernel size instead.
Re (2): your graph seems to be correctly defined. The automatic parallelization mostly applies to the CPU.
Re (3): as of 1.0 there is no way to run multi-compute-stream code on the TensorFlow GPU backend.
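To illustrate the CPU point in (1): here is a small sketch of my own (not part of the original answer) that pins the same two-branch pattern to the CPU, where the inter-op thread pool can execute the independent branches concurrently. You can verify this with the same timeline trace as above.
import tensorflow as tf

# Two independent matmul branches on the CPU; since neither depends on the other,
# the inter-op thread pool may schedule them at the same time.
tf.reset_default_graph()
with tf.device('/cpu:0'):
    a = tf.matmul(tf.random_normal([2000, 2000]), tf.random_normal([2000, 2000]))
    b = tf.matmul(tf.random_normal([2000, 2000]), tf.random_normal([2000, 2000]))
    out = a + b

# inter_op_parallelism_threads controls how many independent ops can run concurrently.
config = tf.ConfigProto(inter_op_parallelism_threads=2)
with tf.Session(config=config) as sess:
    sess.run(out)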
Related
Is it possible to get samples from a tensor that depends on a random variable in tensorflow? I need to get an approximate sample distribution to use in a loss function to be optimized. Specifically, in the example below, I want to be able to obtain samples of Y_output in order to be able to calculate the mean and variance of the output distribution and use these parameters in a loss function.
def sample_weight(mean, phi, seed=1):
    P_epsilon = tf.distributions.Normal(loc=0., scale=1.0)
    epsilon_s = P_epsilon.sample([1])
    s = tf.multiply(epsilon_s, tf.log(1.0 + tf.exp(phi)))
    weight_sample = mean + s
    return weight_sample

X = tf.placeholder(tf.float32, shape=[None, 1], name="X")
Y_labels = tf.placeholder(tf.float32, shape=[None, 1], name="Y_labels")
sw0 = sample_weight(u0, p0)
sw1 = sample_weight(u1, p1)
Y_output = sw0 + tf.multiply(sw1, X)
loss = tf.losses.mean_squared_error(labels=Y_labels, predictions=Y_output)
train_op = tf.train.AdamOptimizer(0.5e-1).minimize(loss)
init_op = tf.global_variables_initializer()
losses = []
predictions = []
Fx = lambda x: 0.5*x + 5.0
xrnge = 50
xs, ys = build_toy_data(funcx=Fx, stdev=2.0, num=xrnge)

with tf.Session() as sess:
    sess.run(init_op)
    iterations = 1000
    for i in range(iterations):
        stat = sess.run(loss, feed_dict={X: xs, Y_labels: ys})
Not sure if this answers your question, but: when you have a tensor downstream from a sampling op (e.g., the op created by your call to P_epsilon.sample([1])), any time you call sess.run on the downstream tensor, the sample op will be re-run and produce a new random value. Example:
import tensorflow as tf
from tensorflow_probability import distributions as tfd
n = tfd.Normal(0., 1.)
s = n.sample()
y = s**2
sess = tf.Session() # Don't actually do this -- use context manager
print(sess.run(y))
# ==> 0.13539088
print(sess.run(y))
# ==> 0.15465781
print(sess.run(y))
# ==> 4.7929106
If you want a bunch of samples of y, you could do
import tensorflow as tf
from tensorflow_probability import distributions as tfd
n = tfd.Normal(0., 1.)
s = n.sample(100)
y = s**2
sess = tf.Session() # Don't actually do this -- use context manager
print(sess.run(y))
# ==> vector of 100 squared random normal values
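If what you ultimately want, as in your example, is the mean and variance of the output distribution for use in a loss, you can reduce such a vector of samples with tf.nn.moments. A minimal sketch of the idea (mine, applied to the toy distribution above rather than to your Y_output):
import tensorflow as tf
from tensorflow_probability import distributions as tfd

n = tfd.Normal(0., 1.)
s = n.sample(1000)   # a batch of samples in one tensor
y = s**2
# Sample mean and variance along the batch axis; both are ordinary tensors
# that can be fed into a loss.
mean, variance = tf.nn.moments(y, axes=[0])

with tf.Session() as sess:
    m, v = sess.run([mean, variance])
    print(m, v)  # roughly 1.0 and 2.0 for a squared standard normal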
We also have some cool tools in tensorflow_probability to do the kind of thing you're driving at here, namely the Bijector API and, somewhat simpler, the trainable_distributions API.
(Another minor point: I'd suggest using tf.nn.softplus, or at a minimum tf.log1p(tf.exp(x)), instead of tf.log(1.0 + tf.exp(x)). The latter has poor numerical properties due to floating-point imprecision, which the former are designed to avoid.)
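A tiny sketch of my own of why that matters: for a moderately large input, exp overflows float32 and the naive formula returns inf, while softplus stays finite.
import tensorflow as tf

x = tf.constant(100.0)
naive = tf.log(1.0 + tf.exp(x))  # exp(100) overflows float32 -> inf, so the result is inf
stable = tf.nn.softplus(x)       # computed in a numerically safe way

with tf.Session() as sess:
    print(sess.run([naive, stable]))  # approximately [inf, 100.0]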
Hope this is some help!
I am new to TensorFlow and am trying to build an image classifier. I have successfully created the model, and I am trying to predict a single image after restoring the model. I have gone through various tutorials (https://github.com/sankit1/cv-tricks.com/blob/master/Tensorflow-tutorials/tutorial-2-image-classifier/predict.py) but I can't figure out the feed_dict part in my code. I am stuck at the predict function after loading the saved model. Can someone please help me and tell me what to do after loading all the variables from the saved model?
This is the train function, which returns the parameters and saves them in a model.
def trainModel(train, test, learning_rate=0.0001, num_epochs=2, minibatch_size=32, graph_filename='costs'):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.
    Input:
        train : training set
        test : test set
        learning_rate : learning rate
        num_epochs : number of epochs
        minibatch_size : size of minibatch
        print_cost : True to print the cost every epoch
    Returns:
        parameters : parameters learnt by the model
    """
    ops.reset_default_graph()  # for rerunning the model without resetting tf vars
    # input and output shapes
    (n_x, m) = train.images.T.shape
    n_y = train.labels.T.shape[0]
    costs = []  # var for storing the costs for later use
    # create placeholders
    X, Y = placeholderCreator(n_x, n_y)
    parameters = paramInitializer()
    # Forward propagation
    Z3 = forwardPropagation(X, parameters)
    # Cost function
    cost = costCalc(Z3, Y)
    # Backpropagation using Adam optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    # Initialize tf variables
    init = tf.global_variables_initializer()
    minibatch_size = 32
    # Start session to compute Tensorflow graph
    with tf.Session() as sess:
        # Run initialization
        sess.run(init)
        for epoch in range(num_epochs):  # Training loop
            epoch_cost = 0.
            num_minibatches = int(m / minibatch_size)
            for i in range(num_minibatches):
                minibatch_X, minibatch_Y = train.next_batch(minibatch_size)  # Get next batch of training data and labels
                _, minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X.T, Y: minibatch_Y.T})  # Execute optimizer and cost function
                epoch_cost += minibatch_cost / num_minibatches  # Update epoch cost
        saver = tf.train.Saver()
        # Save parameters
        parameters = sess.run(parameters)
        saver.save(sess, "~/trained-model.ckpt")
        return parameters
And this is my predict function, where I am trying to predict an image. I have converted that image into MNIST format for ease of use (predicting_data). I load the model that I saved and apply a softmax function to the output of the 3rd layer (the final output).
def predict():
    train = predicting_data.train
    (n_x, m) = train.images.T.shape
    n_y = train.labels.T.shape[0]
    X, Y = placeholderCreator(n_x, n_y)
    with tf.Session() as sess:
        new_saver = tf.train.import_meta_graph('~/trained-model.ckpt.meta')
        new_saver.restore(sess, '~/trained-model.ckpt')
        W1 = tf.get_default_graph().get_tensor_by_name('W1:0')
        b1 = tf.get_default_graph().get_tensor_by_name('b1:0')
        W2 = tf.get_default_graph().get_tensor_by_name('W2:0')
        b2 = tf.get_default_graph().get_tensor_by_name('b2:0')
        W3 = tf.get_default_graph().get_tensor_by_name('W3:0')
        b3 = tf.get_default_graph().get_tensor_by_name('b3:0')
        # forward propagation
        Z1 = tf.add(tf.matmul(W1, X), b1)
        A1 = tf.nn.relu(Z1)
        Z2 = tf.add(tf.matmul(W2, A1), b2)
        A2 = tf.nn.relu(Z2)
        Z3 = tf.add(tf.matmul(W3, A2), b3)
        y_pred = tf.nn.softmax(Z3)  #### what to do after this????
        cost = sess.run(y_pred, feed_dict={X: train.images.T})
Thank you in advance!
As vijay says in his comment:
Your predict part is not right, you need to get the input and predict tensors from the saved graph using the get_tensor_by_name() function and then use it in your sess.run
If you look at this post, it covers a similar problem and has some code examples.
In your code, you can pass 1 to the next_batch method and get just one image.
minibatch_X, minibatch_Y = train.next_batch(1)
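Putting that together, here is a hedged sketch of what the predict function could look like. The tensor names 'X:0' and 'Z3:0' are assumptions of mine: use whatever names your placeholderCreator and forwardPropagation actually assign (you may need to pass name='X' / name='Z3' when building the training graph), and predict_single_image is just a hypothetical helper name.
import tensorflow as tf

def predict_single_image(image):  # image: a flat NumPy vector of length n_x
    with tf.Session() as sess:
        saver = tf.train.import_meta_graph('~/trained-model.ckpt.meta')
        saver.restore(sess, '~/trained-model.ckpt')
        graph = tf.get_default_graph()
        X = graph.get_tensor_by_name('X:0')    # input placeholder restored from the saved graph
        Z3 = graph.get_tensor_by_name('Z3:0')  # pre-softmax output restored from the saved graph
        y_pred = tf.nn.softmax(Z3)
        # The training graph was fed column-major batches (features x batch), hence the reshape.
        return sess.run(y_pred, feed_dict={X: image.reshape(-1, 1)})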
When I try the following code sample for using TensorFlow with Ray, TensorFlow fails to detect the GPUs on my machine when invoked by the "remote" worker, but it does find the GPUs when invoked "locally". I put "remote" and "locally" in scare quotes because everything is running on my desktop, which has two GPUs and runs Ubuntu 16.04; I installed TensorFlow using the tensorflow-gpu Anaconda package.
The local_network seems to be responsible for these messages in the logs:
2018-01-26 17:24:33.149634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M5000, pci bus id: 0000:03:00.0)
2018-01-26 17:24:33.149642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Quadro M5000, pci bus id: 0000:04:00.0)
And the remote_network seems to be responsible for this message:
2018-01-26 17:24:34.309270: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
Why is TensorFlow able to detect the GPUs in one case but not the other?
import tensorflow as tf
import numpy as np
import ray

ray.init()

BATCH_SIZE = 100
NUM_BATCHES = 1
NUM_ITERS = 201

class Network(object):
    def __init__(self, x, y):
        # Seed TensorFlow to make the script deterministic.
        tf.set_random_seed(0)
        # Define the inputs.
        x_data = tf.constant(x, dtype=tf.float32)
        y_data = tf.constant(y, dtype=tf.float32)
        # Define the weights and computation.
        w = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
        b = tf.Variable(tf.zeros([1]))
        y = w * x_data + b
        # Define the loss.
        self.loss = tf.reduce_mean(tf.square(y - y_data))
        optimizer = tf.train.GradientDescentOptimizer(0.5)
        self.grads = optimizer.compute_gradients(self.loss)
        self.train = optimizer.apply_gradients(self.grads)
        # Define the weight initializer and session.
        init = tf.global_variables_initializer()
        self.sess = tf.Session()
        # Additional code for setting and getting the weights
        self.variables = ray.experimental.TensorFlowVariables(self.loss, self.sess)
        # Return all of the data needed to use the network.
        self.sess.run(init)

    # Define a remote function that trains the network for one step and returns the
    # new weights.
    def step(self, weights):
        # Set the weights in the network.
        self.variables.set_weights(weights)
        # Do one step of training. We only need the actual gradients so we filter over the list.
        actual_grads = self.sess.run([grad[0] for grad in self.grads])
        return actual_grads

    def get_weights(self):
        return self.variables.get_weights()

# Define a remote function for generating fake data.
@ray.remote(num_return_vals=2)
def generate_fake_x_y_data(num_data, seed=0):
    # Seed numpy to make the script deterministic.
    np.random.seed(seed)
    x = np.random.rand(num_data)
    y = x * 0.1 + 0.3
    return x, y

# Generate some training data.
batch_ids = [generate_fake_x_y_data.remote(BATCH_SIZE, seed=i) for i in range(NUM_BATCHES)]
x_ids = [x_id for x_id, y_id in batch_ids]
y_ids = [y_id for x_id, y_id in batch_ids]
# Generate some test data.
x_test, y_test = ray.get(generate_fake_x_y_data.remote(BATCH_SIZE, seed=NUM_BATCHES))

# Create actors to store the networks.
remote_network = ray.remote(Network)
actor_list = [remote_network.remote(x_ids[i], y_ids[i]) for i in range(NUM_BATCHES)]
local_network = Network(x_test, y_test)

# Get initial weights of local network.
weights = local_network.get_weights()

# Do some steps of training.
for iteration in range(NUM_ITERS):
    # Put the weights in the object store. This is optional. We could instead pass
    # the variable weights directly into step.remote, in which case it would be
    # placed in the object store under the hood. However, in that case multiple
    # copies of the weights would be put in the object store, so this approach is
    # more efficient.
    weights_id = ray.put(weights)
    # Call the remote function multiple times in parallel.
    gradients_ids = [actor.step.remote(weights_id) for actor in actor_list]
    # Get all of the weights.
    gradients_list = ray.get(gradients_ids)
    # Take the mean of the different gradients. Each element of gradients_list is a list
    # of gradients, and we want to take the mean of each one.
    mean_grads = [sum([gradients[i] for gradients in gradients_list]) / len(gradients_list) for i in range(len(gradients_list[0]))]
    feed_dict = {grad[0]: mean_grad for (grad, mean_grad) in zip(local_network.grads, mean_grads)}
    local_network.sess.run(local_network.train, feed_dict=feed_dict)
    weights = local_network.get_weights()
    # Print the current weights. They should converge roughly to the values 0.1
    # and 0.3 used in generate_fake_x_y_data.
    if iteration % 20 == 0:
        print("Iteration {}: weights are {}".format(iteration, weights))
The GPUs are cut off by the ray.remote decorator itself. From its source code:
def remote(*args, **kwargs):
    ...
    num_cpus = kwargs["num_cpus"] if "num_cpus" in kwargs else 1
    num_gpus = kwargs["num_gpus"] if "num_gpus" in kwargs else 0  # !!!
    ...
So the following call effectively sets num_gpus=0:
remote_network = ray.remote(Network)
The Ray API is a bit strange, and you can't simply say ray.remote(Network, num_gpus=2) (though that's exactly what you want). Here's what I did, and it seems to work on my machine:
ray.init(num_gpus=2)
...
@ray.remote(num_gpus=2)
class RemoteNetwork(Network):
    pass

actor_list = [RemoteNetwork.remote(x_ids[i], y_ids[i]) for i in range(NUM_BATCHES)]
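If you want to confirm that an actor actually received GPUs, here is a quick sanity-check sketch of my own, using the existing ray.get_gpu_ids() and tf.test.is_gpu_available() helpers. It assumes ray.init(num_gpus=2) has already been called as above, and GPUCheck is just a hypothetical actor name.
import ray

@ray.remote(num_gpus=1)
class GPUCheck(object):
    def report(self):
        import tensorflow as tf
        # Which GPU IDs did Ray assign to this actor, and can TensorFlow see a GPU?
        return ray.get_gpu_ids(), tf.test.is_gpu_available()

checker = GPUCheck.remote()
print(ray.get(checker.report.remote()))  # e.g. ([0], True)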
How can I get the number of FLOPs from tfprof? I have the code as:
def calculate_flops():
    # Print to stdout an analysis of the number of floating point operations in the
    # model broken down by individual operations.
    param_stats = tf.contrib.tfprof.model_analyzer.print_model_analysis(
        tf.get_default_graph(),
        tfprof_options=tf.contrib.tfprof.model_analyzer.TRAINABLE_VARS_PARAMS_STAT_OPTIONS)
    print(param_stats)
But the result says flops = 0.
How can I calculate the number of FLOPs? Can I have an example?
First of all, as of now, tfprof.model_analyzer.print_model_analysis is deprecated and tf.profiler.profile should be used instead according to the official documentation.
Given that we know the number of FLOP, we can get the FLOPS (FLOP per second) of a forward pass by measuring the run time of a forward pass and dividing FLOP by the run time.
Let's take an easy example.
g = tf.Graph()
sess = tf.Session(graph=g)
with g.as_default():
    A = tf.Variable(initial_value=tf.random_normal([25, 16]))
    B = tf.Variable(initial_value=tf.random_normal([16, 9]))
    C = tf.matmul(A, B, name='output')
    sess.run(tf.global_variables_initializer())
    flops = tf.profiler.profile(g, options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('FLOP = ', flops.total_float_ops)
outputs 8288. But why do we get 8288 instead of the expected result 7200 = 2*25*16*9 [a]? The answer lies in the way the tensors A and B are initialised. Initialising with a Gaussian distribution costs some FLOP. Changing the definition of A and B to
A = tf.Variable(initial_value=tf.zeros([25, 16]))
B = tf.Variable(initial_value=tf.zeros([16, 9]))
gives the expected output 7200.
Usually, a network's variables are initialised with Gaussian distributions, among other schemes. Most of the time we are not interested in the initialisation FLOP, as they are done once during initialisation and happen neither during training nor during inference. So, how could one get the exact number of FLOP disregarding the initialisation FLOP?
Freeze the graph with a pb.
The following snippet illustrates this:
import tensorflow as tf
from tensorflow.python.framework import graph_util

def load_pb(pb):
    with tf.gfile.GFile(pb, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='')
        return graph

# ***** (1) Create Graph *****
g = tf.Graph()
sess = tf.Session(graph=g)
with g.as_default():
    A = tf.Variable(initial_value=tf.random_normal([25, 16]))
    B = tf.Variable(initial_value=tf.random_normal([16, 9]))
    C = tf.matmul(A, B, name='output')
    sess.run(tf.global_variables_initializer())
    flops = tf.profiler.profile(g, options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('FLOP before freezing', flops.total_float_ops)
# *****************************

# ***** (2) freeze graph *****
output_graph_def = graph_util.convert_variables_to_constants(sess, g.as_graph_def(), ['output'])
with tf.gfile.GFile('graph.pb', "wb") as f:
    f.write(output_graph_def.SerializeToString())
# *****************************

# ***** (3) Load frozen graph *****
g2 = load_pb('./graph.pb')
with g2.as_default():
    flops = tf.profiler.profile(g2, options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('FLOP after freezing', flops.total_float_ops)
outputs
FLOP before freezing 8288
FLOP after freezing 7200
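Continuing from the snippet above, here is a rough sketch of my own of the FLOPS estimate mentioned at the start: time a forward pass of the frozen graph and divide the FLOP count by the elapsed time. For a graph this small the measurement is dominated by session overhead, so for a realistic model you would average over many runs.
import time

with tf.Session(graph=g2) as sess2:
    output = g2.get_tensor_by_name('output:0')  # the matmul named 'output' above
    sess2.run(output)                           # warm-up run
    start = time.time()
    sess2.run(output)
    elapsed = time.time() - start
    print('approx FLOPS =', flops.total_float_ops / elapsed)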
[a] Usually the FLOP count of a matrix multiplication is mq(2p - 1) for the product AB, where A is [m, p] and B is [p, q], but TensorFlow returns 2mpq for some reason. An issue has been opened to understand why.
So I have a facility of 1 CPU with 64 cores. I have installed TensorFlow from Anaconda. I know that if I had multiple CPUs, I could distribute computation by specifying the CPU IDs, like below (adapted from here):
with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:2"):
    loss = a + b

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)
The above example code assumes three CPUs. But can I do some kind of efficient deep learning exercises with the present facility (1 CPU, 64 cores)? Can someone help or guide me?
UPDATE:
The CPU is an Intel Xeon Phi processor.
Also, please note that I don't have administrator privileges, so I cannot compile any libraries. I installed every Python library via Anaconda.
In an attempt to understand things, I used the Timeline concept (from here) in the code given above, like below:
import tensorflow as tf
from tensorflow.python.client import timeline

with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:0"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:0"):
    loss = a + b

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(tf.global_variables_initializer())
for i in range(10):
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    loss0, _ = sess.run([loss, train_op], options=run_options, run_metadata=run_metadata)
    print("loss", loss0)
    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline_execution1.json', 'w') as f:
        f.write(ctf)
I then generated different JSON files to view the timeline in Chrome, using a config=tf.ConfigProto(intra_op_parallelism_threads=#, inter_op_parallelism_threads=#) line in tf.Session(), and I got different outputs. But I understood nothing other than one point: this program uses 4 cores, whatever options I give inside tf.Session().
In case you have an Intel CPU (maybe a Xeon Phi), compiling TensorFlow with MKL might speed things up.
You can see how it's done here
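As a starting point, and not a definitive recipe, you can also tell the session explicitly how many threads to use; with an MKL build, Intel's guidance is to additionally set OpenMP environment variables such as OMP_NUM_THREADS. The values below are assumptions to tune for your model.
import os
import tensorflow as tf

# OMP_NUM_THREADS is read by OpenMP/MKL builds; it is harmless otherwise.
os.environ.setdefault('OMP_NUM_THREADS', '64')

config = tf.ConfigProto(
    intra_op_parallelism_threads=64,  # threads used inside a single op (e.g. one matmul)
    inter_op_parallelism_threads=2)   # independent ops that may run concurrently
sess = tf.Session(config=config)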