With the following code (which uses tensorflow.contrib.nccl.all_sum), I expected to see bytes being transferred over NVLink, but in practice I don't.
import tensorflow as tf
from tensorflow.contrib.nccl import all_sum

with tf.device('/gpu:0'):
    a = tf.get_variable(
        "a", initializer=tf.constant(1.0, shape=(args.dim, args.dim)))

with tf.device('/gpu:1'):
    b = tf.get_variable(
        "b", initializer=tf.constant(2.0, shape=(args.dim, args.dim)))

with tf.device('/gpu:0'):
    summed_node = all_sum([a, b])

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True))
init = tf.global_variables_initializer()
sess.run(init)

with tf.device('/gpu:0'):
    summed = sess.run(summed_node)
My machine is an AWS p3.8xlarge instance, which as far as I understand supports NVLink.
The code runs fine, but when I check with nvidia-smi nvlink -g 0 -i 0, the link Tx/Rx counters are all zero.
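To sanity-check the setup, here is a rough diagnostic that can be run right after building the graph (the exact op type string of the NCCL all-reduce may differ between TF versions, so the substring match below is an assumption):

# List where the NCCL ops ended up; all_sum creates one all-reduce op per
# input tensor, and each should be placed on that input's GPU.
for op in tf.get_default_graph().get_operations():
    if 'Nccl' in op.type:
        print(op.type, op.name, '->', op.device)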
So I have a machine with a single CPU that has 64 cores. I have installed TensorFlow from Anaconda. I know that if I had multiple CPUs, I could distribute the computation by specifying the CPU ids, like below (adapted from here):
with tf.device("/cpu:0"):
a = tf.Variable(tf.ones(()))
a = tf.square(a)
with tf.device("/cpu:1"):
b = tf.Variable(tf.ones(()))
b = tf.square(b)
with tf.device("/cpu:2"):
loss = a+b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
loss0, _ = sess.run([loss, train_op])
print("loss", loss0)
The example above assumes three CPUs. But I was wondering whether I can do some kind of deep learning exercises efficiently on my current setup (1 CPU, 64 cores). Can someone help or guide me?
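I am not sure whether this is relevant, but apparently a single socket can still be exposed as several /cpu:N devices through ConfigProto's device_count. A minimal sketch (the number 4 is arbitrary, and I don't know whether splitting one physical CPU this way actually helps):

# Expose 4 logical CPU devices (/cpu:0 .. /cpu:3), all backed by the same
# physical 64-core CPU; explicit tf.device placement then works as above.
config = tf.ConfigProto(device_count={"CPU": 4})
sess = tf.Session(config=config)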
UPDATE:
The cores are from an Intel Xeon Phi processor.
Also, please note that I don't have administrator privileges, so I cannot compile any libraries myself; I installed all Python libraries via Anaconda.
In an attempt to understand what is going on, I used the Timeline concept (from here) in the code above, like below:
import tensorflow as tf
from tensorflow.python.client import timeline

with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:0"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:0"):
    loss = a + b

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(10):
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    loss0, _ = sess.run([loss, train_op],
                        options=run_options, run_metadata=run_metadata)
    print("loss", loss0)
    # Create the Timeline object and write it to a JSON file
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline_execution1.json', 'w') as f:
        f.write(ctf)
I then generated different JSON files by passing config=tf.ConfigProto(intra_op_parallelism_threads=#, inter_op_parallelism_threads=#) to tf.Session() and viewed each timeline in Chrome. The outputs differ, but the only thing I could conclude is that the program uses 4 cores, whatever options I give to tf.Session().
If you have an Intel CPU (such as a Xeon Phi), compiling TensorFlow with MKL might speed things up.
You can see how it's done here.
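If the Anaconda build already ships with MKL, a rough starting point (the thread counts and OpenMP settings below are assumptions to tune, not a verified recipe) is to combine the MKL-related environment variables with TensorFlow's own thread-pool settings so the session can actually use all 64 cores:

import os

# Set the OpenMP/MKL knobs before TensorFlow (and hence MKL) is initialized.
os.environ["OMP_NUM_THREADS"] = "64"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# Let individual ops use many threads and allow a couple of independent ops
# to run concurrently; benchmark to find the right split.
config = tf.ConfigProto(intra_op_parallelism_threads=64,
                        inter_op_parallelism_threads=2)
sess = tf.Session(config=config)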
I need help launching TensorBoard from TensorFlow running on Datalab.
My code is the following (everything is on Datalab):
import tensorflow as tf

with tf.name_scope('input'):
    print("X_np")
    X_np = tf.placeholder(tf.float32, shape=[None, num_of_features], name="input")

with tf.name_scope('weights'):
    print("W is for weights - 15 number of diseases")
    W = tf.Variable(tf.zeros([num_of_features, 15]), name="W")

with tf.name_scope('biases'):
    print("b")
    # todo: automate for more diseases
    b = tf.Variable(tf.zeros([15]), name="biases")

with tf.name_scope('layer'):
    print("y_train_np")
    y_train_np = tf.nn.softmax(tf.matmul(X_np, W) + b)

with tf.name_scope('correct'):
    print("y_ - placeholder for correct answer")
    y_ = tf.placeholder(tf.float32, shape=[None, 15], name="correct_answer")

with tf.name_scope('loss'):
    print("cross entropy")
    cross_entropy = -tf.reduce_sum(y_ * tf.log(y_train_np))

# % of correct answers found in batch
print("is correct")
is_correct = tf.equal(tf.argmax(y_train_np, 1), tf.argmax(y_, 1))
print("accuracy")
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

print("train step")
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

# train data and get results for batches
print("initialize all variables")
init = tf.global_variables_initializer()

print("session")
sess = tf.Session()
writer = tf.summary.FileWriter("logs/", sess.graph)

init = tf.global_variables_initializer()
sess.run(init)

!tensorboard --logdir=/logs
The output is:
Starting TensorBoard 41 on port 6006
(You can navigate to http://172.17.0.2:6006)
However, when I click on the link, the web page is empty.
Please let me know what I am missing. I am expecting to see the graph; later I would like to log more data. Any suggestion is appreciated.
Many thanks!
If you are using Datalab, you can use TensorBoard as below:
from google.datalab.ml import TensorBoard as tb
tb.start('./logs')
http://googledatalab.github.io/pydatalab/google.datalab.ml.html
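If I remember the pydatalab API correctly (worth double-checking against the docs linked above), the same TensorBoard class can also list and stop running instances:

from google.datalab.ml import TensorBoard as tb

tb.start('./logs')   # start an instance serving ./logs
tb.list()            # list running TensorBoard instances and their pids
# tb.stop(pid)       # stop a specific instance by pid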
You can also create a Cloud AI Platform Notebooks instance with TensorBoard support by entering the following commands into Cloud Shell. Afterwards you can simply launch TensorBoard whenever you want from the launcher (File -> New Launcher -> Tensorboard).
export IMAGE_FAMILY="tf-1-14-cpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="tf-tensorboard-1"
export INSTANCE_TYPE="n1-standard-4"
gcloud compute instances create "${INSTANCE_NAME}" \
    --zone="${ZONE}" \
    --image-family="${IMAGE_FAMILY}" \
    --image-project=deeplearning-platform-release \
    --machine-type="${INSTANCE_TYPE}" \
    --boot-disk-size=200GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --metadata="proxy-mode=project_editors"
I am trying to increase my GPU utilization in TensorFlow, but I find that subgraph executions are not parallelized.
Here is a working example (TensorFlow version r0.12):
import tensorflow as tf
import numpy as np
from tensorflow.python.client import timeline
#initialize graph
tf.reset_default_graph()
sess = tf.Session()
# some parameters
input_dim = 10000
output_dim = 100
num_hidden = 10000
batch_size = 256
First we create two networks:
# specify two networks with random inputs as data
with tf.device('/gpu:0'):
    # first network
    with tf.variable_scope('net1'):
        tf_data1 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data1, w1), b1)
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result1 = tf.matmul(l1, w2)

    # second network
    with tf.variable_scope('net2'):
        tf_data2 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data2, w1), b1)
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result2 = tf.matmul(l1, w2)
This is the result we are interested in:
# the result we are interested in
out = tf.add(result1, result2)
Now we initialize and run the session:
sess.run(tf.global_variables_initializer())  # initialize variables

# run the out operation with tracing enabled
run_metadata = tf.RunMetadata()
sess.run(out,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)

# write the trace to file
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
trace_file = open('trace.ctf.json', 'w')
trace_file.write(trace.generate_chrome_trace_format())
In the trace (Chrome timeline screenshot omitted) the two large MatMul ops run back to back rather than concurrently: the first MatMul belongs to net1 and the second to net2.
Questions:
1 - Since result1 does not depend on result2, why are these operations not processed in parallel when the parent operation "out" is run?
2 - Am I doing something wrong when defining the graph? From the documentation I understand that TensorFlow handles concurrency automatically.
3 - Is there any way I can achieve concurrency at this level?
Thanks
Re (1): TensorFlow by default uses a single compute stream per GPU. If you run your code on the CPU, you'll see parallelism. To get better GPU utilization, it's best to increase your batch size / kernel size instead.
Re (2): your graph seems to be correctly defined; the automatic parallelization mostly applies to the CPU.
Re (3): as of 1.0 there is no way to run multi-compute-stream code on a TensorFlow GPU.
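For example (a sketch, assuming the two-network graph above is rebuilt under tf.device('/cpu:0') instead of '/gpu:0'): giving the session more than one inter-op thread lets the two independent MatMul chains overlap in the timeline.

# With more than one inter-op thread, the two independent MatMul chains can
# be scheduled concurrently on the CPU.
config = tf.ConfigProto(inter_op_parallelism_threads=2,
                        intra_op_parallelism_threads=4)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())
sess.run(out)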
I tried to run the following code in a Jupyter notebook, but I got an InvalidArgumentError for the placeholder.
When I put the same code in a Python script and ran it from the command line, it worked. I want to know how I can run the code successfully in the notebook. Thanks.
OS: Ubuntu 16.04 LTS
Tensorflow version: 0.12rc (installed from source)
Programs and output:
Command window: (output screenshot omitted)
Actual code:
import tensorflow as tf
import numpy as np

raw_data = np.random.normal(10, 1, 100)

# Define alpha as a constant
alpha = tf.constant(0.05)

# A placeholder is just like a variable, but the value is injected from the
# session
curr_value = tf.placeholder(tf.float32)

# Initialize the previous average to some value
prev_avg = tf.Variable(0.)

# Exponential moving average of the incoming values
update_avg = alpha * curr_value + (1 - alpha) * prev_avg

avg_hist = tf.summary.scalar("running_average", update_avg)
value_hist = tf.summary.scalar("incoming_values", curr_value)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter("./logs")
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(len(raw_data)):
        summary_str, curr_avg = sess.run([merged, update_avg],
                                         feed_dict={curr_value: raw_data[i]})
        sess.run(tf.assign(prev_avg, curr_avg))
        print(raw_data[i], curr_avg)
        writer.add_summary(summary_str, i)
Your raw_data is float64 (the default NumPy float type), whereas your placeholder is float32 (the default TensorFlow float type). You should explicitly cast your data to float32.
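For example, a minimal sketch of that suggestion:

# Cast the NumPy array to float32 so it matches the tf.float32 placeholder.
raw_data = np.random.normal(10, 1, 100).astype(np.float32)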
DCGAN
When I run the project, I get the following error:
ValueError: Variable d_h0_conv/w/Adam/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?
The relevant parts of the code are below.
The optimizer:
d_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1) \
          .minimize(self.d_loss, var_list=self.d_vars)
g_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1) \
          .minimize(self.g_loss, var_list=self.g_vars)
The variables:
self.d_vars = [var for var in t_vars if 'd_' in var.name]
self.g_vars = [var for var in t_vars if 'g_' in var.name]
The operation:
def conv2d(input_, output_dim,
           k_h=5, k_w=5, d_h=2, d_w=2, stddev=0.02,
           name="conv2d"):
    with tf.variable_scope(name):
        w = tf.get_variable('w', [k_h, k_w, input_.get_shape()[-1], output_dim],
                            initializer=tf.truncated_normal_initializer(stddev=stddev))
        conv = tf.nn.conv2d(input_, w, strides=[1, d_h, d_w, 1], padding='SAME')
        biases = tf.get_variable('biases', [output_dim],
                                 initializer=tf.constant_initializer(0.0))
        conv = tf.reshape(tf.nn.bias_add(conv, biases), conv.get_shape())
        return conv
Environment:
Ubuntu 14.04, Python 2.7, TensorFlow 0.12
Thanks for any help.
I assume you were running the command to train the network after pulling the data.
I was able to clone the project, pull the image data sets, and run the training command using Python 3.5 on Ubuntu with TensorFlow 0.12. The commands are only slightly different
(e.g. python3 main.py --dataset mnist --is_train True vs python ...).
I know this project supports Python 2.7, but were you able to run it using Python 3?