How to calculate the FLOPs from tfprof in TensorFlow? - tensorflow

How can I get the number of FLOPs from tfprof? I have the code as:
def calculate_flops():
    # Print to stdout an analysis of the number of floating point operations in the
    # model broken down by individual operations.
    param_stats = tf.contrib.tfprof.model_analyzer.print_model_analysis(
        tf.get_default_graph(),
        tfprof_options=tf.contrib.tfprof.model_analyzer.
            TRAINABLE_VARS_PARAMS_STAT_OPTIONS)
    print(param_stats)
But the result says FLOPs = 0.
How can I calculate the number of FLOPs? Can I have an example?

First of all, as of now, tfprof.model_analyzer.print_model_analysis is deprecated and tf.profiler.profile should be used instead according to the official documentation.
Given the number of FLOP, we can get the FLOPS (FLOP per second) of a forward pass by measuring the run time of a forward pass and dividing: FLOPS = FLOP / run_time.
Let's take an easy example.
g = tf.Graph()
sess = tf.Session(graph=g)
with g.as_default():
    A = tf.Variable(initial_value=tf.random_normal([25, 16]))
    B = tf.Variable(initial_value=tf.random_normal([16, 9]))
    C = tf.matmul(A, B, name='output')
    sess.run(tf.global_variables_initializer())
    flops = tf.profiler.profile(g, options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('FLOP = ', flops.total_float_ops)
This outputs 8288. But why do we get 8288 instead of the expected result 7200 = 2*25*16*9 [a]? The answer lies in the way the tensors A and B are initialised. Initialising with a Gaussian distribution costs some FLOP. Changing the definition of A and B to
A = tf.Variable(initial_value=tf.zeros([25, 16]))
B = tf.Variable(initial_value=tf.zeros([16, 9]))
gives the expected output 7200.
Usually, a network's variables are initialised with Gaussian distributions, among other schemes. Most of the time, we are not interested in the initialisation FLOP, as they are incurred once during initialisation and happen neither during training nor during inference. So, how could one get the exact number of FLOP while disregarding the initialisation FLOP?
Freeze the graph into a .pb file.
The following snippet illustrates this:
import tensorflow as tf
from tensorflow.python.framework import graph_util

def load_pb(pb):
    with tf.gfile.GFile(pb, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='')
        return graph

# ***** (1) Create Graph *****
g = tf.Graph()
sess = tf.Session(graph=g)
with g.as_default():
    A = tf.Variable(initial_value=tf.random_normal([25, 16]))
    B = tf.Variable(initial_value=tf.random_normal([16, 9]))
    C = tf.matmul(A, B, name='output')
    sess.run(tf.global_variables_initializer())
    flops = tf.profiler.profile(g, options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('FLOP before freezing', flops.total_float_ops)
# *****************************

# ***** (2) Freeze graph *****
output_graph_def = graph_util.convert_variables_to_constants(sess, g.as_graph_def(), ['output'])
with tf.gfile.GFile('graph.pb', "wb") as f:
    f.write(output_graph_def.SerializeToString())
# *****************************

# ***** (3) Load frozen graph *****
g2 = load_pb('./graph.pb')
with g2.as_default():
    flops = tf.profiler.profile(g2, options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('FLOP after freezing', flops.total_float_ops)
outputs
FLOP before freezing 8288
FLOP after freezing 7200
[a] Usually the FLOP count of a matrix multiplication is mq(2p - 1) for the product AB, where A is [m, p] and B is [p, q], but TensorFlow returns 2mpq for some reason. An issue has been opened to understand why.
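As noted at the start, once the FLOP count is known, the FLOPS of a forward pass can be estimated by timing a run and dividing. A minimal sketch, assuming the frozen graph g2 and the flops result from the snippet above (the tensor name 'output:0' comes from the matmul's name='output'):
import time

with tf.Session(graph=g2) as sess2:
    output_tensor = g2.get_tensor_by_name('output:0')
    sess2.run(output_tensor)                       # warm-up run
    n_runs = 100
    start = time.time()
    for _ in range(n_runs):
        sess2.run(output_tensor)
    run_time = (time.time() - start) / n_runs      # average seconds per forward pass
    print('FLOPS = ', flops.total_float_ops / run_time)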

Related

Sampling from tensor that depends on a random variable in tensorflow

Is it possible to get samples from a tensor that depends on a random variable in tensorflow? I need to get an approximate sample distribution to use in a loss function to be optimized. Specifically, in the example below, I want to be able to obtain samples of Y_output in order to be able to calculate the mean and variance of the output distribution and use these parameters in a loss function.
def sample_weight(mean, phi, seed=1):
    P_epsilon = tf.distributions.Normal(loc=0., scale=1.0)
    epsilon_s = P_epsilon.sample([1])
    s = tf.multiply(epsilon_s, tf.log(1.0 + tf.exp(phi)))
    weight_sample = mean + s
    return weight_sample

X = tf.placeholder(tf.float32, shape=[None, 1], name="X")
Y_labels = tf.placeholder(tf.float32, shape=[None, 1], name="Y_labels")
sw0 = sample_weight(u0, p0)
sw1 = sample_weight(u1, p1)
Y_output = sw0 + tf.multiply(sw1, X)
loss = tf.losses.mean_squared_error(labels=Y_labels, predictions=Y_output)
train_op = tf.train.AdamOptimizer(0.5e-1).minimize(loss)
init_op = tf.global_variables_initializer()
losses = []
predictions = []
Fx = lambda x: 0.5*x + 5.0
xrnge = 50
xs, ys = build_toy_data(funcx=Fx, stdev=2.0, num=xrnge)
with tf.Session() as sess:
    sess.run(init_op)
    iterations = 1000
    for i in range(iterations):
        stat = sess.run(loss, feed_dict={X: xs, Y_labels: ys})
Not sure if this answers your question, but: when you have a Tensor downstream from a sampling Op (e.g., the Op created by your call to P_epsilon.sample([1])), anytime you call sess.run on the downstream Tensor, the sample op will be re-run and produce a new random value. Example:
import tensorflow as tf
from tensorflow_probability import distributions as tfd
n = tfd.Normal(0., 1.)
s = n.sample()
y = s**2
sess = tf.Session() # Don't actually do this -- use context manager
print(sess.run(y))
# ==> 0.13539088
print(sess.run(y))
# ==> 0.15465781
print(sess.run(y))
# ==> 4.7929106
If you want a bunch of samples of y, you could do
import tensorflow as tf
from tensorflow_probability import distributions as tfd
n = tfd.Normal(0., 1.)
s = n.sample(100)
y = s**2
sess = tf.Session() # Don't actually do this -- use context manager
print(sess.run(y))
# ==> vector of 100 squared random normal values
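Applied to the question's setup, a sketch along the same lines (hedged: u0, p0, u1, p1 and X are taken from the question, and the sample count is arbitrary) draws many weight samples at once so that Y_output becomes a batch of samples, whose moments can then be used in the loss:
def sample_weights(mean, phi, n_samples):
    # Draw n_samples weight samples in one shot instead of a single sample.
    p_eps = tf.distributions.Normal(loc=0., scale=1.0)
    eps = p_eps.sample([n_samples, 1])              # [n_samples, 1]
    return mean + eps * tf.log(1.0 + tf.exp(phi))   # [n_samples, 1]

n_samples = 100
sw0 = sample_weights(u0, p0, n_samples)             # [n_samples, 1]
sw1 = sample_weights(u1, p1, n_samples)             # [n_samples, 1]
Y_samples = sw0 + sw1 * tf.transpose(X)             # [n_samples, num_points] via broadcasting
Y_mean, Y_var = tf.nn.moments(Y_samples, axes=[0])  # per-point mean and variance over the samples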
We also have some cool tools in tensorflow_probability to do the kind of thing you're driving at here. Namely the Bijector API and, somewhat simpler, the trainable_distributions API.
(Another minor point: I'd suggest using tf.nn.softplus, or at a minimum tf.log1p(tf.exp(x)), instead of tf.log(1.0 + tf.exp(x)). The latter has poor numerical properties due to floating-point imprecision, which the former two are designed to avoid.)
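For instance, the sample_weight helper from the question could be written as follows (just a sketch, keeping the question's signature):
def sample_weight(mean, phi, seed=1):
    P_epsilon = tf.distributions.Normal(loc=0., scale=1.0)
    epsilon_s = P_epsilon.sample([1])
    # softplus(phi) == log(1 + exp(phi)), computed in a numerically stable way
    s = tf.multiply(epsilon_s, tf.nn.softplus(phi))
    return mean + s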
Hope this is some help!

Adding value to TensorBoard for adversarial learning

I'm new to TensorBoard and have faced some problems while using it.
Problem 1:
I'm writing an adversarial learning model. For visualizing the model I have the following losses, for the learning algorithm provided in this paper:
actor loss
critic loss
In one (or K) batches I have to feed both the actor and the critic. Then I need to feed values only to the critic; this time there is no actor. I think that, to show the values in TensorBoard, I need to do the following:
def model():
    ...
    actor_loss = ...
    tf.summary.scalar('actor', actor_loss)
    ...
    critic_loss = ...
    tf.summary.scalar('critic', critic_loss)

my_graph = tf.Graph()
with my_graph.as_default():
    tf.reset_default_graph()
    sess = tf.Session()
    with sess.as_default():
        model()
        merged = tf.summary.merge_all()
        writer = tf.summary.FileWriter(address + '/train', sess.graph)
        init = tf.global_variables_initializer()
        sess.run(init)
Now, while giving input to the inner loop (where both actor and critic participate), there's no problem; we get the result as follows:
a,b,c,d,summary = sess.run( [actor_train_step, critic_train_step, actor_loss, critic_loss, merged], feed_dict = feed_dict )
writer.add_summary(summary, batch)
but when we want to give input only to the critic, the code becomes the following:
a,b,summary = sess.run( [critic_train_step, critic_loss, merged], feed_dict = feed_dict )
writer.add_summary(summary, batch)
But as merged has a dependency on actor_loss, it cannot run. On the other hand, I can't just feed a value for the actor to the model. How do I solve this issue?
Problem 2
I am not evaluating the model (calculating the score value) with tensor operations. Instead, I generate the output, feed it to another script, and get the score value from there. So after each batch/epoch I evaluate my model and get a score value from that script. How can I save this value to TensorBoard?
I cannot finalize a tf.summary.merge_all() operation before session initialization, as I am calculating the evaluation score at training time from an outside script.
Where should I put the tf.summary.merge_all() operation?
Now, if I want to combine Problem 1 and Problem 2 in a single project, is there anything new I have to do?
Note: I'm new to TensorBoard, so it would be better if you could give a detailed explanation.
Problem #1
If you want to summarize only the critic op, you should run just the summary op for the critic part instead of using tf.summary.merge_all().
For example:
def model():
    ...
    actor_loss = ...
    tf.summary.scalar('actor', actor_loss)
    ...
    critic_loss = ...
    summary_critic = tf.summary.scalar('critic', critic_loss)

a, b, summary = sess.run([critic_train_step, critic_loss, summary_critic], feed_dict=feed_dict)
writer.add_summary(summary, batch)
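Equivalently, the summary ops can be grouped with tf.summary.merge so that each training phase runs only the summaries it actually produces (a sketch; the op names are taken from the question):
summary_actor = tf.summary.scalar('actor', actor_loss)
summary_critic = tf.summary.scalar('critic', critic_loss)
merged_both = tf.summary.merge([summary_actor, summary_critic])  # inner loop: actor + critic
merged_critic = tf.summary.merge([summary_critic])               # critic-only steps

# Inner loop (actor and critic both fed):
a, b, c, d, summary = sess.run(
    [actor_train_step, critic_train_step, actor_loss, critic_loss, merged_both],
    feed_dict=feed_dict)
writer.add_summary(summary, batch)

# Critic-only step:
a, b, summary = sess.run([critic_train_step, critic_loss, merged_critic], feed_dict=feed_dict)
writer.add_summary(summary, batch)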
Problem #2
To visualize the values you get after running the outside script, you can convert them to a tensor using tf.convert_to_tensor(), which is documented here, and then serialize that tensor to visualize it on TensorBoard.
For example:
vals = output_from_outside_script()
vals_tensor = sess.run(tf.convert_to_tensor(vals))
tf.summary.scalar('evaluation', vals_tensor)
Every tf.summary operation creates a Summary protobuf which serializes your tensor to an events file. And instead of running each summary op individually, TensorFlow provides tf.summary.merge_all() to run all the summary ops in your graph.
I tried to do it in your case.
Outside script:
import numpy as np

def output_from_outside_script(var):
    return np.sum(var)
Code in adversarial training:
import os
import tensorflow as tf
import numpy as np
from outside_evaluation import *
sess = tf.Session()
x = sess.run(tf.constant([[1,2,3,4]], dtype=tf.float32))
X = tf.placeholder(dtype=tf.float32, shape=[1, 4])
W = tf.Variable(tf.truncated_normal([4, 10], stddev=0.1))
sess.run(tf.global_variables_initializer())
val = tf.matmul(a=X, b=W, name='matmul')
tf.summary.scalar('matmul_mean', tf.reduce_mean(val))
y = sess.run(val, feed_dict={X: x})
print('y = ', y)
vals = output_from_outside_script(y)
print('vals = ', vals)
vals_tensor = tf.convert_to_tensor(vals, name='vals_tensor')
tf.summary.scalar('evaluation', vals_tensor)
writer = tf.summary.FileWriter(os.path.join('test_log'), sess.graph)
merged = tf.summary.merge_all()
summary = sess.run(merged, feed_dict={X: x})
writer.add_summary(summary)
writer.close()
Output:
('y = ', array([[-0.51137048, -0.16054343, -0.03827953, 0.1124011 , 0.09200752,
-0.22235785, 0.41357356, 1.04061067, -0.08877556, -0.86647421]],
('vals = ', -0.22920817)
TensorBoard log: scalar plots for the 'matmul_mean' and 'evaluation' summaries.
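If the score changes every batch/epoch, one alternative sketch (not part of the run above; the loop variables are illustrative) is to feed the externally computed score through a placeholder so the same summary op can be reused each time:
score_ph = tf.placeholder(tf.float32, shape=[], name='eval_score')
score_summary = tf.summary.scalar('evaluation_score', score_ph)

for epoch in range(num_epochs):
    # ... training steps for this epoch ...
    score = output_from_outside_script(y)   # value computed outside the graph
    summary = sess.run(score_summary, feed_dict={score_ph: score})
    writer.add_summary(summary, epoch)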
Should there be any problem, please let me know.

How to find intermediate outputs of LSTM by running tf.nn.dynamic_rnn in tensorflow

I am new to TensorFlow and have recently read about LSTMs from various blog posts like Understanding LSTM Networks (Colah) and The Unreasonable Effectiveness of Recurrent Neural Networks (Karpathy).
I found this code on the web:
import numpy as np
import tensorflow as tf

def length(sequence):
    used = tf.sign(tf.reduce_max(tf.abs(sequence), reduction_indices=2))
    length = tf.reduce_sum(used, reduction_indices=1)
    length = tf.cast(length, tf.int32)
    return length

num_neurons = 10
num_layers = 3
max_length = 8
frame_size = 5
# dropout = tf.placeholder(tf.float32)
cell = tf.contrib.rnn.LSTMCell(num_neurons, state_is_tuple=True)
# cell = DropoutWrapper(cell, output_keep_prob=dropout)
cell = tf.contrib.rnn.MultiRNNCell([cell] * num_layers)
sequence = tf.placeholder(tf.float32, [None, max_length, frame_size])
output, state = tf.nn.dynamic_rnn(
    cell,
    sequence,
    dtype=tf.float32,
    sequence_length=length(sequence),
)

if __name__ == '__main__':
    sample = np.random.random((8, max_length, frame_size)) + 0.1
    # sample[np.ix_([0,1],range(50,max_length))] = 0
    # drop = 0.2
    with tf.Session() as sess:
        init_op = tf.global_variables_initializer()
        sess.run(init_op)
        o, s = sess.run([output, state], feed_dict={sequence: sample})
        # print "Output shape is ", o.shape()
        # print "state shape is ", s.shape()
        print "Output is ", o
        print "State is ", s
Pertaining to the above code with state_is_tuple=True, I have some doubts.
Q. What is the simple meaning of the output and state that tf.nn.dynamic_rnn returns?
I read on the internet that output is the output of the last layer at several time steps and state is the final state.
My intermediate doubt is: what do we mean by "output of the last layer at several time steps"?
I looked into the dynamic_rnn code (https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/rnn.py), as my main task is to find:
Q. All the intermediate outputs of the LSTM, by calling dynamic_rnn in the same fashion as the above code. How can I do it?
I also read that dynamic_rnn internally calls _dynamic_rnn. This _dynamic_rnn returns final_output and final_state. Apart from final_output, I want all the intermediate outputs.
My take is to write a custom _dynamic_rnn as defined in
https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/rnn.py
Please help.
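For reference, a small sketch of what the returned structures look like for the snippet above (a hedged reading; o and s are the values from the sess.run call, and the shapes follow from batch size 8, max_length 8, num_neurons 10):
print o.shape        # (8, 8, 10): [batch, max_length, num_neurons] -- top-layer output at every time step
print len(s)         # 3: one state entry per layer of the MultiRNNCell
print s[0].c.shape   # (8, 10): cell state of the first layer at its final time step
print s[0].h.shape   # (8, 10): hidden state of the first layer at its final time step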

How to parallelize subgraph executions in TensorFlow?

I am trying to increase my GPU utilization in TensorFlow but I find that subgraph executions are not parallelized.
Here is a working example (TensorFlow version r0.12):
import tensorflow as tf
import numpy as np
from tensorflow.python.client import timeline
#initialize graph
tf.reset_default_graph()
sess = tf.Session()
# some parameters
input_dim = 10000
output_dim = 100
num_hidden = 10000
batch_size = 256
First we create two networks:
#specify two networks with random inputs as data
with tf.device('/gpu:0'):
    # first network
    with tf.variable_scope('net1'):
        tf_data1 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data1, w1), b1)
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result1 = tf.matmul(l1, w2)
    # second network
    with tf.variable_scope('net2'):
        tf_data2 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data2, w1), b1)
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result2 = tf.matmul(l1, w2)
This is what we are interested in:
#the result that we are interested
out = tf.add(result1, result2)
Now we initialize and run the session:
sess.run(tf.global_variables_initializer()) #initialize variables
# run out operation with trace
run_metadata = tf.RunMetadata()
sess.run(out,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)
# write trace to file
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
trace_file = open('trace.ctf.json', 'w')
trace_file.write(trace.generate_chrome_trace_format())
In the trace we can see the following:
The first MatMul is for net1, and the second MatMul is for net2.
Questions:
1 - Since result1 does not depend on result2, why are these operations not processed in parallel when calling the parent operation 'out'?
2 - Am I doing something wrong when defining the graph? From the documentation I understand that TensorFlow handles concurrency automatically.
3 - Is there any way I can achieve concurrency at this level?
Thanks
Re (1) TensorFlow by default uses a single GPU stream. If you run your code on CPU, you'll see parallelism. To get better GPU utilization it's best to increase your batch size / kernel size instead.
Re (2) your graph seems to be correctly defined. The automatic parallelization mostly applies to the CPU.
Re (3) as of 1.0 there is no way to run multi-compute-stream code on the TensorFlow GPU.
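One hedged workaround, if a second GPU happens to be available (purely a sketch, not something the answer above claims): placing each subnetwork on its own device lets the two branches execute concurrently, since each device gets its own stream.
# Sketch: put each subnetwork on its own GPU (assumes '/gpu:1' exists).
with tf.variable_scope('net1'), tf.device('/gpu:0'):
    tf_data1 = tf.random_normal(shape=[batch_size, input_dim])
    w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
    b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
    w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
    result1 = tf.matmul(tf.add(tf.matmul(tf_data1, w1), b1), w2)

with tf.variable_scope('net2'), tf.device('/gpu:1'):
    tf_data2 = tf.random_normal(shape=[batch_size, input_dim])
    w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
    b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
    w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
    result2 = tf.matmul(tf.add(tf.matmul(tf_data2, w1), b1), w2)

# 'out' still waits for both branches, but each branch now runs on its own device.
out = tf.add(result1, result2)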

How to implement a sliding window in tensorflow?

I have created a sliding window algorithm using numpy that slides over a wav audio file and feeds slices of it to my NN in tensorflow, which detects features in the audio slices. Once tensorflow does its thing, it returns its output to numpy land, where I reassemble the slices into an array of predictions that match each sample position of the original file:
import tensorflow as tf
import numpy as np
import nn

def slide_predict(layers, X, modelPath):
    output = None
    graph = tf.Graph()
    with graph.as_default():
        input_layer_size, hidden_layer_size, num_labels = layers
        X_placeholder = tf.placeholder(tf.float32, shape=(None, input_layer_size), name='X')
        Theta1 = tf.Variable(nn.randInitializeWeights(input_layer_size, hidden_layer_size), name='Theta1')
        bias1 = tf.Variable(nn.randInitializeWeights(hidden_layer_size, 1), name='bias1')
        Theta2 = tf.Variable(nn.randInitializeWeights(hidden_layer_size, num_labels), name='Theta2')
        bias2 = tf.Variable(nn.randInitializeWeights(num_labels, 1), name='bias2')
        hypothesis = nn.forward_prop(X_placeholder, Theta1, bias1, Theta2, bias2)
        sess = tf.Session(graph=graph)
        saver = tf.train.Saver()
        init = tf.global_variables_initializer()
        sess.run(init)
        saver.restore(sess, modelPath)
    window_size = layers[0]
    pad_amount = (window_size * 2) - (X.shape[0] % window_size)
    X = np.pad(X, (pad_amount, 0), 'constant')
    for w in range(window_size):
        start = w
        end = -window_size + w
        X_shifted = X[start:end]
        X_matrix = X_shifted.reshape((-1, window_size))
        prediction = sess.run(hypothesis, feed_dict={X_placeholder: X_matrix})
        output = prediction if (output is None) else np.hstack((output, prediction))
    sess.close()
    output.shape = (X.size, -1)
    return output
Unfortunately, this algorithm is quite slow. I placed some logs along the way and by far the slowest portion is the part where I actually run my tensorflow graph. This could be due to the actual tensorflow calculations being slow (if so, I'm probably just SOL), but I am wondering if a large part of the slowness isn't because I am transferring large audio files repeatedly back and forth in and out of tensorflow. So my questions are:
1) Is feeding a placeholder repeatedly like this going to be noticeably slower than feeding it once and calculating the values for X inside tensorflow?
2) If yes, what's the best way to implement a sliding window algorithm inside tensorflow to do this calculation?
The first issue is that your algorithm has quadratic time complexity in window_size, because it calls np.hstack() in each iteration to build the output array, which copies both the current values of output and prediction into a new array:
for w in range(window_size):
    # ...
    output = prediction if (output is None) else np.hstack((output, prediction))
Instead of calling np.hstack() in every iteration, it would be more efficient to build a Python list of the prediction arrays, and call np.hstack() on them once, after the loop terminates:
output_list = []
for w in range(window_size):
    # ...
    prediction = sess.run(...)
    output_list.append(prediction)
output = np.hstack(output_list)
The second issue is that feeding large values to TensorFlow can be inefficient if the amount of computation in the sess.run() call is small, because those values are (currently) copied into C++ (and the results are copied out). One useful strategy for this is to try to move the sliding window loop into the TensorFlow graph, using the tf.map_fn() construct. For example, you could restructure your program as follows:
# NOTE: If you call this function often, you may want to (i) move the `np.pad()`
# into the graph as `tf.pad()`, and (ii) replace `X_t` with a placeholder.
X = np.pad(X, (pad_amount, 0), 'constant')
X_t = tf.convert_to_tensor(X)

def window_func(w):
    start = w
    end = w - window_size
    X_matrix = tf.reshape(X_t[start:end], (-1, window_size))
    return nn.forward_prop(X_matrix, Theta1, bias1, Theta2, bias2)

# `dtype` is required because `window_func` returns floats while the elems are ints;
# the result stacks one slice of predictions per window offset.
output_t = tf.map_fn(window_func, tf.range(window_size), dtype=tf.float32)
# ...
output = sess.run(output_t)