How to profile tensorflow model that running on tf-serving? - tensorflow

As what I know, when I running tensorflow model on python script I could use the follow code snippet to profile the timeline of each block in the model.
from tensorflow.python.client import timeline
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
batch_positive_score = sess.run([positive_score], feed_dict, options=options, run_metadata=run_metadata)
fetched_timeline = timeline.Timeline(run_metadata.step_stats)
chrome_trace = fetched_timeline.generate_chrome_trace_format()
with open('./result/timeline.json', 'w') as f:
f.write(chrome_trace)
But how to profile a model that loading on tensorflow-serving?

I think you can use tf.profiler, even during Serving because, it is finally a Tensorflow Graph and the changes made during Training (including Profiling, as per my understanding) will be reflected in Serving as well.
Please find the below Tensorflow Code:
# User can control the tracing steps and
# dumping steps. User can also run online profiling during training.
#
# Create options to profile time/memory as well as parameters.
builder = tf.profiler.ProfileOptionBuilder
opts = builder(builder.time_and_memory()).order_by('micros').build()
opts2 = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()
# Collect traces of steps 10~20, dump the whole profile (with traces of
# step 10~20) at step 20. The dumped profile can be used for further profiling
# with command line interface or Web UI.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir',
trace_steps=range(10, 20),
dump_steps=[20]) as pctx:
# Run online profiling with 'op' view and 'opts' options at step 15, 18, 20.
pctx.add_auto_profiling('op', opts, [15, 18, 20])
# Run online profiling with 'scope' view and 'opts2' options at step 20.
pctx.add_auto_profiling('scope', opts2, [20])
# High level API, such as slim, Estimator, etc.
train_loop()
After that, we can run the below mentioned commands in the command prompt:
bazel-bin/tensorflow/core/profiler/profiler \
--profile_path=/tmp/train_dir/profile_xx
tfprof> op -select micros,bytes,occurrence -order_by micros
# Profiler ui available at: https://github.com/tensorflow/profiler-ui
python ui.py --profile_context_path=/tmp/train_dir/profile_xx
Code for Visualizing Time and Memory:
# The following example generates a timeline.
tfprof> graph -step -1 -max_depth 100000 -output timeline:outfile=<filename>
generating trace file.
******************************************************
Timeline file is written to <filename>.
Open a Chrome browser, enter URL chrome://tracing and load the timeline file.
******************************************************
Attribute TensorFlow graph running time to your Python codes:
tfprof> code -max_depth 1000 -show_name_regexes .*model_analyzer.*py.* -select micros -account_type_regexes .* -order_by micros
Show your model variables and the number of parameters:
tfprof> scope -account_type_regexes VariableV2 -max_depth 4 -select params
Show the most expensive operation types:
tfprof> op -select micros,bytes,occurrence -order_by micros
Auto-profile:
tfprof> advise
For more detailed information on this , you can refer the below links:
Understand all the classes mentioned in this page =>
https://www.tensorflow.org/api_docs/python/tf/profiler
Code is given in detail in the below link:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md

Related

Tf-agent Actor/Learner: TFUniform ReplayBuffer dimensionality issue - invalid shape of Replay Buffer vs. Actor update

I try to adapt the this tf-agents actor<->learner DQN Atari Pong example to my windows machine using a TFUniformReplayBuffer instead of the ReverbReplayBuffer which only works on linux machine but I face a dimensional issue.
[...]
---> 67 init_buffer_actor.run()
[...]
InvalidArgumentError: {{function_node __wrapped__ResourceScatterUpdate_device_/job:localhost/replica:0/task:0/device:CPU:0}} Must have updates.shape = indices.shape + params.shape[1:] or updates.shape = [], got updates.shape [84,84,4], indices.shape [1], params.shape [1000,84,84,4] [Op:ResourceScatterUpdate]
The problem is as follows: The tf actor tries to access the replay buffer and initialize the it with a certain number random samples of shape (84,84,4) according to this deepmind paper but the replay buffer requires samples of shape (1,84,84,4).
My code is as follows:
def train_pong(
env_name='ALE/Pong-v5',
initial_collect_steps=50000,
max_episode_frames_collect=50000,
batch_size=32,
learning_rate=0.00025,
replay_capacity=1000):
# load atari environment
collect_env = suite_atari.load(
env_name,
max_episode_steps=max_episode_frames_collect,
gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)
# create tensor specs
observation_tensor_spec, action_tensor_spec, time_step_tensor_spec = (
spec_utils.get_tensor_specs(collect_env))
# create training util
train_step = train_utils.create_train_step()
# calculate no. of actions
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1
# create agent
agent = dqn_agent.DqnAgent(
time_step_tensor_spec,
action_tensor_spec,
q_network=create_DL_q_network(num_actions),
optimizer=tf.compat.v1.train.RMSPropOptimizer(learning_rate=learning_rate))
# create uniform replay buffer
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
data_spec=agent.collect_data_spec,
batch_size=1,
max_length=replay_capacity)
# observer of replay buffer
rb_observer = replay_buffer.add_batch
# create batch dataset
dataset = replay_buffer.as_dataset(
sample_batch_size=batch_size,
num_steps = 2,
single_deterministic_pass=False).prefetch(3)
# create callable function for actor
experience_dataset_fn = lambda: dataset
# create random policy for buffer init
random_policy = random_py_policy.RandomPyPolicy(collect_env.time_step_spec(),
collect_env.action_spec())
# create initalizer
init_buffer_actor = actor.Actor(
collect_env,
random_policy,
train_step,
steps_per_run=initial_collect_steps,
observers=[replay_buffer.add_batch])
# initialize buffer with random samples
init_buffer_actor.run()
(The approach is using the OpenAI Gym Env as well as the corresponding wrapper functions)
I worked with keras-rl2 and tf-agents without actor<->learner for other atari games to create the DQN and both worked quite well afer a some adaptions. I guess my current code will also work after a few adaptions in the tf-agent libary functions, but that would obviate the purpose of the libary.
My current assumption: The actor<->learner methods are not able to work with the TFUniformReplayBuffer (as I expect them to), due to the missing support of the TFPyEnvironment - or I still have some knowledge shortcomings regarding this tf-agents approach
Previous (successful) attempt:
from tf_agents.environments.tf_py_environment import TFPyEnvironment
tf_collect_env = TFPyEnvironment(collect_env)
init_driver = DynamicStepDriver(
tf_collect_env,
random_policy,
observers=[replay_buffer.add_batch],
num_steps=200)
init_driver.run()
I would be very grateful if someone could explain me what I'm overseeing here.
I fixed it...partly, but the next error is (in my opinion) an architectural problem.
The problem is that the Actor/Learner setup is build on a PyEnvironment whereas the
TFUniformReplayBuffer is using the TFPyEnvironment which ends up in the failure above...
Using the PyUniformReplayBuffer with a converted py-spec solved this problem.
from tf_agents.specs import tensor_spec
# convert agent spec to py-data-spec
py_collect_data_spec = tensor_spec.to_array_spec(agent.collect_data_spec)
# create replay buffer based on the py-data-spec
replay_buffer = py_uniform_replay_buffer.PyUniformReplayBuffer(
data_spec= py_collect_data_spec,
capacity=replay_capacity*batch_size
)
This snippet solved the issue of having an incompatible buffer in the background but ends up in another issue
--> The add_batch function does not work
I found this approach which advises to use either a batched environment or to make the following adaptions for the replay observer (add_batch method).
from tf_agents.utils.nest_utils import batch_nested_array
#********* Adpations add_batch method - START *********#
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
#********* Adpations add_batch method - END *********#
# create batch dataset
dataset = replay_buffer.as_dataset(
sample_batch_size=32,
single_deterministic_pass=False)
experience_dataset_fn = lambda: dataset
This helped me to solve the issue regarding this post but now I run into another problem where I need to ask someone of the tf-agents-team...
--> It seems that the Learner/Actor structure is no able to work with another buffer than the ReverbBuffer, because the data-spec which is processed by the PyUniformReplayBuffer sets up a wrong buffer structure...
For anyone who has the same problem: I just created this Github-Issue report to get further answers and/or fix my lack of knowledge.
the full fix is shown below...
--> The dimensionality issue was valid and should indicate the the (uploaded) batched samples are not in the correct shape
--> This issue happens due to the fact that the "add_batch" method loads values with the wrong shape
rb_observer = replay_buffer.add_batch
Long story short, this line should be replaced by
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
--> Afterwards the (replay buffer) inputs are of correct shape and the Learner Actor Setup starts training.
The full replay buffer is shown below:
# create buffer for storing experience
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
agent.collect_data_spec,
1,
max_length=1000000)
# create batch dataset
dataset = replay_buffer.as_dataset(
sample_batch_size=32,
num_steps = 2,
single_deterministic_pass=False).prefetch(4)
# create batched nested array input for rb_observer
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
# create batched readout of dataset
experience_dataset_fn = lambda: dataset

Using LaBSE deployed to Google Cloud AI Platform

I deployed LaBSE model to AI Platform in the last past few days.
The issue I encounter is the answer of the request is above the limit of 2MB.
Several ideas I had to improve the situation:
make AI Platform return minified (not beautifully formatted) JSON (without spaces and newlines everywhere
make AI Plateform return the results in a binary format
since the answer is composed of ~13 outputs : change it to only one output
Do you know any ways of doing 1) or 2) ?
I spent lost of efforts on 3). I'm sure this one is possible. For example by editing the network before uploading it. Here are stuff I tried so far for that:
VERSION = 'v1'
MODEL = 'labse_2_b'
MODEL_DIR = BUCKET + '/' + MODEL
# Download the model
! wget 'https://tfhub.dev/google/LaBSE/2?tf-hub-format=compressed' \
--output-document='{MODEL}.tar.gz'
! mkdir {MODEL}
! tar -xzvf '{MODEL}.tar.gz' -C {MODEL}
# Attempts to load the model, edit it, and save it:
model.save(export_path, save_format='tf') # ValueError: Model <keras.engine.sequential.Sequential object at 0x7f87e833c650>
# cannot be saved because the input shapes have not been set.
# Usually, input shapes are automatically determined from calling
# `.fit()` or `.predict()`.
# To manually set the shapes, call `model.build(input_shape)`.
model.build(input_shape=(None,)) # cannot find a proper shape
# create a AI Plateform model version:
! gsutil -m cp -r '{MODEL}' {MODEL_DIR} # upload model to Google Cloud Storage
! gcloud ai-platform versions create $VERSION \
--model {MODEL} \
--origin {MODEL_DIR} \
--runtime-version=2.1 \
--framework='tensorflow' \
--python-version=3.7 \
--region="{REGION}"
Could some please help with with that?
Thanks a lot in advance,
EDIT :
For those wondering about this limitation, as in the comments below : Here are some complementary pieces of information:
A short sentence as
"I wish you a pleasant flight and a good meal aboard this plane."
which is just 16 parts of words long:
[101, 146, 34450, 15100, 170, 147508, 48088, 14999, 170, 17072, 66369, 351617, 15272, 69746, 119, 102]
cannot be processed:
Response size too large. Received at least 3220082 bytes; max is 2000000.". Details: "Response size too large. Received at least 3220082 bytes; max is 2000000.

Run twice occurred gpu error on Google Colaboratory

I want to test the FIFIQueue, when I use "with tf.device("/device:GPU:0"):", the first time I run, it's just ok, but when I run it twice, error occurred just print cannot assign gpu to fifo_queue_EnqueueMany(the error detail is in the image below), anyone warm-hearted to help me?
enter image description here
enter image description here
Per drpng's note on Tensorflow: using a FIFO queue for code running on GPUs I wouldn't expect FIFOQueue to schedule on a GPU, and indeed wrapping your code in a .py file (to see TF's logging output) and logging device placement confirms that even the first (successful) invocation schedules on a CPU.
In one cell run:
%%writefile go.py
import tensorflow as tf
config = tf.ConfigProto()
#config.allow_soft_placement=True
config.gpu_options.allow_growth = True
config.log_device_placement=True
def go():
Q = tf.FIFOQueue(3, tf.float16)
enq_many = Q.enqueue_many([[0.1, 0.2, 0.3],])
with tf.device('/device:GPU:0'):
with tf.Session(config=config) as sess:
sess.run(enq_many)
print(Q.size().eval())
go()
go()
And in another cell execute the above as:
!python3 go.py
and observe placement.
Uncomment the allow_soft_placement assignment to make the crash go away.
(I do not know why the first execution succeeds even in the face of non-soft-placement when asking FIFOQueue to schedule on the GPU explicitly as in your code's "first time")

Why tensor_summary doesn't work?

I use tf.summary.tensor_summary in my code, following this: https://www.tensorflow.org/api_docs/python/tf/summary/tensor_summary
But I didn't see anything new in tensorboard, and tensorboard also printed some warnings:
W0321 12:50:47.244003 Reloader tf_logging.py:121] This summary with tag 'component_3_finalize/wealth_tensor' is oddly not associated with a plugin.
How to make this work? Do I need install some plugin? I didn't find any docs on this.
UPDATE:
So I here is how I create my summary:
def finalize_graph(self, graph_building_context):
tf.summary.scalar('loss', loss)
# the wealth tensor is of shape [B], where B is the batch size at runtime
tf.summary.scalar('wealth', tf.reduce_mean(wealth))
# uhmm, this tensor_summary doesn't work yet
tf.summary.tensor_summary('wealth_tensor', wealth)
Then I use a MonitoredTrainingSession, by default it will save the summary, and I can see my loss and wealth scalar summry, but not this wealth_tensor summary.

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from Tensorflow core to the Tensorboard Backend. You can still use it to extract data from Tensorboard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E. g. get wall clock, number of steps and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
Easy, the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can e.g. be loaded in a Pandas dataframe in Python. Make sure you check the Data download links box.
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython
Notebook), that's possible too. You can directly depend on the
underlying classes that TensorBoard uses for loading data:
python/summary/event_accumulator.py (for loading data from a single
run) or python/summary/event_multiplexer.py (for loading data from
multiple runs, and keeping it organized). These classes load groups of
event files, discard data that was "orphaned" by TensorFlow crashes,
and organize the data by tag.
As another option, there is a script
(tensorboard/scripts/serialize_tensorboard.py) which will load a
logdir just like TensorBoard does, but write all of the data out to
disk as json instead of starting a server. This script is setup to
make "fake TensorBoard backends" for testing, so it is a bit rough
around the edges.
I think the data are encoded protobufs RecordReader format. To get serialized strings out of files you can use py_record_reader or build a graph with TFRecordReader op, and to deserialize those strings to protobuf use Event schema. If you get a working example, please update this q, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is tensorflows event accumulator
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E. g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
# Retrieve images, e. g. first labeled as 'generator'
img = acc.Images('generator/image/0')
with open('img_{}.png'.format(img.step), 'wb') as f:
f.write(img.encoded_image_string)
You can also use the tf.train.summaryiterator: To extract events in a ./logs-Folder where only classic scalars lr, acc, loss, val_acc and val_loss are present you can use this GIST: tensorboard_to_csv.py
Chris Cundy's answer works well when you have less than 10000 data points in your tfevent file. However, when you have a large file with over 10000 data points, Tensorboard will automatically sampling them and only gives you at most 10000 points. It is a quite annoying underlying behavior as it is not well-documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
To get around it and get all data points, a bit hacky way is to:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
class FalseDict(object):
def __getitem__(self,key):
return 0
def __contains__(self, key):
return True
event_acc = EventAccumulator('path/to/your/tfevents',size_guidance=FalseDict())
It looks like for tb version >=2.3 you can streamline the process of converting your tb events to a pandas dataframe using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
You can also use the EventFileLoader to iterate through a tensorboard file
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
print(event)
Surprisingly, the python package tb_parse has not been mentioned yet.
From documentation:
Installation:
pip install tensorflow # or tensorflow-cpu pip install -U tbparse # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)