I'm trying to figure out how to operate tensorboard.
I looked at the demo here:
https://www.tensorflow.org/code/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py
It runs well on my laptop.
Much of it makes sense to me.
So, I wrote a simple tensorflow demo:
# tensorboard_demo1.py
import tensorflow as tf
sess = tf.Session()
with tf.name_scope('scope1'):
    y1 = tf.constant(22.9) * 1.1
    tf.scalar_summary('y1 scalar_summary', y1)
train_writer = tf.train.SummaryWriter('/tmp/tb1', sess.graph)
print('Result:')
# Now I should run the compute graph:
print(sess.run(y1))
train_writer.close()
# done
It seems to run okay.
Next I ran a simple shell command:
tensorboard --logdir /tmp/tb1
It told me to browse 0.0.0.0:6006
Which I did.
The web page tells me:
No scalar data was found.
How do I enhance my demo so that it logs a scalar-summary which tensorboard will show me?
You must call train_writer.add_summary() to add some data to the log. For example, one common pattern is to use tf.merge_all_summaries() to create a tensor that implicitly incorporates information from all summaries created in the current graph:
# Creates a TensorFlow tensor that includes information from all summaries
# defined in the current graph.
summary_t = tf.merge_all_summaries()
# Computes the current value of all summaries in the current graph.
summary_val = sess.run(summary_t)
# Writes the summary to the log.
train_writer.add_summary(summary_val)
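Putting it together, here is a minimal sketch of how your demo might look with those two calls added (same legacy summary API you are already using):
# tensorboard_demo1.py (with summary logging added)
import tensorflow as tf

sess = tf.Session()

with tf.name_scope('scope1'):
    y1 = tf.constant(22.9) * 1.1
    tf.scalar_summary('y1 scalar_summary', y1)

# Merge every summary defined in the graph into a single tensor.
summary_t = tf.merge_all_summaries()

train_writer = tf.train.SummaryWriter('/tmp/tb1', sess.graph)

print('Result:')
# Evaluate both the value and its merged summary in one run() call.
result, summary_val = sess.run([y1, summary_t])
print(result)

# Write the serialized summary to the event file so TensorBoard can show it.
train_writer.add_summary(summary_val)
train_writer.close()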
Related
I would like to have a timing profile/trace of a predict call to get an estimate of how fast my model can perform inference.
Right now I'm using:
log_dir="logs/profile/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, profile_batch = 1)
x_test, y_test = next(iter(training_ds))
_ = unet.predict(x_test, verbose=1, callbacks=[tensorboard_callback])
But the profiling tab does not show up in tensorboard. What am I missing here?
First, see if CUPTI is correctly loading. In a terminal you should see something like:
2019-12-13 12:01:47.617853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.0
If it didn't find the CUPTI libraries, make sure that your LD_LIBRARY_PATH is set correctly. $ echo $LD_LIBRARY_PATH should return something like:
/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
If this is all set, run the following snippet of code, assuming you have described your model in tensorflow/keras:
import datetime
import tensorflow as tf

# Set up logging.
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
logdir = 'logs/trace/%s' % stamp
writer = tf.summary.create_file_writer(logdir)
tf.summary.trace_on(graph=True, profiler=True)

# Forward pass
inputs, labels = next(iter(dataset))  # dataset is a tf.data.Dataset object
your_model(inputs)

with writer.as_default():
    tf.summary.trace_export(name="model_trace", step=0, profiler_outdir=logdir)
One final step is critical for viewing the trace in TensorBoard: you have to open TensorBoard in Chrome for it to parse the .trace file correctly.
If you are using virtual environments, make sure you don't mix things up. See my answer to another question.
Also, there are four methods to write profiling data. Since you already tried the Keras callback, give the other methods a try: Overview of profiling methods.
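For example, one of those methods is the programmatic profiler API (tf.profiler.experimental.start/stop, available in newer TF 2.x releases). A minimal sketch, reusing the unet model and x_test batch from your snippet; the log directory name here is just an illustration:
import tensorflow as tf

logdir = "logs/profile/manual"

# Start the profiler, run the work you want to trace, then stop it.
# The trace is written to `logdir` and appears in TensorBoard's Profile tab.
tf.profiler.experimental.start(logdir)
_ = unet.predict(x_test, verbose=1)  # the inference call you want to profile
tf.profiler.experimental.stop()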
First of all, make sure that CUPTI is loading correctly. I ran into the same problem recently and ended up using the following:
import tensorflow as tf

logs = "../logs/"

for input_data in datas:
    with tf.profiler.experimental.Profile(logs):
        out_pred = model.predict(input_data)
Related
# Build a graph.
a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b
# Launch the graph in a session.
sess = tf.compat.v1.Session()
# Evaluate the tensor `c`.
print(sess.run(c))
The code above is taken from the TensorFlow Core r2.0 documentation, but it gives the error mentioned above.
The thing is that TensorFlow Core r2.0 has eager execution enabled by default, so you don't need to create a tf.compat.v1.Session() and call its .run() function at all.
If you do want to use tf.compat.v1.Session(), you need to call tf.compat.v1.disable_eager_execution() at the start of your program. After that, tf.compat.v1.Session() and .run() work as before.
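A minimal sketch of that first option, with the graph built after eager execution is disabled:
import tensorflow as tf

# Must be called before any other TensorFlow operations.
tf.compat.v1.disable_eager_execution()

# Build a graph.
a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b

# Launch the graph in a session and evaluate `c`.
sess = tf.compat.v1.Session()
print(sess.run(c))  # prints 30.0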
Alternatively, since TensorFlow Core r2.0 has eager execution enabled by default, we can leave that setting alone and just restructure the code so that the graph is built inside the session's context:
# Launch the graph in a session.
with tf.compat.v1.Session() as ses:
    # Build a graph.
    a = tf.constant(5.0)
    b = tf.constant(6.0)
    c = a * b
    # Evaluate the tensor `c`.
    print(ses.run(c))
This gives the output without any errors.
One more thing: if you do go the eager-disabling route, remember that tf.compat.v1.disable_eager_execution() has to be called at the very start of the program.
For more details, please go through the documentation.
If you have any issues, please feel free to ask.
By the way, I am just a beginner in TensorFlow and Keras.
Thank you!
Related
I'm trying to define a Sagemaker Training Job with an existing Python class. To my understanding, I could create my own container, but I would rather not deal with container management.
When choosing "Algorithm Source" there is the option of "Your own algorithm source" but nothing is listed under resources. Where does this come from?
I know I could do this through a notebook, but I really want this defined in a job that can be invoked through an endpoint.
As Bruno has said you will have to use a container somewhere, but you can use an existing container to run your own custom tensorflow code.
There is a good example in the sagemaker github for how to do this.
The way this works is you modify your code to have an entry point which takes argparse command line arguments, and then you point a 'Sagemaker Tensorflow estimator' to the entry point. Then when you call fit on the sagemaker estimator it will download the tensorflow container and run your custom code in there.
So you start off with your own custom code that looks something like this
# my_custom_code.py
import tensorflow as tf
import numpy as np


def build_net():
    # single fully connected layer
    image_place = tf.placeholder(tf.float32, [None, 28 * 28])
    label_place = tf.placeholder(tf.int32, [None])
    net = tf.layers.dense(image_place, units=1024, activation=tf.nn.relu)
    net = tf.layers.dense(net, units=10, activation=None)
    return image_place, label_place, net


def process_data():
    # load
    (x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
    # center
    x_train = x_train / 255.0
    m = x_train.mean()
    x_train = x_train - m
    # convert to right types
    x_train = x_train.astype(np.float32)
    y_train = y_train.astype(np.int32)
    # reshape so flat
    x_train = np.reshape(x_train, [-1, 28 * 28])
    return x_train, y_train


def train_model(init_learn, epochs):
    image_p, label_p, logit = build_net()
    x_train, y_train = process_data()
    # labels are sparse integer class ids, so use the sparse cross-entropy op
    # and reduce to a scalar loss
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logit,
            labels=label_p))
    optimiser = tf.train.AdamOptimizer(init_learn)
    train_step = optimiser.minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(epochs):
            sess.run(train_step, feed_dict={image_p: x_train, label_p: y_train})


if __name__ == '__main__':
    train_model(0.001, 10)
To make it work with sagemaker we need to create a command line entry point, which will allow sagemaker to run it in the container it will download for us eventually.
# entry.py
import argparse

from my_custom_code import train_model

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        '--model_dir',
        type=str)
    parser.add_argument(
        '--init_learn',
        type=float)
    parser.add_argument(
        '--epochs',
        type=int)

    args = parser.parse_args()
    train_model(args.init_learn, args.epochs)
Apart from specifying the arguments my function needs to take, we also need to provide a model_dir argument. This is always required, and is an S3 location where the model artifacts will be saved when the training job completes. Note that you don't need to specify what this value is (though you can), as Sagemaker will provide a default location in S3 for you.
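As a side note, SageMaker only uploads what your script actually writes. A minimal sketch of saving a checkpoint so it gets picked up, assuming the SM_MODEL_DIR environment variable that SageMaker script mode sets (the local directory whose contents are uploaded to S3 when the job finishes); this helper is not part of the code above:
# save_model.py (hypothetical helper, not part of the original code)
import os
import tensorflow as tf

def save_model(sess):
    # Script mode sets SM_MODEL_DIR (usually /opt/ml/model); everything
    # written here is uploaded to S3 after training finishes.
    model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
    saver = tf.train.Saver()
    saver.save(sess, os.path.join(model_dir, 'model.ckpt'))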
So we have modified our code, now we need to actually run it on Sagemaker. Go to the AWS console and fire up a small instance from Sagemaker. Download your custom code to the instance, and then create a jupyter notebook as follows:
# sagemaker_run.ipynb
import sagemaker
from sagemaker.tensorflow import TensorFlow

hyperparameters = {
    'epochs': 10,
    'init_learn': 0.001}

role = sagemaker.get_execution_role()
source_dir = '/path/to/folder/with/my/code/on/instance'

estimator = TensorFlow(
    entry_point='entry.py',
    source_dir=source_dir,
    train_instance_type='ml.t2.medium',
    train_instance_count=1,
    hyperparameters=hyperparameters,
    role=role,
    py_version='py3',
    framework_version='1.12.0',
    script_mode=True)

estimator.fit()
Running the above will:
Spin up an ml.t2.medium instance
Download the tensorflow 1.12.0 container to the instance
Download any data we specify in fit to the newly created instance (in this case nothing)
Run our code on the instance
Upload the model artifacts to model_dir
And that is pretty much it. There is of course a lot not mentioned here but you can:
Download training/testing data from s3 (see the sketch after this list)
Save checkpoint files, and tensorboard files during training and upload them to s3
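For instance, pointing fit at data in S3 is just a matter of passing channel names mapped to S3 prefixes (the bucket and prefixes below are made up for illustration); SageMaker downloads each channel into the training container before your entry point runs:
# Each key becomes an input channel; SageMaker copies the S3 data into
# /opt/ml/input/data/<channel_name> inside the training container.
estimator.fit({
    'training': 's3://my-example-bucket/mnist/train',
    'testing': 's3://my-example-bucket/mnist/test',
})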
The best resource I found was the example I shared but here are all the things I was looking at to get this working:
example code again
documentation
explanation of environment variables
I believe this is not possible, as you can see in this part of the SageMaker documentation. A container is needed to provide the capability to run with any language and framework.
The algorithms listed during training job creation are the algorithms you can create under SageMaker -> Training -> Algorithms. But it's necessary to define a container, which is a specification of how you do training and predictions. Even if you don't build a container, you will refer to an existing one (using a built-in algorithm), or you will use an algorithm from the marketplace which someone else built using an image.
I believe you could build an image that meets your needs, starting from an existing one.
After you build the image, you can easily use it to automate your training/prediction jobs from Lambda. Here is an example.
Also, you can provide as many input channels to your container as you need to load data; in theory, you could pass a channel that refers to a script you want to load when your container starts. That is just an idea I've had, but depending on your scenario it could be worth a test. Normally, you would customize the image during the docker build process, so if you have several different scripts, you can create a single image and just parameterize it to use a custom script.
Here you can find a custom image that uses Tensorflow.
Here are a lot of examples of building containers for several frameworks, including Tensorflow.
I hope this helps; let me know if you need more info.
Regards.
I am not sure if this helps you, but you can make use of TensorFlow estimators, which are like built-in containers from AWS. You need a training script and a requirements.txt file listing the dependencies you may need. You can follow this link for more information: Sagemaker TensorFlow estimators documentation
Related
I'm struggling to run TensorFlow (v1.1) code multiple times in a Jupyter Notebook.
For example, I execute this simple code snippet that creates an encoding layer for a seq2seq model:
# Construct encoder layer (LSTM)
encoder_cell = tf.contrib.rnn.LSTMCell(encoder_hidden_units)
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs_embedded,
    dtype=tf.float32, time_major=False
)
First time is totally fine, my encoder is created.
However, if I rerun it (no matter the changes I've applied), I get this error:
Attempt to have a second RNNCell use the weights of a variable scope that already has weights
It's very annoying as it forces me to restart the kernel every time I want to change a layer.
Can someone explain to me why this happens and how I can fix it?
Thanks!
You are trying to build the exact same graph twice and therefore TensorFlow complains because the variables already exist in the default graph.
What you could do is to call tf.reset_default_graph() before trying to call the method a second time to ensure you create a new graph when required.
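For example, a minimal sketch of such a cell (the hidden-unit count and the embedded-inputs placeholder are stand-ins for the ones in your code; the important part is that everything is rebuilt after the reset):
import tensorflow as tf

# Clear the default graph so the LSTM variables can be created again.
# Every tensor created before this call belongs to the old graph and
# must be recreated below.
tf.reset_default_graph()

encoder_hidden_units = 128  # hypothetical value for illustration
encoder_inputs_embedded = tf.placeholder(tf.float32, [None, None, 64])  # stand-in

encoder_cell = tf.contrib.rnn.LSTMCell(encoder_hidden_units)
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs_embedded,
    dtype=tf.float32, time_major=False
)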
Just in case, I would also suggest using an interactive session as described here in the Start TensorFlow InteractiveSession section:
import tensorflow as tf
sess = tf.InteractiveSession()
Related
I'm working with the Seq2Seq example in TensorFlow. I can run the training and see the perplexity output on the development set. It's great!
I just want to add summaries (especially scalar_summary such as perplexity on dev set) to the event file and monitor them in TensorBoard. After reading the documentation, I don't understand how to annotate translate.py with summary ops.
Can anybody help me with simple pseudo-code?
It looks like translate.py doesn't create a TensorBoard summary log at all. (Part of the reason may be that much of the evaluation happens in Python, rather than in the TensorFlow graph.) Let's see how to add one.
You'll need to create a tf.train.SummaryWriter. Add the following before entering the training loop (here):
summary_writer = tf.train.SummaryWriter("path/to/logs", sess.graph_def)
You'll need to create summary events for the perplexity in each bucket. These values are computed in Python, so you can't use the usual tf.scalar_summary() op. Instead, you'll create a tf.Summary directly by modifying this loop:
perplexity_summary = tf.Summary()

# Run evals on development set and print their perplexity.
for bucket_id in xrange(len(_buckets)):
    encoder_inputs, decoder_inputs, target_weights = model.get_batch(
        dev_set, bucket_id)
    _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                 target_weights, bucket_id, True)
    eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
    print("  eval: bucket %d perplexity %.2f" % (bucket_id, eval_ppx))

    bucket_value = perplexity_summary.value.add()
    bucket_value.tag = "perplexity_bucket_%d" % bucket_id
    bucket_value.simple_value = eval_ppx

summary_writer.add_summary(perplexity_summary, model.global_step.eval())
You can add other metrics by constructing tf.Summary values yourself and calling summary_writer.add_summary().
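For example, a minimal sketch of logging one more Python-side scalar the same way (current_learning_rate is assumed here to be an ordinary Python float you already have):
# Build a tf.Summary protobuf by hand and write it with the same writer.
lr_summary = tf.Summary()
lr_value = lr_summary.value.add()
lr_value.tag = "learning_rate"
lr_value.simple_value = current_learning_rate
summary_writer.add_summary(lr_summary, model.global_step.eval())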