I want to test FIFOQueue. When I use "with tf.device("/device:GPU:0"):", the first run works fine, but when I run it a second time an error occurs saying it cannot assign a GPU to fifo_queue_EnqueueMany (the error detail is in the image below). Would anyone be kind enough to help me?
[screenshot of the error message]
Per drpng's note on Tensorflow: using a FIFO queue for code running on GPUs, I wouldn't expect FIFOQueue to schedule on a GPU, and indeed wrapping your code in a .py file (to see TF's logging output) and logging device placement confirms that even the first (successful) invocation schedules on the CPU.
In one cell run:
%%writefile go.py
import tensorflow as tf

config = tf.ConfigProto()
#config.allow_soft_placement=True
config.gpu_options.allow_growth = True
config.log_device_placement = True

def go():
    Q = tf.FIFOQueue(3, tf.float16)
    enq_many = Q.enqueue_many([[0.1, 0.2, 0.3],])
    with tf.device('/device:GPU:0'):
        with tf.Session(config=config) as sess:
            sess.run(enq_many)
            print(Q.size().eval())

go()
go()
And in another cell execute the above as:
!python3 go.py
and observe placement.
Uncomment the allow_soft_placement assignment to make the crash go away.
(I do not know why the first execution succeeds even without soft placement when FIFOQueue is explicitly asked to schedule on the GPU, as in your code's "first time".)
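If you would rather not enable soft placement globally, a minimal sketch of an alternative workaround (my own suggestion, not from drpng's answer) is to pin the queue ops to the CPU explicitly, since queue ops are CPU-only:

import tensorflow as tf

def go():
    # Pin the queue ops to the CPU explicitly; this sidesteps the
    # placement error without needing allow_soft_placement.
    with tf.device('/cpu:0'):
        Q = tf.FIFOQueue(3, tf.float16)
        enq_many = Q.enqueue_many([[0.1, 0.2, 0.3],])
    with tf.Session() as sess:
        sess.run(enq_many)
        print(sess.run(Q.size()))

go()
go()  # the second invocation should now succeed as well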
Related
I am running code from another repository, but my issue is general, so I am posting it here. Running their code, I get an error along the lines of Expected all tensors to be on the same device, found two: cpu and cuda:0. I have already verified that the model is on cuda:0; the issue is that the dataloader object used is not set to the device. Also, the datasets/models I use here are huggingface-transformers models and huggingface datasets.
Here is the relevant block of code where the issue arises:
eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset
eval_dataloader = self.get_eval_dataloader(eval_dataset)
eval_examples = self.eval_examples if eval_examples is None else eval_examples
compute_metrics = self.compute_metrics
self.compute_metrics = None
eval_loop = (self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop)
try:
    # this is where the error occurs
    output = eval_loop(
        eval_dataloader,
        description="Evaluation",
        prediction_loss_only=True if compute_metrics is None else None,
        ignore_keys=ignore_keys,
    )
For context, this occurs inside an evaluate() method of a class inheriting from Seq2SeqTrainer from huggingface. I have tried using something like
for i, (inputs, labels) in eval_dataloader:
    inputs, labels = inputs.to(device), labels.to(device)
But that doesn't work (it gives an error of Too many values to unpack (expected 2)). Is there any other way I can send this dataloader to the GPU? In particular, is there any way I can edit the evaluation_loop method of the Transformers Trainer to move the batches to the GPU or something?
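For reference, here is a minimal sketch of what I mean by moving batches to the device (assuming each batch is a dict of tensors, as is typical for huggingface dataloaders, which would also explain the unpacking error; the names here are illustrative):

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

for step, batch in enumerate(eval_dataloader):
    # Move every tensor in the dict-style batch to the target device.
    batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v
             for k, v in batch.items()}
    # ... run the model on `batch` here ...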
As far as I know, when running a TensorFlow model from a Python script, I can use the following code snippet to profile the timeline of each block in the model.
from tensorflow.python.client import timeline
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
batch_positive_score = sess.run([positive_score], feed_dict, options=options, run_metadata=run_metadata)
fetched_timeline = timeline.Timeline(run_metadata.step_stats)
chrome_trace = fetched_timeline.generate_chrome_trace_format()
with open('./result/timeline.json', 'w') as f:
    f.write(chrome_trace)
But how can I profile a model that is loaded in TensorFlow Serving?
I think you can use tf.profiler, even during serving, because it is ultimately a TensorFlow graph, and the changes made during training (including profiling, as per my understanding) will be reflected in serving as well.
Please find the TensorFlow code below:
# User can control the tracing steps and
# dumping steps. User can also run online profiling during training.
#
# Create options to profile time/memory as well as parameters.
builder = tf.profiler.ProfileOptionBuilder
opts = builder(builder.time_and_memory()).order_by('micros').build()
opts2 = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()

# Collect traces of steps 10~20, dump the whole profile (with traces of
# step 10~20) at step 20. The dumped profile can be used for further profiling
# with command line interface or Web UI.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir',
                                      trace_steps=range(10, 20),
                                      dump_steps=[20]) as pctx:
    # Run online profiling with 'op' view and 'opts' options at step 15, 18, 20.
    pctx.add_auto_profiling('op', opts, [15, 18, 20])
    # Run online profiling with 'scope' view and 'opts2' options at step 20.
    pctx.add_auto_profiling('scope', opts2, [20])
    # High level API, such as slim, Estimator, etc.
    train_loop()
After that, we can run the commands mentioned below in the command prompt:
bazel-bin/tensorflow/core/profiler/profiler \
--profile_path=/tmp/train_dir/profile_xx
tfprof> op -select micros,bytes,occurrence -order_by micros
# Profiler ui available at: https://github.com/tensorflow/profiler-ui
python ui.py --profile_context_path=/tmp/train_dir/profile_xx
Code for Visualizing Time and Memory:
# The following example generates a timeline.
tfprof> graph -step -1 -max_depth 100000 -output timeline:outfile=<filename>
generating trace file.
******************************************************
Timeline file is written to <filename>.
Open a Chrome browser, enter URL chrome://tracing and load the timeline file.
******************************************************
Attribute TensorFlow graph running time to your Python codes:
tfprof> code -max_depth 1000 -show_name_regexes .*model_analyzer.*py.* -select micros -account_type_regexes .* -order_by micros
Show your model variables and the number of parameters:
tfprof> scope -account_type_regexes VariableV2 -max_depth 4 -select params
Show the most expensive operation types:
tfprof> op -select micros,bytes,occurrence -order_by micros
Auto-profile:
tfprof> advise
For more detailed information, you can refer to the links below:
To understand all the classes mentioned on this page =>
https://www.tensorflow.org/api_docs/python/tf/profiler
The code is given in detail in the link below:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md
I used the following code in PyCharm:
import tensorflow as tf
sess = tf.Session()
a = tf.constant(value=5, name='input_a')
b = tf.constant(value=3, name='input_b')
c = tf.multiply(a,b, name='mult_c')
d = tf.add(a,b, name='add_d')
e = tf.add(c,d, name='add_e')
print(sess.run(e))
writer = tf.summary.FileWriter("./tb_graph", sess.graph)
Then, I pasted the following line into the Anaconda Prompt:
tensorboard --logdir=="tb_graph"
I tried both with "" and '', as proposed in Tensorboard: No graph definition files were found., and it does nothing for me.
I had a similar issue. It occurred when I specified the logdir folder inside single quotes instead of double quotes. Hope this may be helpful to you.
e.g.: tensorboard --logdir='my_graph' -> TensorBoard didn't detect the graph
tensorboard --logdir="my_graph" -> TensorBoard detected the graph
I checked the code on a laptop with Ubuntu 16.04 and on another one with Win10, so it probably isn't a system-specific error.
I also tried adding and removing --host=127.0.0.1 in the Anaconda Prompt and checking several times both http://localhost:6006/ and http://desktop-.......:6006/.
Still same error:
No graph definition files were found.
To store a graph, create a tf.summary.FileWriter and pass the graph either via the constructor, or by calling its add_graph() method. You may want to check out the graph visualizer tutorial.
....
Please tell me what is wrong in the code/prompt command?
EDIT: On Ubuntu I used the normal terminal, of course.
EDIT2: I used both = and == in command prompt
The answer to my question is:
1) change "./new1_dir" into ".\\new1_dir"
and
2) put the full path to the file into the Anaconda Prompt: --logdir="C:\Users\Admin\Documents\PycharmProjects\try_tb\new1_dir"
Thanks @BugKiller for your help!
EDIT: Working only on Windows for me, but still better than nothing
EDIT2: Works on Ubuntu 16.04 too
Below is a code snippet that I use to monitor events when training a DNNRegressor. I am running from a Jupyter notebook.
During training, I get the following errors in the terminal:
E tensorflow/core/util/events_writer.cc:162] The events file /Users/eran/Genie/PNP/TB/events.out.tfevents.1473067505.Eran has disappeared.
E tensorflow/core/util/events_writer.cc:131] Failed to flush 2498 events to /Users/eran/Genie/PNP/TB/events.out.tfevents.1473067505.Eran
def add_monitors():
    validation_metrics = {'MeanSquaredError': tf.contrib.metrics.streaming_mean_squared_error}
    monitors = learn.monitors.ValidationMonitor(valid_X, valid_y, every_n_steps=50, metrics=validation_metrics)
    return [monitors]

regressor = learn.DNNRegressor(model_dir='/Users/eran/Genie/PNP/TB',
                               hidden_units=[32, 16],
                               feature_columns=learn.infer_real_valued_columns_from_input(X),
                               optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.1),
                               config=learn.RunConfig(save_checkpoints_secs=1))

monitors = add_monitors()
regressor.fit(X, y, steps=10000, batch_size=20, monitors=monitors)
Any ideas? When opening TensorBoard, I do not see any events being recorded.
Check whether, in your code, you added directory-recreation code such as tf.gfile.DeleteRecursively(log_dir); tf.gfile.MakeDirs(log_dir), where log_dir is the path to the events file. This step must be done before creating any summary writer; otherwise TF will not be able to find the right events file.
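A minimal sketch of the ordering described above (the log_dir path is taken from the question; the rest is my illustration, not the original poster's code):

import tensorflow as tf

log_dir = '/Users/eran/Genie/PNP/TB'  # path from the question

# Recreate the log directory BEFORE constructing any summary writer.
# Deleting it afterwards is what makes the open events file "disappear".
if tf.gfile.Exists(log_dir):
    tf.gfile.DeleteRecursively(log_dir)
tf.gfile.MakeDirs(log_dir)

writer = tf.summary.FileWriter(log_dir)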
If you use Windows, give the directory like this:
model_dir='C:\\Users\\eran\\Genie\\PNP\\TB'
Running TensorFlow and TensorBoard in Docker here.
I was trying to write the simplest code to demonstrate how TensorBoard may work:
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
    a = tf.constant(5.0)
    b = tf.constant(6.0)
    c = a * b

    # Enter data into summary.
    c_summary = tf.scalar_summary("c", c)
    merged = tf.merge_all_summaries()

    with tf.Session(graph=graph) as session:
        writer = tf.train.SummaryWriter("log/test_logs", session.graph_def)
        result = session.run([merged])
        tf.initialize_all_variables().run()
        writer.add_summary(result[0], 0)
I then ran tensorboard --logdir={absolute path to log/test_logs}, but no events were listed there. Is there anything I should have written differently in the code?
Note that log/test_logs does contain files like events.out.tfevents.1459102927.0a8840dee548.
I am not sure whether this is your case.
SummaryWriter by default stores summaries in its buffer and flushes them periodically (every 120 seconds, I guess? Not sure).
So maybe you just did not wait until the flush happened. Try to manually flush the SummaryWriter, or just close() it at the end of your program.
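A minimal sketch of that fix, applied to the snippet above (illustrative only):

writer = tf.train.SummaryWriter("log/test_logs", session.graph_def)
result = session.run([merged])
writer.add_summary(result[0], 0)
writer.flush()  # force buffered summaries to disk immediately
writer.close()  # or simply close the writer at the end of the program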