RuntimeError: CUDA out of memory when training PyTorch "Pose2Seg"

I am running the code from https://github.com/erezposner/Pose2Seg
and followed all the steps in this tutorial: https://towardsdatascience.com/detection-free-human-instance-segmentation-using-pose2seg-and-pytorch-72f48dc4d23e
However, I get this CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.57 GiB already allocated; 74.77 MiB free; 2.85 GiB reserved in total by PyTorch) (malloc at ..\c10\cuda\CUDACachingAllocator.cpp:289) (no backtrace available)
How can I solve this?
(base) C:\Users\ASUS\Pose2Seg>python train.py
06-23 07:30:01 ===========> loading model <===========
total params in model is 334, in pretrained model is 336, init 334
06-23 07:30:03 ===========> loading data <===========
loading annotations into memory...
Done (t=4.56s)
creating index...
index created!
06-23 07:30:08 ===========> set optimizer <===========
06-23 07:30:08 ===========> training <===========
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:2796: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:2973: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode))
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:3289: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:3226: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
06-23 07:30:13 Epoch: [0][0/56599] Lr: [6.68e-05] Time 4.228 (4.228) Data 0.028 (0.028) loss 0.85738 (0.85738)
06-23 07:30:22 Epoch: [0][10/56599] Lr: [6.813333333333334e-05] Time 0.847 (1.280) Data 0.012 (0.051) loss 0.44195 (0.71130)
06-23 07:30:33 Epoch: [0][20/56599] Lr: [6.946666666666667e-05] Time 0.882 (1.180) Data 0.045 (0.037) loss 0.41523 (0.60743)
Traceback (most recent call last):
File "train.py", line 157, in <module>
optimizer, epoch, iteration)
File "train.py", line 74, in train
loss.backward()
File "C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\autograd\__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.57 GiB already allocated; 74.77 MiB free; 2.85 GiB reserved in total by PyTorch) (malloc at ..\c10\cuda\CUDACachingAllocator.cpp:289)
(no backtrace available)
Environment: cudatoolkit == 10.1.243, Python 3.6.5.
The library versions:
>>> import tensorflow
2020-06-23 09:45:01.840827: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
>>> tensorflow.__version__
'2.2.0'
>>> import keras
Using TensorFlow backend.
>>> keras.__version__
'2.3.1'
>>> import torch
>>> torch.__version__
'1.5.1'
>>> import torchvision
>>> torchvision.__version__
'0.6.1'
>>> import pycocotools
The train.py code:
import os
import sys
import time
import logging
import argparse
import numpy as np
from tqdm import tqdm
import torch
import torch.utils.data
from lib.averageMeter import AverageMeters
from lib.logger import colorlogger
from lib.timer import Timers
from lib.averageMeter import AverageMeters
from lib.torch_utils import adjust_learning_rate
import os
from modeling.build_model import Pose2Seg
from datasets.CocoDatasetInfo import CocoDatasetInfo, annToMask
from test import test
NAME = "release_base"
# Set `LOG_DIR` and `SNAPSHOT_DIR`
def setup_logdir():
    timestamp = time.strftime("%Y-%m-%d_%H_%M_%S", time.localtime())
    LOGDIR = os.path.join(os.getcwd(), 'logs', '%s_%s' % (NAME, timestamp))
    SNAPSHOTDIR = os.path.join(
        os.getcwd(), 'snapshot', '%s_%s' % (NAME, timestamp))
    if not os.path.exists(LOGDIR):
        os.makedirs(LOGDIR)
    if not os.path.exists(SNAPSHOTDIR):
        os.makedirs(SNAPSHOTDIR)
    return LOGDIR, SNAPSHOTDIR
LOGDIR, SNAPSHOTDIR = setup_logdir()
# Set logging
logger = colorlogger(log_dir=LOGDIR, log_name='train_logs.txt')
# Set Global Timer
timers = Timers()
# Set Global AverageMeter
averMeters = AverageMeters()
def train(model, dataloader, optimizer, epoch, iteration):
    # switch to train mode
    model.train()
    averMeters.clear()
    end = time.time()
    for i, inputs in enumerate(dataloader):
        averMeters['data_time'].update(time.time() - end)
        iteration += 1
        lr = adjust_learning_rate(optimizer, iteration, BASE_LR=0.0002,
                                  WARM_UP_FACTOR=1.0/3, WARM_UP_ITERS=1000,
                                  STEPS=(0, 14150*15, 14150*20), GAMMA=0.1)
        # forward
        outputs = model(**inputs)
        # loss
        loss = outputs
        # backward
        averMeters['loss'].update(loss.data.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # measure elapsed time
        averMeters['batch_time'].update(time.time() - end)
        end = time.time()
        if i % 10 == 0:
            logger.info('Epoch: [{0}][{1}/{2}]\t'
                        'Lr: [{3}]\t'
                        'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                        'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
                        'loss {loss.val:.5f} ({loss.avg:.5f})\t'
                        .format(
                            epoch, i, len(dataloader), lr,
                            batch_time=averMeters['batch_time'], data_time=averMeters['data_time'],
                            loss=averMeters['loss'])
                        )
        if i % 10000 == 0:
            torch.save(model.state_dict(), os.path.join(
                SNAPSHOTDIR, '%d_%d.pkl' % (epoch, i)))
            torch.save(model.state_dict(), os.path.join(
                SNAPSHOTDIR, 'last.pkl'))
    return iteration
class Dataset():
    def __init__(self):
        ImageRoot = r'C:\Users\ASUS\Pose2Seg\data\coco2017\train2017'
        AnnoFile = r'C:\Users\ASUS\Pose2Seg\data\coco2017\annotations\person_keypoints_train2017_pose2seg.json'
        self.datainfos = CocoDatasetInfo(
            ImageRoot, AnnoFile, onlyperson=True, loadimg=True)

    def __len__(self):
        return len(self.datainfos)

    def __getitem__(self, idx):
        rawdata = self.datainfos[idx]
        img = rawdata['data']
        image_id = rawdata['id']
        height, width = img.shape[0:2]
        gt_kpts = np.float32(rawdata['gt_keypoints']).transpose(
            0, 2, 1)  # (N, 17, 3)
        gt_segms = rawdata['segms']
        gt_masks = np.array([annToMask(segm, height, width)
                             for segm in gt_segms])
        return {'img': img, 'kpts': gt_kpts, 'masks': gt_masks}

    def collate_fn(self, batch):
        batchimgs = [data['img'] for data in batch]
        batchkpts = [data['kpts'] for data in batch]
        batchmasks = [data['masks'] for data in batch]
        return {'batchimgs': batchimgs, 'batchkpts': batchkpts, 'batchmasks': batchmasks}
if __name__ == '__main__':
    logger.info('===========> loading model <===========')
    model = Pose2Seg().cuda()
    # model.init("")
    model.train()

    logger.info('===========> loading data <===========')
    datasetTrain = Dataset()
    dataloaderTrain = torch.utils.data.DataLoader(datasetTrain, batch_size=1, shuffle=True,
                                                  num_workers=0, pin_memory=False,
                                                  collate_fn=datasetTrain.collate_fn)

    logger.info('===========> set optimizer <===========')
    ''' set your optimizer like this. Normally is Adam/SGD. '''
    # optimizer = torch.optim.SGD(model.parameters(), 0.0002, momentum=0.9, weight_decay=0.0005)
    optimizer = torch.optim.Adam(
        model.parameters(), 0.0002, weight_decay=0.0000)

    iteration = 0
    epoch = 0
    try:
        while iteration < 14150*25:
            logger.info('===========> training <===========')
            iteration = train(model, dataloaderTrain,
                              optimizer, epoch, iteration)
            epoch += 1

            logger.info('===========> testing <===========')
            test(model, dataset='cocoVal', logger=logger.info)
            test(model, dataset='OCHumanVal', logger=logger.info)
    except (KeyboardInterrupt):
        logger.info('Save ckpt on exception ...')
        torch.save(model.state_dict(), os.path.join(
            SNAPSHOTDIR, 'interrupt_%d_%d.pkl' % (epoch, iteration)))
        logger.info('Save ckpt done.')

Your GPU doesn't have enough memory for this model. Try reducing the batch size first; if the error persists, reduce the input image size as well. It should then train without running out of memory.
By the way, for this type of model, 8 GB of GPU memory is recommended.
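As a rough illustration (not part of the original repository), one way to cut memory use is to downscale each image, together with its keypoints and masks, before it reaches the model, for example inside Dataset.__getitem__ above. The downscale_sample helper, the 0.5 scale factor and the use of OpenCV (cv2) below are my own assumptions, only a sketch:

import cv2
import numpy as np

def downscale_sample(img, kpts, masks, scale=0.5):
    """Resize an image, its (N, 17, 3) keypoints and its (N, H, W) masks to reduce GPU memory."""
    h, w = img.shape[:2]
    new_w, new_h = int(w * scale), int(h * scale)
    img_small = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    kpts_small = kpts.copy()
    kpts_small[:, :, :2] *= scale  # only x and y are scaled, the visibility flag stays
    masks_small = np.stack([cv2.resize(m, (new_w, new_h), interpolation=cv2.INTER_NEAREST)
                            for m in masks]) if len(masks) else masks
    return img_small, kpts_small, masks_small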

Related

tflite_model_maker if obj['difficult'] == 'Unspecified': KeyError: 'difficult'

I am trying to train a TFLite model using only the "person" class of the COCO dataset.
I am using TFLite Model Maker for training and FiftyOne to process the dataset.
When running the training script (.py) I get the error below.
root@85ac26b47f92:/external# python demofie.py
2022-11-01 21:02:01.059188: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
2022-11-01 21:02:01.059234: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 85ac26b47f92
2022-11-01 21:02:01.059242: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 85ac26b47f92
2022-11-01 21:02:01.059324: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-11-01 21:02:01.059381: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
2022-11-01 21:02:01.059821: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "demofie.py", line 20, in <module>
train_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/train/data',annotations_dir='/external/train/labels', label_map=['person'],ignore_difficult_instances= False,num_shards = 100)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/core/data_util/object_detector_dataloader.py", line 217, in from_pascal_voc
cache_writer.write_files(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/core/data_util/object_detector_dataloader_util.py", line 252, in write_files
tf_example = create_pascal_tfrecord.dict_to_tf_example(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/third_party/efficientdet/dataset/create_pascal_tfrecord.py", line 162, in dict_to_tf_example
if obj['difficult'] == 'Unspecified':
KeyError: 'difficult'
The code that causes the error is below. Can anyone with better coding knowledge than me shed some light on any mistakes I may have made?
I have added the FiftyOne code below it (which runs without error).
import numpy as np
import os
from tflite_model_maker.config import QuantizationConfig
from tflite_model_maker.config import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import object_detector
import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')
from absl import logging
logging.set_verbosity(logging.ERROR)
spec = model_spec.get('efficientdet_lite1')
train_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/train/data',annotations_dir='/external/train/labels', label_map=['person'],ignore_difficult_instances= False,num_shards = 100)
validation_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/val/data',annotations_dir='/external/val/labels',label_map= ['person'],ignore_difficult_instances= False,num_shards = 100)
test_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/test/data',annotations_dir='/external/test/labels',label_map= ['person'],ignore_difficult_instances= False,num_shards = 100)
model = object_detector.create(train_data, model_spec=spec, batch_size=8,epochs=2000, train_whole_model=True, validation_data=validation_data)
model.evaluate(test_data)
model.export(export_dir='/external/')
Dataset generation code:
import fiftyone.zoo as foz
import fiftyone as fo
from fiftyone import ViewField as F
cocodataset_test = foz.load_zoo_dataset(
"coco-2017",
splits="test",
label_types=["detections"],
classes=["person"],
only_matching=True,
# max_samples=50,
)
cocodataset_validation = foz.load_zoo_dataset(
"coco-2017",
splits="validation",
label_types=["detections"],
classes=["person"],
only_matching=True,
# max_samples=50
)
cocodataset_train = foz.load_zoo_dataset(
"coco-2017",
splits="train",
label_types=["detections"],
classes=["person"],
only_matching=True,
# max_samples=50,
)
cocodataset_validation.export(
'/external/val',
fo.types.VOCDetectionDataset,
)
cocodataset_train.export(
'/external/train/',
fo.types.VOCDetectionDataset,
)
cocodataset_test.export(
'/external/test/',
fo.types.VOCDetectionDataset,
)

Using BatchedPyEnvironment in tf_agents

I am trying to create a batched-environment version of an SAC agent example from the TensorFlow Agents library; the original code can be found here. I am also using a custom environment.
I am pursuing a batched environment setup in order to better leverage GPU resources and speed up training. My understanding is that by passing batches of trajectories to the GPU, there will be less overhead incurred when passing data from the host (CPU) to the device (GPU).
My custom environment is called SacEnv, and I attempt to create a batched environment like so:
py_envs = [SacEnv() for _ in range(0, batch_size)]
batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
tf_env = tf_py_environment.TFPyEnvironment(batched_env)
My hope is that this will create a batched environment consisting of a 'batch' of non-batched environments. However, I receive the following error when running the code:
ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
with the stack trace:
Traceback (most recent call last):
File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 370, in <module>
app.run(main)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 366, in main
train_eval(FLAGS.root_dir)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 274, in train_eval
results = metric_utils.eager_compute(
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/eval/metric_utils.py", line 163, in eager_compute
common.function(driver.run)(time_step, policy_state)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 211, in run
return self._run_fn(
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/utils/common.py", line 188, in with_check_resource_vars
return fn(*fn_args, **fn_kwargs)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 238, in _run
tf.while_loop(
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in loop_body
observer_ops = [observer(traj) for observer in self._observers]
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in <listcomp>
observer_ops = [observer(traj) for observer in self._observers]
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 93, in __call__
return self._update_state(*args, **kwargs)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 81, in _update_state
return self.call(*arg, **kwargs)
ValueError: in user code:
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metrics.py", line 176, in call *
self._return_accumulator.assign(
ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
In call to configurable 'eager_compute' (<function eager_compute at 0x7fa4d6e5e040>)
In call to configurable 'train_eval' (<function train_eval at 0x7fa4c8622dc0>)
I have dug through the tf_metric.py code to try to understand the error, but without success. A related issue was solved by adding the batch size (32) to the initializer of the AverageReturnMetric instance, and this issue seems related.
The full code is:
# coding=utf-8
# Copyright 2020 The TF-Agents Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
r"""Train and Eval SAC.
All hyperparameters come from the SAC paper
https://arxiv.org/pdf/1812.05905.pdf
To run:
```bash
tensorboard --logdir $HOME/tmp/sac/gym/HalfCheetah-v2/ --port 2223 &
python tf_agents/agents/sac/examples/v2/train_eval.py \
--root_dir=$HOME/tmp/sac/gym/HalfCheetah-v2/ \
--alsologtostderr
\```
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from sac_env import SacEnv
import os
import time
from absl import app
from absl import flags
from absl import logging
import gin
from six.moves import range
import tensorflow as tf # pylint: disable=g-explicit-tensorflow-version-import
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.agents.sac import tanh_normal_projection_network
from tf_agents.drivers import dynamic_step_driver
#from tf_agents.environments import suite_mujoco
from tf_agents.environments import tf_py_environment
from tf_agents.environments import batched_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import actor_distribution_network
from tf_agents.policies import greedy_policy
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common
from tf_agents.train.utils import strategy_utils
flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
'Root directory for writing logs/summaries/checkpoints.')
flags.DEFINE_multi_string('gin_file', None, 'Path to the trainer config files.')
flags.DEFINE_multi_string('gin_param', None, 'Gin binding to pass through.')
FLAGS = flags.FLAGS
gpus = tf.config.list_physical_devices('GPU')
if gpus:
try:
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
print(e)
#gin.configurable
def train_eval(
root_dir,
env_name='SacEnv',
# The SAC paper reported:
# Hopper and Cartpole results up to 1000000 iters,
# Humanoid results up to 10000000 iters,
# Other mujoco tasks up to 3000000 iters.
num_iterations=3000000,
actor_fc_layers=(256, 256),
critic_obs_fc_layers=None,
critic_action_fc_layers=None,
critic_joint_fc_layers=(256, 256),
# Params for collect
# Follow https://github.com/haarnoja/sac/blob/master/examples/variants.py
# HalfCheetah and Ant take 10000 initial collection steps.
# Other mujoco tasks take 1000.
# Different choices roughly keep the initial episodes about the same.
#initial_collect_steps=10000,
initial_collect_steps=2000,
collect_steps_per_iteration=1,
replay_buffer_capacity=31250, # 1000000 / 32
# Params for target update
target_update_tau=0.005,
target_update_period=1,
# Params for train
train_steps_per_iteration=1,
#batch_size=256,
batch_size=32,
actor_learning_rate=3e-4,
critic_learning_rate=3e-4,
alpha_learning_rate=3e-4,
td_errors_loss_fn=tf.math.squared_difference,
gamma=0.99,
reward_scale_factor=0.1,
gradient_clipping=None,
use_tf_functions=True,
# Params for eval
num_eval_episodes=30,
eval_interval=10000,
# Params for summaries and logging
train_checkpoint_interval=50000,
policy_checkpoint_interval=50000,
rb_checkpoint_interval=50000,
log_interval=1000,
summary_interval=1000,
summaries_flush_secs=10,
debug_summaries=False,
summarize_grads_and_vars=False,
eval_metrics_callback=None):
"""A simple train and eval for SAC."""
root_dir = os.path.expanduser(root_dir)
train_dir = os.path.join(root_dir, 'train')
eval_dir = os.path.join(root_dir, 'eval')
train_summary_writer = tf.compat.v2.summary.create_file_writer(
train_dir, flush_millis=summaries_flush_secs * 1000)
train_summary_writer.set_as_default()
eval_summary_writer = tf.compat.v2.summary.create_file_writer(
eval_dir, flush_millis=summaries_flush_secs * 1000)
eval_metrics = [
tf_metrics.AverageReturnMetric(buffer_size=num_eval_episodes),
tf_metrics.AverageEpisodeLengthMetric(buffer_size=num_eval_episodes)
]
global_step = tf.compat.v1.train.get_or_create_global_step()
with tf.compat.v2.summary.record_if(
lambda: tf.math.equal(global_step % summary_interval, 0)):
py_envs = [SacEnv() for _ in range(0, batch_size)]
batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
tf_env = tf_py_environment.TFPyEnvironment(batched_env)
eval_py_envs = [SacEnv() for _ in range(0, batch_size)]
eval_batched_env = batched_py_environment.BatchedPyEnvironment(envs=eval_py_envs)
eval_tf_env = tf_py_environment.TFPyEnvironment(eval_batched_env)
time_step_spec = tf_env.time_step_spec()
observation_spec = time_step_spec.observation
action_spec = tf_env.action_spec()
strategy = strategy_utils.get_strategy(tpu=False, use_gpu=True)
with strategy.scope():
actor_net = actor_distribution_network.ActorDistributionNetwork(
observation_spec,
action_spec,
fc_layer_params=actor_fc_layers,
continuous_projection_net=tanh_normal_projection_network
.TanhNormalProjectionNetwork)
critic_net = critic_network.CriticNetwork(
(observation_spec, action_spec),
observation_fc_layer_params=critic_obs_fc_layers,
action_fc_layer_params=critic_action_fc_layers,
joint_fc_layer_params=critic_joint_fc_layers,
kernel_initializer='glorot_uniform',
last_kernel_initializer='glorot_uniform')
tf_agent = sac_agent.SacAgent(
time_step_spec,
action_spec,
actor_network=actor_net,
critic_network=critic_net,
actor_optimizer=tf.compat.v1.train.AdamOptimizer(
learning_rate=actor_learning_rate),
critic_optimizer=tf.compat.v1.train.AdamOptimizer(
learning_rate=critic_learning_rate),
alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
learning_rate=alpha_learning_rate),
target_update_tau=target_update_tau,
target_update_period=target_update_period,
td_errors_loss_fn=td_errors_loss_fn,
gamma=gamma,
reward_scale_factor=reward_scale_factor,
gradient_clipping=gradient_clipping,
debug_summaries=debug_summaries,
summarize_grads_and_vars=summarize_grads_and_vars,
train_step_counter=global_step)
tf_agent.initialize()
# Make the replay buffer.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
data_spec=tf_agent.collect_data_spec,
batch_size=batch_size,
max_length=replay_buffer_capacity,
device="/device:GPU:0")
replay_observer = [replay_buffer.add_batch]
train_metrics = [
tf_metrics.NumberOfEpisodes(),
tf_metrics.EnvironmentSteps(),
tf_metrics.AverageReturnMetric(
buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
tf_metrics.AverageEpisodeLengthMetric(
buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
]
eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
initial_collect_policy = random_tf_policy.RandomTFPolicy(
tf_env.time_step_spec(), tf_env.action_spec())
collect_policy = tf_agent.collect_policy
train_checkpointer = common.Checkpointer(
ckpt_dir=train_dir,
agent=tf_agent,
global_step=global_step,
metrics=metric_utils.MetricsGroup(train_metrics, 'train_metrics'))
policy_checkpointer = common.Checkpointer(
ckpt_dir=os.path.join(train_dir, 'policy'),
policy=eval_policy,
global_step=global_step)
rb_checkpointer = common.Checkpointer(
ckpt_dir=os.path.join(train_dir, 'replay_buffer'),
max_to_keep=1,
replay_buffer=replay_buffer)
train_checkpointer.initialize_or_restore()
rb_checkpointer.initialize_or_restore()
initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
tf_env,
initial_collect_policy,
observers=replay_observer + train_metrics,
num_steps=initial_collect_steps)
collect_driver = dynamic_step_driver.DynamicStepDriver(
tf_env,
collect_policy,
observers=replay_observer + train_metrics,
num_steps=collect_steps_per_iteration)
if use_tf_functions:
initial_collect_driver.run = common.function(initial_collect_driver.run)
collect_driver.run = common.function(collect_driver.run)
tf_agent.train = common.function(tf_agent.train)
if replay_buffer.num_frames() == 0:
# Collect initial replay data.
logging.info(
'Initializing replay buffer by collecting experience for %d steps '
'with a random policy.', initial_collect_steps)
initial_collect_driver.run()
results = metric_utils.eager_compute(
eval_metrics,
eval_tf_env,
eval_policy,
num_episodes=num_eval_episodes,
train_step=global_step,
summary_writer=eval_summary_writer,
summary_prefix='Metrics',
)
if eval_metrics_callback is not None:
eval_metrics_callback(results, global_step.numpy())
metric_utils.log_metrics(eval_metrics)
time_step = None
policy_state = collect_policy.get_initial_state(tf_env.batch_size)
timed_at_step = global_step.numpy()
time_acc = 0
# Prepare replay buffer as dataset with invalid transitions filtered.
def _filter_invalid_transition(trajectories, unused_arg1):
return ~trajectories.is_boundary()[0]
dataset = replay_buffer.as_dataset(
sample_batch_size=batch_size,
num_steps=2).unbatch().filter(
_filter_invalid_transition).batch(batch_size).prefetch(5)
# Dataset generates trajectories with shape [Bx2x...]
iterator = iter(dataset)
def train_step():
experience, _ = next(iterator)
return tf_agent.train(experience)
if use_tf_functions:
train_step = common.function(train_step)
global_step_val = global_step.numpy()
while global_step_val < num_iterations:
start_time = time.time()
time_step, policy_state = collect_driver.run(
time_step=time_step,
policy_state=policy_state,
)
for _ in range(train_steps_per_iteration):
train_loss = train_step()
time_acc += time.time() - start_time
global_step_val = global_step.numpy()
if global_step_val % log_interval == 0:
logging.info('step = %d, loss = %f', global_step_val,
train_loss.loss)
steps_per_sec = (global_step_val - timed_at_step) / time_acc
logging.info('%.3f steps/sec', steps_per_sec)
tf.compat.v2.summary.scalar(
name='global_steps_per_sec', data=steps_per_sec, step=global_step)
timed_at_step = global_step_val
time_acc = 0
for train_metric in train_metrics:
train_metric.tf_summaries(
train_step=global_step, step_metrics=train_metrics[:2])
if global_step_val % eval_interval == 0:
results = metric_utils.eager_compute(
eval_metrics,
eval_tf_env,
eval_policy,
num_episodes=num_eval_episodes,
train_step=global_step,
summary_writer=eval_summary_writer,
summary_prefix='Metrics',
)
if eval_metrics_callback is not None:
eval_metrics_callback(results, global_step_val)
metric_utils.log_metrics(eval_metrics)
if global_step_val % train_checkpoint_interval == 0:
train_checkpointer.save(global_step=global_step_val)
if global_step_val % policy_checkpoint_interval == 0:
policy_checkpointer.save(global_step=global_step_val)
if global_step_val % rb_checkpoint_interval == 0:
rb_checkpointer.save(global_step=global_step_val)
return train_loss
def main(_):
tf.compat.v1.enable_v2_behavior()
logging.set_verbosity(logging.INFO)
gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_param)
train_eval(FLAGS.root_dir)
if __name__ == '__main__':
flags.mark_flag_as_required('root_dir')
app.run(main)
What is the appropriate way to create a batched environment for a custom, non-batched environment? I can share my custom environment, but I don't believe the issue lies there, as the code works fine with a batch size of 1.
Also, any tips on increasing GPU utilization in reinforcement learning scenarios would be greatly appreciated. I have examined examples of using the TensorBoard profiler to profile GPU utilization, but these seem to require callbacks and a fit function, which don't seem applicable to RL use cases.
It turns out I neglected to pass batch_size when initializing the AverageReturnMetric and AverageEpisodeLengthMetric instances.
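For reference, a rough sketch of the corrected eval metric construction, mirroring the batch_size argument the train metrics in the code above already pass (using the batch_size parameter of train_eval, which matches the size of the batched eval environment):

eval_metrics = [
    tf_metrics.AverageReturnMetric(
        buffer_size=num_eval_episodes, batch_size=batch_size),
    tf_metrics.AverageEpisodeLengthMetric(
        buffer_size=num_eval_episodes, batch_size=batch_size),
]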

Op type not registered 'IO>BigQueryClient' with BigQuery connector on AI Platform

I'm trying to parallelize the training step of my model with the TensorFlow ParameterServerStrategy. I use GCP AI Platform to create the cluster and launch the task.
As my dataset is huge, I use the BigQuery TensorFlow connector included in tensorflow-io.
My script is inspired by the documentation of the TensorFlow BigQuery reader and the documentation of the TensorFlow ParameterServerStrategy.
Locally my script works well, but when I launch it with AI Platform I get the following error:
{"created":"#1633444428.903993309","description":"Error received from peer ipv4:10.46.92.135:2222","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Op type not registered \'IO>BigQueryClient\' in binary running on gke-cml-1005-141531--n1-standard-16-2-644bc3f8-7h8p. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.","grpc_status":5}
The script works with fake data on AI Platform, and works locally with the BigQuery connector.
I imagine that compiling the model with the BigQuery connector and calling it on other devices triggers the bug, but I don't know how to fix it.
I read that this error happens when devices don't have the same TensorFlow version, so I checked the tensorflow and tensorflow-io versions on each device.
tensorflow : 2.5.0
tensorflow-io : 0.19.1
I created a similar example which reproduces the bug on AI Platform:
import os
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession
import tensorflow as tf
import multiprocessing
import portpicker
from tensorflow.keras.layers.experimental import preprocessing
from google.cloud import bigquery
from tensorflow.python.framework import dtypes
import numpy as np
import pandas as pd
client = bigquery.Client()
PROJECT_ID = <your_project>
DATASET_ID = 'tmp'
TABLE_ID = 'bq_tf_io'
BATCH_SIZE = 32
# Bigquery requirements
def init_bq_table():
table = '%s.%s.%s' %(PROJECT_ID, DATASET_ID, TABLE_ID)
# Create toy_data
def create_toy_data(N):
x = np.random.random(size = N)
y = 0.2 + x + np.random.normal(loc=0, scale = 0.3, size = N)
return x, y
x, y =create_toy_data(1000)
df = pd.DataFrame(data = {'x': x, 'y': y})
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE",)
job = client.load_table_from_dataframe( df, table, job_config=job_config )
job.result()
# Create initial data
#init_bq_table()
CSV_SCHEMA = [
bigquery.SchemaField("x", "FLOAT64"),
bigquery.SchemaField("y", "FLOAT64"),
]
def transform_row(row_dict):
# Trim all string tensors
dataset_x = row_dict
dataset_x['constant'] = tf.cast(1, tf.float64)
# Extract feature column
dataset_y = dataset_x.pop('y')
#Export as tensor
dataset_x = tf.stack([dataset_x[column] for column in dataset_x], axis=-1)
return (dataset_x, dataset_y)
def read_bigquery(table_name):
tensorflow_io_bigquery_client = BigQueryClient()
read_session = tensorflow_io_bigquery_client.read_session(
"projects/" + PROJECT_ID,
PROJECT_ID, TABLE_ID, DATASET_ID,
list(field.name for field in CSV_SCHEMA),
list(dtypes.double if field.field_type == 'FLOAT64'
else dtypes.string for field in CSV_SCHEMA),
requested_streams=2)
dataset = read_session.parallel_read_rows()
return dataset
def get_data():
dataset = read_bigquery(TABLE_ID)
dataset = dataset.map(transform_row, num_parallel_calls=4)
dataset = dataset.batch(BATCH_SIZE).prefetch(2)
return dataset
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
# parameter server and worker just wait jobs from the coordinator (chief)
if cluster_resolver.task_type in ("worker"):
worker_config = tf.compat.v1.ConfigProto()
server = tf.distribute.Server(
cluster_resolver.cluster_spec(),
job_name=cluster_resolver.task_type,
task_index=cluster_resolver.task_id,
config=worker_config,
protocol="grpc")
server.join()
elif cluster_resolver.task_type in ("ps"):
server = tf.distribute.Server(
cluster_resolver.cluster_spec(),
job_name=cluster_resolver.task_type,
task_index=cluster_resolver.task_id,
protocol="grpc")
server.join()
elif cluster_resolver.task_type == 'chief':
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver=cluster_resolver)
if cluster_resolver.task_type == 'chief':
learning_rate = 0.01
with strategy.scope():
# model
model_input = tf.keras.layers.Input(
shape=(2,), dtype=tf.float64)
layer_1 = tf.keras.layers.Dense( 8, activation='relu')(model_input)
dense_output = tf.keras.layers.Dense(1)(layer_1)
model = tf.keras.Model(model_input, dense_output)
#optimizer
optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate)
accuracy = tf.keras.metrics.MeanSquaredError()
#tf.function
def distributed_train_step(iterator):
def train_step(x_batch_train, y_batch_train):
with tf.GradientTape() as tape:
y_predict = model(x_batch_train, training=True)
loss_value = tf.keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)(y_batch_train, y_predict)
grads = tape.gradient(loss_value, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))
accuracy.update_state(y_batch_train, y_predict)
return loss_value
x_batch_train, y_batch_train = next(iterator)
return strategy.run(train_step, args=(x_batch_train, y_batch_train))
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)
#test
def dataset_fn(_):
def create_toy_data(N):
x = np.random.random(size = N)
y = 0.2 + x + np.random.normal(loc=0, scale = 0.3, size = N)
return np.c_[x,y]
def toy_transform_row(row):
dataset_x = tf.stack([row[0], tf.cast(1, tf.float64)], axis=-1)
dataset_y = row[1]
return dataset_x, dataset_y
N = 1000
data =create_toy_data(N)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.map(toy_transform_row, num_parallel_calls=4)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(2)
return dataset
#tf.function
def per_worker_dataset_fn():
return strategy.distribute_datasets_from_function(lambda x : get_data()) # <-- Not working with AI platform
#return strategy.distribute_datasets_from_function(dataset_fn) # <-- Working with AI platform
per_worker_dataset = coordinator.create_per_worker_dataset(per_worker_dataset_fn)
# Train model
for epoch in range(5):
per_worker_iterator = iter(per_worker_dataset)
accuracy.reset_states()
for step in range(5):
coordinator.schedule(distributed_train_step, args=(per_worker_iterator,))
coordinator.join()
print ("Finished epoch %d, accuracy is %f." % (epoch, accuracy.result().numpy()))
When I create the dataset with per_worker_dataset_fn() I can either use the BigQuery connector (which triggers the bug) or build the dataset on the fly (which works).
AI Platform cluster configuration:
runtimeVersion: "2.5"
pythonVersion: "3.7"
Has anyone run into this issue? The BigQuery connector worked pretty well with MirroredStrategy on AI Platform. Tell me if I should report the issue somewhere else.
I think this is due to lazy loading of libtensorflow_io.so.
https://github.com/tensorflow/io/commit/85d018ee59ceccfae06914ec2a2f6d6583775ff7
Can you try adding something like this to your code:
import tensorflow_io
tensorflow_io.experimental.oss()
As far as I understand, this happens because when you submit your training job to Cloud AI training, it uses a stock TensorFlow 2.5 environment that doesn't have the tensorflow-io package installed. Therefore it complains that it doesn't know about the 'IO>BigQueryClient' op defined in the tensorflow-io package.
Instead, you can submit your training job using a custom container:
https://cloud.google.com/ai-platform/training/docs/custom-containers-training
You don't need to write a new Dockerfile; you can use
gcr.io/deeplearning-platform-release/tf-cpu.2-5
or
gcr.io/deeplearning-platform-release/tf-gpu.2-5 (if your training job needs a GPU), both of which have the right version of tensorflow-io installed.
You can read more about these containers here:
https://cloud.google.com/tensorflow-enterprise/docs/use-with-deep-learning-containers
Here is my old example showing how to run distributed training on Cloud AI using BigQueryReader: https://github.com/vlasenkoalexey/criteo/blob/master/scripts/train-cloud.sh
It is no longer maintained, but should give you a general idea of how it should look.

How to automatically assign free GPUs in TensorFlow

I have 4 Tesla K80 GPUs in my system. I would like to automatically allocate free GPUs based on an integer input in the code. I am aware of tf.config.experimental.set_visible_devices() for assigning specific GPUs, but I currently do not know how to identify which GPUs are in use (except manually, using nvidia-smi). I am currently changing the code below for every run.
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# Restrict TensorFlow to only use the first GPU
try:
tf.config.experimental.set_visible_devices(gpus[2:], 'GPU')
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
except RuntimeError as e:
# Visible devices must be set before GPUs have been initialized
print(e)
The above code lets me set the GPUs I want to allocate (GPUs 2 and 3 in the above example) for the run. Is there any way to obtain a list of free (unused) devices to automate the allocation process, instead of manually having to identify which devices should be set?
I am currently using TensorFlow version 1.15
import subprocess, re
import os
import utils
# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23
# TF1.15
def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldnt parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # | 0 8734 C python 11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    rows = gpu_output.split("\n")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu

def pick_free_gpus(num_gpus=1):
    """Returns free GPUs with the least allocated memory"""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    sorted_list = sorted(memory_gpu_map)
    gpu_list = []
    for i in range(num_gpus):
        if sorted_list[i][0] == 0:
            gpu_list.append(sorted_list[i][1])
        else:
            print(f'Currently fewer than {num_gpus} GPUs are free right now, choose {i} or fewer GPUs')
            exit()
    return ','.join(map(str, gpu_list))

num_gpus = 2
os.environ["CUDA_VISIBLE_DEVICES"] = pick_free_gpus(num_gpus)
import tensorflow as tf
tf.config.optimizer.set_jit(True)  # Enable XLA.

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus, 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
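If you only need a single device, a minimal usage sketch with the pick_gpu_lowest_memory helper above could look like this (following the same pattern as above, where CUDA_VISIBLE_DEVICES is set before TensorFlow is imported so only the chosen GPU is visible):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(pick_gpu_lowest_memory())  # mask all but the emptiest GPU
import tensorflow as tf  # TensorFlow now only sees the selected device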

Implement early stopping in tf.estimator.DNNRegressor using the available training hooks

I am new to TensorFlow and want to implement early stopping in tf.estimator.DNNRegressor using the available training hooks for the MNIST dataset. The early stopping hook should stop training if the loss does not improve for some specified number of steps. The TensorFlow documentation only provides an example for logging hooks. Can someone write a code snippet for implementing it?
Here is an EarlyStoppingHook sample implementation:
import numpy as np
import tensorflow as tf
import logging
from tensorflow.python.training import session_run_hook
class EarlyStoppingHook(session_run_hook.SessionRunHook):
    """Hook that requests stop at a specified step."""

    def __init__(self, monitor='val_loss', min_delta=0, patience=0,
                 mode='auto'):
        """
        """
        self.monitor = monitor
        self.patience = patience
        self.min_delta = min_delta
        self.wait = 0
        if mode not in ['auto', 'min', 'max']:
            logging.warning('EarlyStopping mode %s is unknown, '
                            'fallback to auto mode.', mode, RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
        elif mode == 'max':
            self.monitor_op = np.greater
        else:
            if 'acc' in self.monitor:
                self.monitor_op = np.greater
            else:
                self.monitor_op = np.less

        if self.monitor_op == np.greater:
            self.min_delta *= 1
        else:
            self.min_delta *= -1

        self.best = np.Inf if self.monitor_op == np.less else -np.Inf

    def begin(self):
        # Convert names to tensors if given
        graph = tf.get_default_graph()
        self.monitor = graph.as_graph_element(self.monitor)
        if isinstance(self.monitor, tf.Operation):
            self.monitor = self.monitor.outputs[0]

    def before_run(self, run_context):  # pylint: disable=unused-argument
        return session_run_hook.SessionRunArgs(self.monitor)

    def after_run(self, run_context, run_values):
        current = run_values.results
        if self.monitor_op(current - self.min_delta, self.best):
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                run_context.request_stop()
This implementation is based on the Keras implementation.
To use it with the CNN MNIST example, create the hook and pass it to train:
early_stopping_hook = EarlyStoppingHook(monitor='sparse_softmax_cross_entropy_loss/value', patience=10)
mnist_classifier.train(
input_fn=train_input_fn,
steps=20000,
hooks=[logging_hook, early_stopping_hook])
Here sparse_softmax_cross_entropy_loss/value is the name of the loss op in that example.
EDIT 1:
It looks like there is no "official" way of finding the loss node when using estimators (or I can't find it).
For the DNNRegressor this node has name dnn/head/weighted_loss/Sum.
Here is how to find it in the graph:
Start TensorBoard in the model directory. In my case I didn't set any directory, so the estimator used a temporary directory and printed this line:
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpInj8SC
Start TensorBoard:
tensorboard --logdir /tmp/tmpInj8SC
Open it in a browser and navigate to the GRAPHS tab.
Find the loss in the graph: expand blocks in the sequence dnn → head → weighted_loss and click on the Sum node (note that there is a summary node named loss connected to it).
The name shown in the info window to the right is the name of the selected node, which needs to be passed to the monitor argument of EarlyStoppingHook.
Loss node of the DNNClassifier has the same name by default. Both DNNClassifier and DNNRegressor have optional argument loss_reduction that influences loss node name and behavior (defaults to losses.Reduction.SUM).
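For instance, a sketch of wiring the hook to a DNNRegressor under the default loss_reduction (so the loss node keeps the name above); feature_columns and train_input_fn are placeholders for your own input pipeline, not definitions from this answer:

regressor = tf.estimator.DNNRegressor(
    feature_columns=feature_columns,  # placeholder: your feature columns
    hidden_units=[64, 32])

early_stopping_hook = EarlyStoppingHook(
    monitor='dnn/head/weighted_loss/Sum', patience=10)

regressor.train(input_fn=train_input_fn,  # placeholder: your input function
                steps=20000,
                hooks=[early_stopping_hook])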
EDIT 2:
There is a way of finding the loss without looking at the graph.
You can use the GraphKeys.LOSSES collection to get the loss, but this only works after training has started, so you can only use it in a hook.
For example, you can remove the monitor argument from the EarlyStoppingHook class and change its begin function to always use the first loss in the collection:
self.monitor = tf.get_default_graph().get_collection(tf.GraphKeys.LOSSES)[0]
You also probably need to check that there is a loss in the collection.
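A sketch of what that modified begin could look like, including the suggested check that the collection is not empty (the rest of the hook stays as above):

def begin(self):
    # Take the first loss registered in the LOSSES collection instead of resolving a name.
    losses = tf.get_default_graph().get_collection(tf.GraphKeys.LOSSES)
    assert losses, "No loss found in the GraphKeys.LOSSES collection."
    self.monitor = losses[0]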