I am trying to train a TFLite model using only the person class from the COCO dataset.
I am using TFLite Model Maker for training and FiftyOne to process the dataset.
When I run the training script, I get the error below.
root@85ac26b47f92:/external# python demofie.py
2022-11-01 21:02:01.059188: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
2022-11-01 21:02:01.059234: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 85ac26b47f92
2022-11-01 21:02:01.059242: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 85ac26b47f92
2022-11-01 21:02:01.059324: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-11-01 21:02:01.059381: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
2022-11-01 21:02:01.059821: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "demofie.py", line 20, in <module>
train_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/train/data',annotations_dir='/external/train/labels', label_map=['person'],ignore_difficult_instances= False,num_shards = 100)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/core/data_util/object_detector_dataloader.py", line 217, in from_pascal_voc
cache_writer.write_files(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/core/data_util/object_detector_dataloader_util.py", line 252, in write_files
tf_example = create_pascal_tfrecord.dict_to_tf_example(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/third_party/efficientdet/dataset/create_pascal_tfrecord.py", line 162, in dict_to_tf_example
if obj['difficult'] == 'Unspecified':
KeyError: 'difficult'
The code that causes the error is below. Can anyone shed some light on any mistakes I may have made?
I have added the FiftyOne code after it (it runs without error).
import numpy as np
import os
from tflite_model_maker.config import QuantizationConfig
from tflite_model_maker.config import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import object_detector
import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')
from absl import logging
logging.set_verbosity(logging.ERROR)
spec = model_spec.get('efficientdet_lite1')
train_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/train/data',annotations_dir='/external/train/labels', label_map=['person'],ignore_difficult_instances= False,num_shards = 100)
validation_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/val/data',annotations_dir='/external/val/labels',label_map= ['person'],ignore_difficult_instances= False,num_shards = 100)
test_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/test/data',annotations_dir='/external/test/labels',label_map= ['person'],ignore_difficult_instances= False,num_shards = 100)
model = object_detector.create(train_data, model_spec=spec, batch_size=8,epochs=2000, train_whole_model=True, validation_data=validation_data)
model.evaluate(test_data)
model.export(export_dir='/external/')
**Dataset generation code**
import fiftyone.zoo as foz
import fiftyone as fo
from fiftyone import ViewField as F
cocodataset_test = foz.load_zoo_dataset(
"coco-2017",
splits="test",
label_types=["detections"],
classes=["person"],
only_matching=True,
# max_samples=50,
)
cocodataset_validation = foz.load_zoo_dataset(
"coco-2017",
splits="validation",
label_types=["detections"],
classes=["person"],
only_matching=True,
# max_samples=50
)
cocodataset_train = foz.load_zoo_dataset(
"coco-2017",
splits="train",
label_types=["detections"],
classes=["person"],
only_matching=True,
# max_samples=50,
)
cocodataset_validation.export(
'/external/val',
fo.types.VOCDetectionDataset,
)
cocodataset_train.export(
'/external/train/',
fo.types.VOCDetectionDataset,
)
cocodataset_test.export(
'/external/test/',
fo.types.VOCDetectionDataset,
)
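The traceback fails on `obj['difficult']`, which suggests the exported VOC XML files have no `<difficult>` tag inside their `<object>` elements. One possible workaround is to add the missing tags before calling `from_pascal_voc`. The following is a minimal sketch under that assumption; the helper names are hypothetical, and adding `truncated` and `pose` is a guess that the loader reads those fields too, which the traceback does not confirm.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed defaults. Only 'difficult' is confirmed by the traceback;
# 'truncated' and 'pose' are added on the assumption the loader reads them.
DEFAULT_TAGS = {"difficult": "0", "truncated": "0", "pose": "Unspecified"}


def add_missing_object_tags(xml_text, defaults=DEFAULT_TAGS):
    """Return VOC XML text in which every <object> carries the default tags."""
    root = ET.fromstring(xml_text)
    for obj in root.iter("object"):
        for tag, value in defaults.items():
            # Existing tags are left untouched; only absent ones are added.
            if obj.find(tag) is None:
                ET.SubElement(obj, tag).text = value
    return ET.tostring(root, encoding="unicode")


def patch_annotation_dir(annotations_dir):
    """Rewrite every .xml file in annotations_dir in place; return the count."""
    count = 0
    for path in sorted(Path(annotations_dir).glob("*.xml")):
        path.write_text(add_missing_object_tags(path.read_text()))
        count += 1
    return count
```

It could be run once per split, for example `patch_annotation_dir('/external/train/labels')`, before the data loaders are created. Because the files are rewritten in place, keeping a backup of the exported labels first is sensible.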
Related
I am trying to create a batched environment version of an SAC agent example from the Tensorflow Agents library, the original code can be found here. I am also using a custom environment.
I am pursuing a batched environment setup to make better use of GPU resources and speed up training. My understanding is that passing batches of trajectories to the GPU incurs less overhead when moving data from the host (CPU) to the device (GPU).
My custom environment is called SacEnv, and I attempt to create a batched environment like so:
py_envs = [SacEnv() for _ in range(0, batch_size)]
batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
tf_env = tf_py_environment.TFPyEnvironment(batched_env)
My hope is that this will create a batched environment consisting of a 'batch' of non-batched environments. However, I receive the following error when running the code:
ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
with the stack trace:
Traceback (most recent call last):
File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 370, in <module>
app.run(main)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 366, in main
train_eval(FLAGS.root_dir)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 274, in train_eval
results = metric_utils.eager_compute(
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/eval/metric_utils.py", line 163, in eager_compute
common.function(driver.run)(time_step, policy_state)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 211, in run
return self._run_fn(
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/utils/common.py", line 188, in with_check_resource_vars
return fn(*fn_args, **fn_kwargs)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 238, in _run
tf.while_loop(
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in loop_body
observer_ops = [observer(traj) for observer in self._observers]
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in <listcomp>
observer_ops = [observer(traj) for observer in self._observers]
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 93, in __call__
return self._update_state(*args, **kwargs)
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 81, in _update_state
return self.call(*arg, **kwargs)
ValueError: in user code:
File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metrics.py", line 176, in call *
self._return_accumulator.assign(
ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
In call to configurable 'eager_compute' (<function eager_compute at 0x7fa4d6e5e040>)
In call to configurable 'train_eval' (<function train_eval at 0x7fa4c8622dc0>)
I have dug through the tf_metric.py code to try and understand the error, however I have been unsuccessful. A related issue was solved when I added the batch size (32) to the initializer for the AverageReturnMetric instance, and this issue seems related.
The full code is:
# coding=utf-8
# Copyright 2020 The TF-Agents Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
r"""Train and Eval SAC.
All hyperparameters come from the SAC paper
https://arxiv.org/pdf/1812.05905.pdf
To run:
```bash
tensorboard --logdir $HOME/tmp/sac/gym/HalfCheetah-v2/ --port 2223 &
python tf_agents/agents/sac/examples/v2/train_eval.py \
--root_dir=$HOME/tmp/sac/gym/HalfCheetah-v2/ \
--alsologtostderr
\```
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from sac_env import SacEnv
import os
import time
from absl import app
from absl import flags
from absl import logging
import gin
from six.moves import range
import tensorflow as tf # pylint: disable=g-explicit-tensorflow-version-import
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.agents.sac import tanh_normal_projection_network
from tf_agents.drivers import dynamic_step_driver
#from tf_agents.environments import suite_mujoco
from tf_agents.environments import tf_py_environment
from tf_agents.environments import batched_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import actor_distribution_network
from tf_agents.policies import greedy_policy
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common
from tf_agents.train.utils import strategy_utils
flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
'Root directory for writing logs/summaries/checkpoints.')
flags.DEFINE_multi_string('gin_file', None, 'Path to the trainer config files.')
flags.DEFINE_multi_string('gin_param', None, 'Gin binding to pass through.')
FLAGS = flags.FLAGS
gpus = tf.config.list_physical_devices('GPU')
if gpus:
try:
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
print(e)
@gin.configurable
def train_eval(
root_dir,
env_name='SacEnv',
# The SAC paper reported:
# Hopper and Cartpole results up to 1000000 iters,
# Humanoid results up to 10000000 iters,
# Other mujoco tasks up to 3000000 iters.
num_iterations=3000000,
actor_fc_layers=(256, 256),
critic_obs_fc_layers=None,
critic_action_fc_layers=None,
critic_joint_fc_layers=(256, 256),
# Params for collect
# Follow https://github.com/haarnoja/sac/blob/master/examples/variants.py
# HalfCheetah and Ant take 10000 initial collection steps.
# Other mujoco tasks take 1000.
# Different choices roughly keep the initial episodes about the same.
#initial_collect_steps=10000,
initial_collect_steps=2000,
collect_steps_per_iteration=1,
replay_buffer_capacity=31250, # 1000000 / 32
# Params for target update
target_update_tau=0.005,
target_update_period=1,
# Params for train
train_steps_per_iteration=1,
#batch_size=256,
batch_size=32,
actor_learning_rate=3e-4,
critic_learning_rate=3e-4,
alpha_learning_rate=3e-4,
td_errors_loss_fn=tf.math.squared_difference,
gamma=0.99,
reward_scale_factor=0.1,
gradient_clipping=None,
use_tf_functions=True,
# Params for eval
num_eval_episodes=30,
eval_interval=10000,
# Params for summaries and logging
train_checkpoint_interval=50000,
policy_checkpoint_interval=50000,
rb_checkpoint_interval=50000,
log_interval=1000,
summary_interval=1000,
summaries_flush_secs=10,
debug_summaries=False,
summarize_grads_and_vars=False,
eval_metrics_callback=None):
"""A simple train and eval for SAC."""
root_dir = os.path.expanduser(root_dir)
train_dir = os.path.join(root_dir, 'train')
eval_dir = os.path.join(root_dir, 'eval')
train_summary_writer = tf.compat.v2.summary.create_file_writer(
train_dir, flush_millis=summaries_flush_secs * 1000)
train_summary_writer.set_as_default()
eval_summary_writer = tf.compat.v2.summary.create_file_writer(
eval_dir, flush_millis=summaries_flush_secs * 1000)
eval_metrics = [
tf_metrics.AverageReturnMetric(buffer_size=num_eval_episodes),
tf_metrics.AverageEpisodeLengthMetric(buffer_size=num_eval_episodes)
]
global_step = tf.compat.v1.train.get_or_create_global_step()
with tf.compat.v2.summary.record_if(
lambda: tf.math.equal(global_step % summary_interval, 0)):
py_envs = [SacEnv() for _ in range(0, batch_size)]
batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
tf_env = tf_py_environment.TFPyEnvironment(batched_env)
eval_py_envs = [SacEnv() for _ in range(0, batch_size)]
eval_batched_env = batched_py_environment.BatchedPyEnvironment(envs=eval_py_envs)
eval_tf_env = tf_py_environment.TFPyEnvironment(eval_batched_env)
time_step_spec = tf_env.time_step_spec()
observation_spec = time_step_spec.observation
action_spec = tf_env.action_spec()
strategy = strategy_utils.get_strategy(tpu=False, use_gpu=True)
with strategy.scope():
actor_net = actor_distribution_network.ActorDistributionNetwork(
observation_spec,
action_spec,
fc_layer_params=actor_fc_layers,
continuous_projection_net=tanh_normal_projection_network
.TanhNormalProjectionNetwork)
critic_net = critic_network.CriticNetwork(
(observation_spec, action_spec),
observation_fc_layer_params=critic_obs_fc_layers,
action_fc_layer_params=critic_action_fc_layers,
joint_fc_layer_params=critic_joint_fc_layers,
kernel_initializer='glorot_uniform',
last_kernel_initializer='glorot_uniform')
tf_agent = sac_agent.SacAgent(
time_step_spec,
action_spec,
actor_network=actor_net,
critic_network=critic_net,
actor_optimizer=tf.compat.v1.train.AdamOptimizer(
learning_rate=actor_learning_rate),
critic_optimizer=tf.compat.v1.train.AdamOptimizer(
learning_rate=critic_learning_rate),
alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
learning_rate=alpha_learning_rate),
target_update_tau=target_update_tau,
target_update_period=target_update_period,
td_errors_loss_fn=td_errors_loss_fn,
gamma=gamma,
reward_scale_factor=reward_scale_factor,
gradient_clipping=gradient_clipping,
debug_summaries=debug_summaries,
summarize_grads_and_vars=summarize_grads_and_vars,
train_step_counter=global_step)
tf_agent.initialize()
# Make the replay buffer.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
data_spec=tf_agent.collect_data_spec,
batch_size=batch_size,
max_length=replay_buffer_capacity,
device="/device:GPU:0")
replay_observer = [replay_buffer.add_batch]
train_metrics = [
tf_metrics.NumberOfEpisodes(),
tf_metrics.EnvironmentSteps(),
tf_metrics.AverageReturnMetric(
buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
tf_metrics.AverageEpisodeLengthMetric(
buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
]
eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
initial_collect_policy = random_tf_policy.RandomTFPolicy(
tf_env.time_step_spec(), tf_env.action_spec())
collect_policy = tf_agent.collect_policy
train_checkpointer = common.Checkpointer(
ckpt_dir=train_dir,
agent=tf_agent,
global_step=global_step,
metrics=metric_utils.MetricsGroup(train_metrics, 'train_metrics'))
policy_checkpointer = common.Checkpointer(
ckpt_dir=os.path.join(train_dir, 'policy'),
policy=eval_policy,
global_step=global_step)
rb_checkpointer = common.Checkpointer(
ckpt_dir=os.path.join(train_dir, 'replay_buffer'),
max_to_keep=1,
replay_buffer=replay_buffer)
train_checkpointer.initialize_or_restore()
rb_checkpointer.initialize_or_restore()
initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
tf_env,
initial_collect_policy,
observers=replay_observer + train_metrics,
num_steps=initial_collect_steps)
collect_driver = dynamic_step_driver.DynamicStepDriver(
tf_env,
collect_policy,
observers=replay_observer + train_metrics,
num_steps=collect_steps_per_iteration)
if use_tf_functions:
initial_collect_driver.run = common.function(initial_collect_driver.run)
collect_driver.run = common.function(collect_driver.run)
tf_agent.train = common.function(tf_agent.train)
if replay_buffer.num_frames() == 0:
# Collect initial replay data.
logging.info(
'Initializing replay buffer by collecting experience for %d steps '
'with a random policy.', initial_collect_steps)
initial_collect_driver.run()
results = metric_utils.eager_compute(
eval_metrics,
eval_tf_env,
eval_policy,
num_episodes=num_eval_episodes,
train_step=global_step,
summary_writer=eval_summary_writer,
summary_prefix='Metrics',
)
if eval_metrics_callback is not None:
eval_metrics_callback(results, global_step.numpy())
metric_utils.log_metrics(eval_metrics)
time_step = None
policy_state = collect_policy.get_initial_state(tf_env.batch_size)
timed_at_step = global_step.numpy()
time_acc = 0
# Prepare replay buffer as dataset with invalid transitions filtered.
def _filter_invalid_transition(trajectories, unused_arg1):
return ~trajectories.is_boundary()[0]
dataset = replay_buffer.as_dataset(
sample_batch_size=batch_size,
num_steps=2).unbatch().filter(
_filter_invalid_transition).batch(batch_size).prefetch(5)
# Dataset generates trajectories with shape [Bx2x...]
iterator = iter(dataset)
def train_step():
experience, _ = next(iterator)
return tf_agent.train(experience)
if use_tf_functions:
train_step = common.function(train_step)
global_step_val = global_step.numpy()
while global_step_val < num_iterations:
start_time = time.time()
time_step, policy_state = collect_driver.run(
time_step=time_step,
policy_state=policy_state,
)
for _ in range(train_steps_per_iteration):
train_loss = train_step()
time_acc += time.time() - start_time
global_step_val = global_step.numpy()
if global_step_val % log_interval == 0:
logging.info('step = %d, loss = %f', global_step_val,
train_loss.loss)
steps_per_sec = (global_step_val - timed_at_step) / time_acc
logging.info('%.3f steps/sec', steps_per_sec)
tf.compat.v2.summary.scalar(
name='global_steps_per_sec', data=steps_per_sec, step=global_step)
timed_at_step = global_step_val
time_acc = 0
for train_metric in train_metrics:
train_metric.tf_summaries(
train_step=global_step, step_metrics=train_metrics[:2])
if global_step_val % eval_interval == 0:
results = metric_utils.eager_compute(
eval_metrics,
eval_tf_env,
eval_policy,
num_episodes=num_eval_episodes,
train_step=global_step,
summary_writer=eval_summary_writer,
summary_prefix='Metrics',
)
if eval_metrics_callback is not None:
eval_metrics_callback(results, global_step_val)
metric_utils.log_metrics(eval_metrics)
if global_step_val % train_checkpoint_interval == 0:
train_checkpointer.save(global_step=global_step_val)
if global_step_val % policy_checkpoint_interval == 0:
policy_checkpointer.save(global_step=global_step_val)
if global_step_val % rb_checkpoint_interval == 0:
rb_checkpointer.save(global_step=global_step_val)
return train_loss
def main(_):
tf.compat.v1.enable_v2_behavior()
logging.set_verbosity(logging.INFO)
gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_param)
train_eval(FLAGS.root_dir)
if __name__ == '__main__':
flags.mark_flag_as_required('root_dir')
app.run(main)
What is the appropriate way to create a batched environment for a custom, non-batched environment? I can share my custom environment, but I don't believe the issue lies there as the code works fine when using batch sizes of 1.
Also, any tips on increasing GPU utilization in reinforcement learning scenarios would be greatly appreciated. I have examined examples of using tensorboard-profiler to profile GPU utilization, but it seems these require callbacks and a fit function, which doesn't seem to be applicable in RL use-cases.
It turns out I neglected to pass batch_size when initializing the AverageReturnMetric and AverageEpisodeLengthMetric instances.
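The shape mismatch can be reproduced in plain Python without TF-Agents. This is only a toy sketch: `ReturnAccumulator` is a hypothetical stand-in for the metric's internal accumulator, not the library's implementation. It shows why an accumulator sized for one environment rejects a batch of 32 rewards, and why sizing it to the batch resolves the error.

```python
class ReturnAccumulator:
    """Hypothetical stand-in for a metric's per-environment return accumulator."""

    def __init__(self, batch_size=1):
        # One running total per environment in the batch.
        self.totals = [0.0] * batch_size

    def update(self, rewards):
        if len(rewards) != len(self.totals):
            raise ValueError(
                "Shape mismatch: accumulator (%d,), rewards (%d,)"
                % (len(self.totals), len(rewards)))
        self.totals = [t + r for t, r in zip(self.totals, rewards)]
        return self.totals


rewards = [1.0] * 32  # one reward per environment in a batch of 32

try:
    ReturnAccumulator().update(rewards)  # default size 1, like the eval metrics
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

totals = ReturnAccumulator(batch_size=32).update(rewards)  # sized to the batch

# In the script above, the corresponding change would be to build eval_metrics
# the same way train_metrics is built:
#   tf_metrics.AverageReturnMetric(
#       buffer_size=num_eval_episodes, batch_size=eval_tf_env.batch_size)
#   tf_metrics.AverageEpisodeLengthMetric(
#       buffer_size=num_eval_episodes, batch_size=eval_tf_env.batch_size)
```

Note that `eval_metrics` is created before `eval_tf_env` in the script, so the metric list would need to move below the environment setup, or use `batch_size` directly.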
I am running the code from https://github.com/erezposner/Pose2Seg
and I followed all the steps in this tutorial: https://towardsdatascience.com/detection-free-human-instance-segmentation-using-pose2seg-and-pytorch-72f48dc4d23e
but I get this CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.57 GiB already allocated; 74.77 MiB free; 2.85 GiB reserved in total by PyTorch) (malloc at ..\c10\cuda\CUDACachingAllocator.cpp:289) (no backtrace available)
How can I solve this?
(base) C:\Users\ASUS\Pose2Seg>python train.py
06-23 07:30:01 ===========> loading model <===========
total params in model is 334, in pretrained model is 336, init 334
06-23 07:30:03 ===========> loading data <===========
loading annotations into memory...
Done (t=4.56s)
creating index...
index created!
06-23 07:30:08 ===========> set optimizer <===========
06-23 07:30:08 ===========> training <===========
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:2796: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:2973: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode))
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:3289: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\nn\functional.py:3226: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
06-23 07:30:13 Epoch: [0][0/56599] Lr: [6.68e-05] Time 4.228 (4.228) Data 0.028 (0.028) loss 0.85738 (0.85738)
06-23 07:30:22 Epoch: [0][10/56599] Lr: [6.813333333333334e-05] Time 0.847 (1.280) Data 0.012 (0.051) loss 0.44195 (0.71130)
06-23 07:30:33 Epoch: [0][20/56599] Lr: [6.946666666666667e-05] Time 0.882 (1.180) Data 0.045 (0.037) loss 0.41523 (0.60743)
Traceback (most recent call last):
File "train.py", line 157, in <module>
optimizer, epoch, iteration)
File "train.py", line 74, in train
loss.backward()
File "C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\ASUS\Anaconda3\Anaconda\lib\site-packages\torch\autograd\__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.57 GiB already allocated; 74.77 MiB free; 2.85 GiB reserved in total by PyTorch) (malloc at ..\c10\cuda\CUDACachingAllocator.cpp:289)
(no backtrace available)
cudatoolkit == 10.1.243
python3.6.5
The version of libs:
>>> import tensorflow
2020-06-23 09:45:01.840827: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
>>> tensorflow.__version__
'2.2.0'
>>> import keras
Using TensorFlow backend.
>>> keras.__version__
'2.3.1'
>>> import torch
>>> torch.__version__
'1.5.1'
>>> import torchvision
>>> torchvision.__version__
'0.6.1'
>>> import pycocotools
train.py code
import os
import sys
import time
import logging
import argparse
import numpy as np
from tqdm import tqdm
import torch
import torch.utils.data
from lib.averageMeter import AverageMeters
from lib.logger import colorlogger
from lib.timer import Timers
from lib.averageMeter import AverageMeters
from lib.torch_utils import adjust_learning_rate
import os
from modeling.build_model import Pose2Seg
from datasets.CocoDatasetInfo import CocoDatasetInfo, annToMask
from test import test
NAME = "release_base"
# Set `LOG_DIR` and `SNAPSHOT_DIR`
def setup_logdir():
timestamp = time.strftime("%Y-%m-%d_%H_%M_%S", time.localtime())
LOGDIR = os.path.join(os.getcwd(), 'logs', '%s_%s' % (NAME, timestamp))
SNAPSHOTDIR = os.path.join(
os.getcwd(), 'snapshot', '%s_%s' % (NAME, timestamp))
if not os.path.exists(LOGDIR):
os.makedirs(LOGDIR)
if not os.path.exists(SNAPSHOTDIR):
os.makedirs(SNAPSHOTDIR)
return LOGDIR, SNAPSHOTDIR
LOGDIR, SNAPSHOTDIR = setup_logdir()
# Set logging
logger = colorlogger(log_dir=LOGDIR, log_name='train_logs.txt')
# Set Global Timer
timers = Timers()
# Set Global AverageMeter
averMeters = AverageMeters()
def train(model, dataloader, optimizer, epoch, iteration):
# switch to train mode
model.train()
averMeters.clear()
end = time.time()
for i, inputs in enumerate(dataloader):
averMeters['data_time'].update(time.time() - end)
iteration += 1
lr = adjust_learning_rate(optimizer, iteration, BASE_LR=0.0002,
WARM_UP_FACTOR=1.0/3, WARM_UP_ITERS=1000,
STEPS=(0, 14150*15, 14150*20), GAMMA=0.1)
# forward
outputs = model(**inputs)
# loss
loss = outputs
# backward
averMeters['loss'].update(loss.data.item())
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
averMeters['batch_time'].update(time.time() - end)
end = time.time()
if i % 10 == 0:
logger.info('Epoch: [{0}][{1}/{2}]\t'
'Lr: [{3}]\t'
'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
'loss {loss.val:.5f} ({loss.avg:.5f})\t'
.format(
epoch, i, len(dataloader), lr,
batch_time=averMeters['batch_time'], data_time=averMeters['data_time'],
loss=averMeters['loss'])
)
if i % 10000 == 0:
torch.save(model.state_dict(), os.path.join(
SNAPSHOTDIR, '%d_%d.pkl' % (epoch, i)))
torch.save(model.state_dict(), os.path.join(
SNAPSHOTDIR, 'last.pkl'))
return iteration
class Dataset():
def __init__(self):
ImageRoot = r'C:\Users\ASUS\Pose2Seg\data\coco2017\train2017'
AnnoFile = r'C:\Users\ASUS\Pose2Seg\data\coco2017\annotations\person_keypoints_train2017_pose2seg.json'
self.datainfos = CocoDatasetInfo(
ImageRoot, AnnoFile, onlyperson=True, loadimg=True)
def __len__(self):
return len(self.datainfos)
def __getitem__(self, idx):
rawdata = self.datainfos[idx]
img = rawdata['data']
image_id = rawdata['id']
height, width = img.shape[0:2]
gt_kpts = np.float32(rawdata['gt_keypoints']).transpose(
0, 2, 1) # (N, 17, 3)
gt_segms = rawdata['segms']
gt_masks = np.array([annToMask(segm, height, width)
for segm in gt_segms])
return {'img': img, 'kpts': gt_kpts, 'masks': gt_masks}
def collate_fn(self, batch):
batchimgs = [data['img'] for data in batch]
batchkpts = [data['kpts'] for data in batch]
batchmasks = [data['masks'] for data in batch]
return {'batchimgs': batchimgs, 'batchkpts': batchkpts, 'batchmasks': batchmasks}
if __name__ == '__main__':
logger.info('===========> loading model <===========')
model = Pose2Seg().cuda()
# model.init("")
model.train()
logger.info('===========> loading data <===========')
datasetTrain = Dataset()
dataloaderTrain = torch.utils.data.DataLoader(datasetTrain, batch_size=1, shuffle=True,
num_workers=0, pin_memory=False,
collate_fn=datasetTrain.collate_fn)
logger.info('===========> set optimizer <===========')
''' set your optimizer like this. Normally is Adam/SGD. '''
#optimizer = torch.optim.SGD(model.parameters(), 0.0002, momentum=0.9, weight_decay=0.0005)
optimizer = torch.optim.Adam(
model.parameters(), 0.0002, weight_decay=0.0000)
iteration = 0
epoch = 0
try:
while iteration < 14150*25:
logger.info('===========> training <===========')
iteration = train(model, dataloaderTrain,
optimizer, epoch, iteration)
epoch += 1
logger.info('===========> testing <===========')
test(model, dataset='cocoVal', logger=logger.info)
test(model, dataset='OCHumanVal', logger=logger.info)
except (KeyboardInterrupt):
logger.info('Save ckpt on exception ...')
torch.save(model.state_dict(), os.path.join(
SNAPSHOTDIR, 'interrupt_%d_%d.pkl' % (epoch, iteration)))
logger.info('Save ckpt done.')
Your GPU doesn't have enough memory. Try reducing the batch size. If the error persists (the script above already uses batch_size=1), try reducing the input image size. It should work fine then.
By the way, for this type of model, 8 GB of GPU memory is recommended.
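Reducing the input image size means the keypoints have to shrink by the same factor so they still line up with the image. Here is a minimal sketch of that arithmetic; the helper names and the `max_side=512` cap are hypothetical and not part of Pose2Seg. The image and the masks would also need resizing to the returned size, which is not shown here.

```python
def capped_size(height, width, max_side=512):
    """Return (new_height, new_width, scale) so the longer side is <= max_side."""
    longer = max(height, width)
    if longer <= max_side:
        # Already small enough; leave the image untouched.
        return height, width, 1.0
    scale = max_side / longer
    new_height = max(1, round(height * scale))
    new_width = max(1, round(width * scale))
    return new_height, new_width, scale


def scale_keypoints(kpts, scale):
    """Scale (x, y, visibility) triples; the visibility flag is unchanged."""
    return [(x * scale, y * scale, v) for (x, y, v) in kpts]
```

In `Dataset.__getitem__`, the scale could be computed from `img.shape[0:2]` and applied to `img`, `gt_kpts` and `gt_masks` before they are returned. A smaller cap lowers activation memory at some cost in accuracy.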
I'm trying to train an NER model using spaCy. It works fine on the CPU, but when I run it on the GPU I get the following error. spaCy version 2.1.4, CUDA version 10.1.
I tried reinstalling thinc, but I still get the error.
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
import json
spacy.require_gpu()
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)  # register the NER component in the pipeline
for _, annotations in TRAIN_DATA:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
optimizer = nlp.begin_training()
I'm getting the following error:
CUDARuntimeError
Traceback (most recent call last)
in
----> 1 optimizer = nlp.begin_training()
G:\Anaconda3\lib\site-packages\spacy\language.py in begin_training(self, get_gold_tuples, sgd, component_cfg, **cfg)
547 if self.vocab.vectors.data.shape[1] >= 1:
548 self.vocab.vectors.data = Model.ops.asarray(self.vocab.vectors.data)
--> 549 link_vectors_to_models(self.vocab)
550 if self.vocab.vectors.data.shape[1]:
551 cfg["pretrained_vectors"] = self.vocab.vectors.name
G:\Anaconda3\lib\site-packages\spacy\_ml.py in link_vectors_to_models(vocab)
297 else:
298 word.rank = 0
--> 299 data = ops.asarray(vectors.data)
300 # Set an entry here, so that vectors are accessed by StaticVectors
301 # (unideal, I know)
ops.pyx in thinc.neural.ops.CupyOps.asarray()
G:\Anaconda3\lib\site-packages\cupy\creation\from_data.py in array(obj, dtype, copy, order, subok, ndmin)
39
40 """
---> 41 return core.array(obj, dtype, copy, order, subok, ndmin)
42
43
cupy\core\core.pyx in cupy.core.core.array()
cupy\core\core.pyx in cupy.core.core.array()
cupy\core\core.pyx in cupy.core.core.ndarray.__init__()
cupy\cuda\memory.pyx in cupy.cuda.memory.alloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\device.pyx in cupy.cuda.device.get_device_id()
cupy\cuda\runtime.pyx in cupy.cuda.runtime.getDevice()
cupy\cuda\runtime.pyx in cupy.cuda.runtime.check_status()
CUDARuntimeError: cudaErrorUnknown: unknown error
I have written TensorFlow code using the TPUEstimator, but I am having problems running it in use_tpu=False mode. I would like to run it on my local computer to make sure that all the operations are TPU-compatible. The code works fine with the normal Estimator. Here is my main script:
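import logging
import os  # used below for os.environ['TPU_NAME']
import tensorflow as tf  # used below for tf.flags and tf.ConfigProto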
from tensorflow.contrib.tpu.python.tpu import tpu_config, tpu_estimator, tpu_optimizer
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
from capser_7_model_fn import *
from capser_7_input_fn import *
import subprocess
from absl import flags
flags.DEFINE_bool(
    'use_tpu', False,
    'Use TPUs rather than plain CPUs')
tf.flags.DEFINE_string(
    "tpu", default='$TPU_NAME',
    help="The Cloud TPU to use for training. This should be either the name "
         "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
         "url.")
tf.flags.DEFINE_string("model_dir", LOGDIR, "Estimator model_dir")
flags.DEFINE_integer(
    'save_checkpoints_secs', 1000,
    'Interval (in seconds) at which the model data '
    'should be checkpointed. Set to 0 to disable.')
flags.DEFINE_integer(
    'save_summary_steps', 100,
    'Number of steps which must have run before showing summaries.')
tf.flags.DEFINE_integer("iterations", 1000,
                        "Number of iterations per TPU training loop.")
tf.flags.DEFINE_integer("num_shards", 8, "Number of shards (TPU chips).")
tf.flags.DEFINE_integer("batch_size", 1024,
                        "Mini-batch size for the training. Note that this "
                        "is the global batch size and not the per-shard batch.")
FLAGS = tf.flags.FLAGS
if FLAGS.use_tpu:
    my_project_name = subprocess.check_output(['gcloud', 'config', 'get-value', 'project'])
    my_zone = subprocess.check_output(['gcloud', 'config', 'get-value', 'compute/zone'])
    cluster_resolver = TPUClusterResolver(
        tpu=[FLAGS.tpu],
        zone=my_zone,
        project=my_project_name)
    master = TPUClusterResolver(tpu=[os.environ['TPU_NAME']]).get_master()
else:
    master = ''
my_tpu_run_config = tpu_config.RunConfig(
    master=master,
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=FLAGS.save_checkpoints_secs,
    save_summary_steps=FLAGS.save_summary_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
    tpu_config=tpu_config.TPUConfig(iterations_per_loop=FLAGS.iterations, num_shards=FLAGS.num_shards),
)
# create estimator for model (the model is described in capser_7_model_fn)
capser = tpu_estimator.TPUEstimator(model_fn=model_fn_tpu,
                                    config=my_tpu_run_config,
                                    use_tpu=FLAGS.use_tpu,
                                    train_batch_size=batch_size,
                                    params={'model_batch_size': batch_size_per_shard})
# train model
logging.getLogger().setLevel(logging.INFO) # to show info about training progress
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
I have a capsule network defined in model_fn_tpu, which returns the TPUEstimator spec. The optimizer is a standard AdamOptimizer. I have made all the changes explained here https://www.tensorflow.org/guide/using_tpu#optimizer to make my code compatible with TPUEstimator. I get the following error:
Traceback (most recent call last):
File "C:/Users/doerig/PycharmProjects/capser/TPU_playground.py", line 85, in <module>
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 856, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 831, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 2016, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1121, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1317, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "C:\Users\doerig\PycharmProjects\capser\capser_7_model_fn.py", line 101, in model_fn_tpu
**output_decoder_deconv_params)
File "C:\Users\doerig\PycharmProjects\capser\capser_model.py", line 341, in capser_model
loss_training_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step(), name="training_op")
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\training\optimizer.py", line 424, in minimize
name=name)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py", line 113, in apply_gradients
summed_grads_and_vars.append((tpu_ops.cross_replica_sum(grad), var))
AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'cross_replica_sum'
Any ideas to solve this problem? Thank you in advance!
I suspect this is either a bug in the version of TensorFlow you are using + Windows, or else an issue with your build of TensorFlow.
For example, when I chase down the file tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py in the TF 1.4 branch, I see that tpu_ops is imported as:
from tensorflow.contrib.tpu.python.ops import tpu_ops
and if you chase that to the relevant file, you see:
if platform.system() != "Windows":
    # pylint: disable=wildcard-import,unused-import,g-import-not-at-top
    from tensorflow.contrib.tpu.ops.gen_tpu_ops import *
    from tensorflow.contrib.util import loader
    from tensorflow.python.platform import resource_loader
    # pylint: enable=wildcard-import,unused-import,g-import-not-at-top
    _tpu_ops = loader.load_op_library(
        resource_loader.get_path_to_datafile("_tpu_ops.so"))
else:
    # We have already built the appropriate libraries into the binary via CMake
    # if we have built contrib, so we don't need this
    pass
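In other words, on Windows the wildcard import from gen_tpu_ops is skipped, so names such as cross_replica_sum are presumably never added to the tpu_ops module, which matches the AttributeError in your traceback. Following up with the other TF branches that existed at the time of this posting, we see the same Windows guard and similar comments in 1.5, in 1.6, in 1.7, in 1.8, and in 1.9.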
I strongly suspect this would not occur under Linux, but I might test this later and edit this answer.
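To illustrate the mechanism, here is a small self-contained sketch that does not use TensorFlow at all. It builds a stand-in module whose attributes are only defined when the platform name is not "Windows", mirroring the guard quoted above. The names build_ops_module and fake_tpu_ops are hypothetical and exist only for this illustration:

```python
import types


def build_ops_module(system_name):
    """Mimic a module that only defines its ops on non-Windows platforms."""
    module = types.ModuleType("fake_tpu_ops")
    if system_name != "Windows":
        # Stand-in for the names pulled in by the guarded wildcard import.
        module.cross_replica_sum = lambda value: value
    return module


if __name__ == "__main__":
    for system_name in ("Linux", "Windows"):
        module = build_ops_module(system_name)
        print(system_name, hasattr(module, "cross_replica_sum"))
```

Accessing cross_replica_sum on the "Windows" variant raises an AttributeError of the same kind as the one in the question, which is why a hasattr check on the real tpu_ops module may be a quick way to confirm this diagnosis on your machine.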
I trained a model in Python with Keras. I saved the model as an .h5 file and then used a script to convert the .h5 file to a .pb file. Here is my script:
from tensorflow.python.framework import graph_util
from tensorflow.python.framework import graph_io
from keras.models import load_model
from keras import backend as K
import os.path as osp
import os
import tensorflow as tf
model = load_model("/media/hsmnzaydn/8AD030E8D030DBDF/Projects/Machine Learning/Basic Keras/CancerDetected/modelim.h5")
nb_classes = 1 # The number of output nodes in the model
prefix_output_node_names_of_final_network = 'output_node'
K.set_learning_phase(0)
pred = [None]*nb_classes
pred_node_names = [None]*nb_classes
for i in range(nb_classes):
    pred_node_names[i] = prefix_output_node_names_of_final_network+str(i)
    pred[i] = tf.identity(model.output[i], name=pred_node_names[i])
print('output nodes names are: ', pred_node_names)
sess = K.get_session()
output_fld = 'tensorflow_model/'
if not os.path.isdir(output_fld):
    os.mkdir(output_fld)
output_graph_name = "./" + '.pb'
output_graph_suffix = '_inference'
constant_graph = graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), pred_node_names)
graph_io.write_graph(constant_graph, output_fld, output_graph_name, as_text=False)
print('saved the constant graph (ready for inference) at: ', osp.join(output_fld, output_graph_name))
Then I moved the .pb file into the TensorFlow root directory and tried to convert it to a .lite file with the bazel-built toco tool, using this command:
bazel-bin/tensorflow/contrib/lite/toco/toco --input_file=modelim.pb --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE --output_file=modelim.lite --inference_type=FLOAT --input_type=FLOAT --input_arrays=dense_1_input --output_arrays=output_node0 --input_shapes=1,2
but I get this error:
2018-03-27 22:23:18.655997: W tensorflow/contrib/lite/toco/toco_cmdline_flags.cc:183] --input_type is deprecated. It was an ambiguous flag that set both --input_data_types and --inference_input_type. If you are trying to complement the input file with information about the type of input arrays, use --input_data_type. If you are trying to control the quantization/dequantization of real-numbers input arrays in the output file, use --inference_input_type.
2018-03-27 22:23:18.656633: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39] Before Removing unused ops: 17 operators, 27 arrays (0 quantized)
2018-03-27 22:23:18.656758: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39] Before general graph transformations: 17 operators, 27 arrays (0 quantized)
2018-03-27 22:23:18.656837: F tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc:447] Check failed: matmul_repeats * weights_shape.dims(1) == input_overall_size (0 vs. 2)
Aborted
Does anyone know a solution?