I followed the example from the official TensorFlow website:
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
Here is my setup:
WSL
Ubuntu 16.04.6 LTS
TensorFlow 2.0
No GPU available
I have a file called 'tfexample.py' which looks like this:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import json, os

tfds.disable_progress_bar()

os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
        "task": {"type": "worker", "index": 0},
    }
)

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

BUFFER_SIZE = 10000
BATCH_SIZE = 64

def make_datasets_unbatched():
    # Scaling MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)
    return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)

train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)

def build_and_compile_cnn_model():
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )
    model.compile(
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model

# single_worker_model = build_and_compile_cnn_model()
# single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)

NUM_WORKERS = 2
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS

with strategy.scope():
    # Creation of dataset, and model building/compiling need to be within
    # `strategy.scope()`.
    train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
    multi_worker_model = build_and_compile_cnn_model()

# Keras' `model.fit()` trains the model with specified number of epochs and
# number of steps per epoch. Note that the numbers here are for demonstration
# purposes only and may not sufficiently produce a model with good quality.
multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
When I run this file with
python tfexample.py
the terminal just hangs, as shown below:
2020-02-04 17:50:23.483411: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485194: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/danny/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
2020-02-04 17:50:29.013263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-04 17:50:29.014152: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-04 17:50:29.014781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (WINDOWS-6DFFM0Q): /proc/driver/nvidia/version does not exist
2020-02-04 17:50:29.015780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-04 17:50:29.025575: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2701000000 Hz
2020-02-04 17:50:29.027050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66b11a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-04 17:50:29.027669: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
E0204 17:50:29.038614800 24084 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"#1580856629.038575000","description":"Protocol not available","errno":92,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":175,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
E0204 17:50:29.039313500 24084 socket_utils_common_posix.cc:299] setsockopt(TCP_USER_TIMEOUT) Protocol not available
2020-02-04 17:50:29.051180: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2020-02-04 17:50:29.053392: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
Any help will be appreciated!
Are you running tfexample.py in two sessions with the correct TF_CONFIG? I haven't tried two instances on the same machine.
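For reference, here is a minimal sketch (my addition, assuming both workers run on the same machine with the ports from the question) of the only part of tfexample.py that would change in the second session, so that worker 0 has a peer to join instead of waiting forever:

import json, os  # same imports as in the question's script

# Sketch only: the second worker uses the same cluster spec, but task index 1
# identifies this process as worker 1 instead of worker 0.
os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
        "task": {"type": "worker", "index": 1},
    }
)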
This problem is because MultiWorkerMirroredStrategy() needs as many different physical devices as the number of workers you want to run. If you want to run your script on your local machine, you can run each worker in a different Docker container.
I want to plot my model using the keras.utils.plot_model function. My problem is that when I plot the model, the input and output shapes are not placed on top of each other and instead are put alongside each other (like figure 1).
Here is the code to plot this model:
# imports assumed from the snippet (layers and plot_model are referenced below)
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model

model = tf.keras.models.Sequential()
model.add(layers.Embedding(100, 128, input_length=45,
                           input_shape=(45,), name='embed'))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))

plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=False)
But I would like to have a model plot such as figure 2, which is the typical figure found on the internet and which I have created many times before.
I couldn't find any figsize or fontsize option in plot_model to try changing them. I use a Google Colaboratory notebook.
Any help is appreciated.
I also had the same issue, and I finally found this GitHub link.
github
This problem seems to happen because we're using TensorFlow 2.8.0.
As mentioned in the link, one valid solution is to change the TensorFlow version, for example to tf-nightly.
[tensorflow ver2.8.0]
import tensorflow as tf
tf.__version__
# 2.8.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, input_shape=[1], name="input_layer")
], name="model_1")
model.compile(...)
[tensorflow nightly]
!pip --quiet install tf-nightly  # try not to use tf ver2.8
import tensorflow as tf
tf.__version__
# 2.10.0-dev20220403

# just do the same thing as above
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, input_shape=[1], name="input_layer")
], name="model_1")
model.compile(...)
I hope you solve this problem.
It is easy, but a Sequential model is more easily managed.
What are the embedding layers and dataset buffers? They are batches of input; you manage the combination or number of batches.
(Drawing the graphs with MS Word or other drawing tools is faster; I use FreeOffice when studying.)
[ Codes ]:
import tensorflow as tf
from tensorflow.keras.utils import plot_model

model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(100,), dtype='int32', name='input'),
    tf.keras.layers.Embedding(output_dim=512, input_dim=100, input_length=100),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='output'),
])

dot_img_file = 'F:\\temp\\Python\\img\\001.png'
tf.keras.utils.plot_model(model, to_file=dot_img_file, show_shapes=True)
# <IPython.core.display.Image object>
input('...')
[ Output ]:
F:\temp\Python>python test_tf_plotgraph.py
2022-03-28 14:21:26.043715: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-28 14:21:26.645113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4565 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
...
...
I trained a custom MobileNetV2 SSD model for object detection. I saved the .pb file and now I want to convert it into a .tflite-file in order to use it with Coral edge-tpu.
I use TensorFlow 2.2 on Windows 10, CPU only.
The code I'm using:
import tensorflow as tf
import numpy as np  # used by the representative dataset below

saved_model_dir = r"C:/Tensorflow/Backup_Training/my_MobileNetV2_fpnlite_320x320/saved_model"
num_calibration_steps = 100

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array
        yield [np.random.uniform(0.0, 1.0, size=(1, 416, 416, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.experimental_new_converter = True
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [
    # tf.lite.OpsSet.TFLITE_BUILTINS_INT8,
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
I tried several proposed solutions from other threads, and I also tried it with tf-nightly, tf2.3, and tf1.14, but none of them worked (there was always another error message I couldn't handle). Since I trained with tf2.2, I thought it might be a good idea to proceed with tf2.2.
Since I'm new to TensorFlow, I have several questions: what exactly is the input tensor, and where do I define it? Is there a possibility to see or extract this input tensor?
Does anybody know how to fix this issue?
The whole error message:
(tf22) C:\Tensorflow\Backup_Training>python full_int_quant.py
2020-10-22 14:51:20.460948: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-10-22 14:51:20.466366: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-10-22 14:51:29.231404: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2020-10-22 14:51:29.239003: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-10-22 14:51:29.250497: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: ip3536
2020-10-22 14:51:29.258432: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: ip3536
2020-10-22 14:51:29.269261: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-10-22 14:51:29.291457: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2ae2ac3ffc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-22 14:51:29.298043: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-22 14:52:03.785341: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-10-22 14:52:03.790251: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-10-22 14:52:04.559832: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-10-22 14:52:04.564529: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] function_optimizer: Graph size after: 3672 nodes (3263), 5969 edges (5553), time = 136.265ms.
2020-10-22 14:52:04.570187: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] function_optimizer: function_optimizer did nothing. time = 2.637ms.
2020-10-22 14:52:10.742013: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-10-22 14:52:10.746868: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-10-22 14:52:12.358897: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-10-22 14:52:12.363657: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] constant_folding: Graph size after: 1714 nodes (-1958), 2661 edges (-3308), time = 900.347ms.
2020-10-22 14:52:12.369137: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] constant_folding: Graph size after: 1714 nodes (0), 2661 edges (0), time = 60.628ms.
Traceback (most recent call last):
File "full_int_quant.py", line 40, in <module>
tflite_model = converter.convert()
File "C:\Users\schulzyk\Anaconda3\envs\tf22\lib\site-packages\tensorflow\lite\python\lite.py", line 480, in convert
raise ValueError(
ValueError: None is only supported in the 1st dimension. Tensor 'input_tensor' has invalid shape '[1, None, None, 3]'.
Whatever I change in the code, I always get the same error message. I don't know if this is a sign that something went wrong during training, but there were no eye-catching occurrences.
I'd be happy for any kind of feedback!
Ahh, the object detection API with TensorFlow 2.0 for Coral is still a WIP. We are having many roadblocks and may not see this feature soon. I suggest using the tf1.x API for now. Here is a good tutorial :)
https://github.com/Namburger/edgetpu-ssdlite-mobiledet-retrain
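On the side question about seeing or extracting the input tensor: here is a minimal sketch (my addition, assuming the saved_model directory from the question) that loads the SavedModel and prints its serving signature, which is where the 'input_tensor' with shape [1, None, None, 3] in the error comes from:

import tensorflow as tf

# Path copied from the question; adjust to your export directory.
saved_model_dir = r"C:/Tensorflow/Backup_Training/my_MobileNetV2_fpnlite_320x320/saved_model"

loaded = tf.saved_model.load(saved_model_dir)
infer = loaded.signatures["serving_default"]

# Names, shapes, and dtypes of the inputs the converter sees.
print(infer.structured_input_signature)
# Output tensors of the detection model.
print(infer.outputs)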
I'm trying to use multiple GPUs (2) in Keras, using MirroredStrategy() from TensorFlow. However, it leads to the following error:
Epoch 1/5
WARNING:tensorflow:From /home/user/conda36/lib/python3.6/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.
2020-09-06 18:50:54.766930: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-06 18:50:55.069925: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-06 18:50:56.049400: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at depthwise_conv_op.cc:386 : Invalid argument: Computed output size would be negative: -1 [input_size: 3, effective_filter_size: 5, stride: 1]
At this point the model freezes. That means it is still running, but it does nothing.
If I run it without MirroredStrategy() it works totally fine, but of course it only uses 1 GPU.
I use MirroredStrategy() like this:
def get_model():
    .
    .
    .
    decoded = Conv2D(1, (3, 3), activation='linear', padding='same')(d)

    autoencoder = Model(input_img, decoded)
    autoencoder.summary()
    autoencoder.compile(optimizer='Adagrad', loss='mean_squared_error')
    return autoencoder

if __name__ == '__main__':
    # Model
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        autoencoder = get_model()
What could be the error?
Running AWS SageMaker with a custom model, the TrainingJob fails with an Algorithm Error when using Keras plus a Tensorflow backend in multi-gpu configuration:
from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=K)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
parallel_model.fit(x, y, epochs=20, batch_size=256)
This simple parallel model loading fails. There is no further error or exception from CloudWatch logging. This configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras TensorFlow backend.
According to the SageMaker documentation and tutorials, the multi_gpu_model utility works when the Keras backend is MXNet, but I did not find any mention of the backend being TensorFlow with the same multi-GPU configuration.
[UPDATE]
I have updated the code following the suggested answer below, and I'm adding some logging from just before the TrainingJob hangs.
This logging repeats twice:
2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Before that, there is some logging info about each GPU, which repeats 4 times:
2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
According to the logging, all 4 GPUs are visible and loaded in the TensorFlow Keras backend. After that no application logging follows; the TrainingJob status is InProgress for a while, after which it becomes Failed with the same Algorithm Error.
Looking at the CloudWatch logging I can see some metrics at work. Specifically, GPU Memory Utilization and CPU Utilization are OK, while GPU Utilization is 0%.
[UPDATE]
Due to a known bug in Keras about saving a multi-GPU model, I'm using this override of the multi_gpu_model utility from keras.utils:
from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)
This works OK on the local 2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04 machine. It fails on the SageMaker Training Job.
I have posted this issue on AWS Sagemaker forum in
TrainingJob custom algorithm with Keras backend and multi GPU
SageMaker Fails when using Multi-GPU with
keras.utils.multi_gpu_model
[UPDATE]
I have slightly modified the tf.Session code, adding some initializers:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
and now at least I can see that one GPU (I assume device gpu:0) is used, according to the instance metrics. Multi-GPU still does not work.
This might not be the best answer for your problem, but this is what I am using for a multi-GPU model with the TensorFlow backend. First I initialize using:
def setup_multi_gpus():
    """
    Setup multi GPU usage

    Example usage:
    model = Sequential()
    ...
    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.fit()

    About memory usage:
    https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
    """
    import tensorflow as tf
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: Tells tf to not occupy a specific amount of memory
    from keras.backend.tensorflow_backend import set_session
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    sess = tf.Session(config=config)
    set_session(sess)  # set this TensorFlow session as the default session for Keras

    # getting the number of GPUs
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']

    num_gpu = len(get_available_gpus())
    print('Amount of GPUs available: %s' % num_gpu)
    return num_gpu
Then I call
# Setup multi GPU usage
num_gpu = setup_multi_gpus()
and create a model.
...
After which you're able to make it a multi GPU model.
multi_model = multi_gpu_model(model, gpus=num_gpu)
multi_model.compile...
multi_model.fit...
The only thing here that is different from what you are doing is the way TensorFlow initializes the GPUs. I can't imagine it being the problem, but it might be worth trying out.
Good luck!
Edit: I noticed that sequence-to-sequence models do not work with multi-GPU. Is that the type of model you are trying to train?
I apologize for the slow response.
It seems there are a lot of threads that are running in parallel, and I want to link them together, so that other individuals who have the same issue can see the progress and discussion going on.
https://forums.aws.amazon.com/thread.jspa?messageID=881541
https://forums.aws.amazon.com/thread.jspa?messageID=881540
https://github.com/aws/sagemaker-python-sdk/issues/512
There are a few questions in regard to this.
What version of TensorFlow and Keras?
I am not too sure what is causing this problem. Does your container have all of the needed dependencies, such as CUDA, etc.? https://www.tensorflow.org/install/gpu
Were you able to train using a single GPU with Keras?
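As a quick sanity check for the CUDA question above, here is a small sketch (my suggestion, not from the thread) that lists the GPU devices TensorFlow actually sees from inside the container:

# Print the GPU devices visible to TensorFlow inside the SageMaker container.
from tensorflow.python.client import device_lib

gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU']
print('Visible GPUs:', gpus)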
I'm trying to draw mini-batches from the CIFAR-10 binary files.
When I run my code shown below (see [source code]), the machine (Python 3.6) keeps showing the message (see [Console]) and stops.
Can anyone tell me what the problem with my source code is?
P.S. I'm new to TensorFlow.
[source code]-------------------------------------------------------------
import tensorflow as tf
import numpy as np
import os
import matplotlib.pyplot as plt

def _get_image_and_label():
    # directory where binary files are stored
    data_dir = '/tmp/cifar10_data/cifar-10-batches-bin'

    # Step 1) make filename queue
    filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) for i in range(1, 6)]
    filename_queue = tf.train.string_input_producer(filenames)

    # Step 2) read files
    label_bytes = 1  # 2 for CIFAR-100
    height = 32
    width = 32
    depth = 3
    image_bytes = height * width * depth
    record_bytes = label_bytes + image_bytes
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    key, value = reader.read(filename_queue)

    # Step 3) decode the file in a unit of 1 byte
    record_bytes = tf.decode_raw(value, tf.uint8)

    # The first bytes represent the label, which we convert from uint8 -> int32.
    label = tf.cast(tf.strided_slice(record_bytes, [0], [label_bytes]), tf.int32)

    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.strided_slice(record_bytes, [label_bytes], [label_bytes + image_bytes]),
                             [depth, height, width])

    # Convert from [depth, height, width] to [height, width, depth].
    uint8image = tf.transpose(depth_major, [1, 2, 0])

    # set shape ( image: tf.float32, label: tf.int32 )
    image = tf.cast(uint8image, tf.float32)
    image.set_shape([height, width, 3])
    label.set_shape([1])

    # collect batch from the files
    # train_x_batch, train_y_batch = tf.train.batch([image, label], batch_size=1)
    # return train_x_batch, train_y_batch
    return image, label

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    image, label = _get_image_and_label()

    for i in range(10):
        image_batch, lable_batch = tf.train.batch([image, label], batch_size=1)
        image_batch_uint8 = tf.cast(image_batch, tf.uint8)
        final_image = sess.run(image_batch_uint8)
        imgplot = plt.imshow(final_image[0])

    coord.request_stop()
    coord.join(threads)
    sess.close()
[Console]-----------------------------------------------------------------
/home/dooseop/anaconda3/bin/python /home/dooseop/pycharm-community-2016.3.3/helpers/pydev/pydevd.py --multiproc --qt-support --client 127.0.0.1 --port 40623 --file /home/dooseop/PycharmProjects/Tensorflow/CIFAR10_main.py
warning: Debugger speedups using cython not found. Run '"/home/dooseop/anaconda3/bin/python" "/home/dooseop/pycharm-community-2016.3.3/helpers/pydev/setup_cython.py" build_ext --inplace' to build.
Connected to pydev debugger (build 163.15188.4)
pydev debugger: process 10992 is connecting
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.17GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
SIGKILL only happens when someone explicitly kills the program. The problem here is that start_queue_runners is being called before the queue runners are created (they are created by tf.train.batch). Also, for better performance, build the graph once and run it in a loop, as in:
image, label = _get_image_and_label()
image_batch, lable_batch = tf.train.batch([image, label], batch_size=1)
image_batch_uint8 = tf.cast(image_batch, tf.uint8)

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

for i in range(10):
    final_image = sess.run(image_batch_uint8)
    imgplot = plt.imshow(final_image[0])

coord.request_stop()
coord.join(threads)