In code that uses multiple graphs, or multiple versions of the same graph, it is sometimes necessary to ensure that one particular graph uses only the CPU for computation while another uses only the GPU.
The basic question is:
How can I make sure that a particular graph uses only the CPU, or only the GPU, for its computations?
There is no exhaustive discussion of this topic on SO, hence this question.
I have tried a number of different approaches, and none of them seem to work, as outlined below.
Before going into further detail on the question and the options that have been tried, here are the relevant setup details:
TensorFlow Version : 'v1.1.0-rc2-1003-g3792dd9' 1.1.0-rc2 (Compiled from source)
OS details : CentOS Linux release 7.2.1511 (Core)
Bazel version : 0.4.5
The basic code with which the various approaches have been tried is given below:
import tensorflow as tf
from tensorflow.python.client import timeline
import matplotlib.pyplot as plt
def coloraugment(image):
output = tf.image.random_brightness(image, max_delta=10./255.)
output = tf.clip_by_value(output, 0.0, 1.0)
output = tf.image.random_saturation(output, lower=0.5, upper=1.5)
output = tf.clip_by_value(output, 0.0, 1.0)
output = tf.image.random_contrast(output, lower=0.5, upper=1.5)
output = tf.clip_by_value(output, 0.0, 1.0)
return output
def augmentbody(image, sz):
for i in range(10):
if i == 0:
cropped = tf.random_crop(value=image, size=sz)
croppedflipped = tf.image.flip_left_right(cropped)
out = tf.stack([cropped, croppedflipped], axis=0)
else:
cropimg = tf.random_crop(value=image, size=sz)
augcolor = coloraugment(cropimg)
augflipped = tf.image.flip_left_right(augcolor)
coll = tf.stack([augcolor, augflipped], axis=0)
out = tf.concat([coll, out], axis=0)
out = tf.random_shuffle(out)
return out
def aspect1(aspectratio):
newheight = tf.constant(256, dtype=tf.float32)
newwidth = tf.divide(newheight, aspectratio)
newsize = tf.stack([newheight, newwidth], axis=0)
newsize = tf.cast(newsize, dtype=tf.int32)
return newsize
def aspect2(aspectratio):
newwidth = tf.constant(256, dtype=tf.float32)
newheight = tf.multiply(newwidth, aspectratio)
newsize = tf.stack([newheight, newwidth], axis=0)
newsize = tf.cast(newsize, dtype=tf.int32)
return newsize
def resize_image(image):
imageshape = tf.shape(image)
imageheight = tf.cast(tf.gather(imageshape, tf.constant(0, dtype=tf.int32)),
dtype=tf.float32)
imagewidth = tf.cast(tf.gather(imageshape, tf.constant(1, dtype=tf.int32)),
dtype=tf.float32)
aspectratio = tf.divide(imageheight, imagewidth)
newsize = tf.cond(tf.less_equal(imageheight, imagewidth),
lambda: aspect1(aspectratio),
lambda: aspect2(aspectratio))
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
image = tf.image.resize_images(image, newsize)
return image
def readimage(file_queue):
reader = tf.WholeFileReader()
key, value = reader.read(file_queue)
image = tf.image.decode_jpeg(value)
image = resize_image(image)
return image
if __name__ == "__main__":
queue = tf.train.string_input_producer(["holly2.jpg"])
image = readimage(queue)
augmented = augmentbody(image, [221,221,3])
init_op = tf.global_variables_initializer()
config_cpu = tf.ConfigProto()
config = tf.ConfigProto(
device_count = {'GPU': 0}
)
sess_cpu = tf.Session(config=config)
with tf.Session(config=config_cpu) as sess:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
sess.run(init_op)
[tense] = sess.run([augmented],options=run_options, run_metadata=run_metadata)
coord.request_stop()
coord.join(threads)
tl = timeline.Timeline(run_metadata.step_stats)
ctf = tl.generate_chrome_trace_format()
with open('timeline.json', 'w') as f:
f.write(ctf)
print("The tensor size is {}".format(tense.shape))
numcols = tense.shape[0]/2
for i in range(tense.shape[0]):
plt.subplot(2,numcols,i+1)
plt.imshow(tense[i, :, :, :])
plt.show()
plt.close()
Approaches that have been tried
Several related questions on SO have accepted answers, but they do not seem to work, as I show below with examples and outputs.
Approach 1
The related question is ( Run Tensorflow on CPU ). The accepted answer is to create the tf.Session() with the following configuration:
config = tf.ConfigProto(
device_count = {'GPU': 0}
)
sess = tf.Session(config=config)
The corresponding output is :
2017-05-18 13:34:27.477189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 7.80GiB
2017-05-18 13:34:27.477232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-18 13:34:27.477240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-18 13:34:27.477259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:34:27.482600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:34:27.848864: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:34:27.848902: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:34:27.851670: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f0fd81d5500 executing computations on platform Host. Devices:
2017-05-18 13:34:27.851688: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:34:27.851894: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:34:27.851903: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:34:27.854698: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f0fd82b4c50 executing computations on platform CUDA. Devices:
2017-05-18 13:34:27.854713: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2017-05-18 13:34:28.918980: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can clearly see that the GPU is still being used and that the XLA service is running on the GPU.
Approach 2
The related question is again ( Run Tensorflow on CPU ). This answer states that the following environment variable can be set to force CPU usage:
CUDA_VISIBLE_DEVICES=""
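The variable can also be set from inside Python; a minimal sketch (it must happen before TensorFlow initializes its GPU devices, so the safest place is before the tensorflow import):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # hide all GPUs from this process

import tensorflow as tf   # imported afterwards, so only the CPU device is visible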
When GPU computation is required again, the variable can be unset.
The corresponding output is
2017-05-18 13:42:24.871020: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-05-18 13:42:24.871071: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: nefgpu12
2017-05-18 13:42:24.871081: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: nefgpu12
2017-05-18 13:42:24.871114: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 367.48.0
2017-05-18 13:42:24.871147: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
2017-05-18 13:42:24.871170: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.48.0
2017-05-18 13:42:24.871178: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 367.48.0
2017-05-18 13:42:25.159632: W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
2017-05-18 13:42:25.159674: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:42:25.162626: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f5798002df0 executing computations on platform Host. Devices:
2017-05-18 13:42:25.162663: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:42:25.223309: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can see from this output that the GPU is not being used.
Approach 3
The related question is ( Running multiple graphs in different device modes in TensorFlow ). One answer gives the following solution :
# The config for CPU usage
config_cpu = tf.ConfigProto()
config_cpu.gpu_options.visible_device_list=''
sess_cpu = tf.Session(config=config_cpu)
# The config for GPU usage
config_gpu = tf.ConfigProto()
config_gpu.gpu_options.visible_device_list='0'
sess_gpu = tf.Session(config=config_gpu)
The output when using the CPU configuration from this solution is as follows:
2017-05-18 13:50:32.999431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 7.80GiB
2017-05-18 13:50:32.999472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-18 13:50:32.999478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-18 13:50:32.999490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:50:33.084737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:50:33.395798: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:50:33.395837: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:50:33.398634: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f08181ecfa0 executing computations on platform Host. Devices:
2017-05-18 13:50:33.398695: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:50:33.398908: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:50:33.398920: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:50:33.401731: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f081821e1f0 executing computations on platform CUDA. Devices:
2017-05-18 13:50:33.401745: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2017-05-18 13:50:34.484142: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can see that the GPU is still being used.
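To make the goal concrete, this is the kind of setup I am ultimately after, written out as a sketch (separate graphs with per-graph sessions; this is not code from the approaches above):
import tensorflow as tf

g_cpu = tf.Graph()
with g_cpu.as_default(), tf.device('/cpu:0'):
    a_cpu = tf.random_uniform((512, 512))
    out_cpu = tf.matmul(a_cpu, a_cpu)

g_gpu = tf.Graph()
with g_gpu.as_default(), tf.device('/gpu:0'):
    a_gpu = tf.random_uniform((512, 512))
    out_gpu = tf.matmul(a_gpu, a_gpu)

# one session per graph: the CPU session should never touch the GPU
sess_cpu = tf.Session(graph=g_cpu, config=tf.ConfigProto(device_count={'GPU': 0}))
sess_gpu = tf.Session(graph=g_gpu)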
See issues #9201 and #2175. The fact that the GPU devices are created does not mean that your graph is necessarily running on the GPU. You can enforce CPU execution with device_count = {'GPU': 0} or tf.device, but the GPU devices are still created with the session, just in case some op wants them. As for CUDA_VISIBLE_DEVICES, making it empty did not work for me either, but doing export CUDA_VISIBLE_DEVICES="-1" (before starting Python, or inside Python through os.environ before importing TensorFlow) did the trick (TensorFlow will output a warning about the GPU not being found, but it will work). You can see the documentation for CUDA_VISIBLE_DEVICES here.
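Putting the two working pieces together, a minimal sketch of what I mean (the environment variable has to be set before TensorFlow touches CUDA, and tf.device pins the graph explicitly):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # same effect as: export CUDA_VISIBLE_DEVICES="-1"

import tensorflow as tf

# build the CPU-only graph; tf.device makes the intended placement explicit
with tf.device('/cpu:0'):
    a = tf.random_uniform((1024, 1024))
    b = tf.matmul(a, a)

# device_count={'GPU': 0} additionally stops the session from creating GPU devices
with tf.Session(config=tf.ConfigProto(device_count={'GPU': 0})) as sess:
    print(sess.run(b).shape)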
Related
I am using a cluster that has 8 x 2080 Ti GPUs (11 GB each) for distributed deep learning. My goal is to use all the GPUs for training the model. My code uses MPI to gather all the processes across the cluster and tries to distribute the work across all the workers, but this gives me an error. I am using Python 3.9 and TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1.
I am currently lost and don't know what I need to do in this case. I tried installing other TensorFlow versions using conda, but it ends up the same way.
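For context, a hypothetical sketch of what a set_tf_config_mpi() helper of this kind typically looks like (the actual helper in main.py is not shown in the question; the port number here is illustrative only): it builds TF_CONFIG for MultiWorkerMirroredStrategy from the MPI rank and the gathered hostnames.
import json, os
from mpi4py import MPI

def set_tf_config_mpi(base_port=37672):
    """Build TF_CONFIG for MultiWorkerMirroredStrategy from the MPI world."""
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    hosts = comm.allgather(MPI.Get_processor_name())          # one hostname per rank
    workers = ["{}:{}".format(h, base_port) for h in hosts]   # assumes one task per node
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": rank},
    })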
slurm file :
#!/bin/bash
#SBATCH --job-name=job1 # Job name
#SBATCH --mem=30000 # Job memory request
#SBATCH --gres=gpu:4 # Number of requested GPU(s)
#SBATCH --time=3-23:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_2080 # Specific hardware constraint
#SBATCH --error=slurm.err # Error file name
#SBATCH --output=slurm.out # Output file name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-2%1
if [ -d "model-final" ]
then
scancel $SLURM_ARRAY_JOB_ID
else
module load Anaconda3/2020.07
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
mpirun python -u main.py resume_latest
fi
my error:
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
2023-01-18 13:18:35.789808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9687 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3d:00.0, compute capability: 7.5
2023-01-18 13:18:35.790848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9687 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3e:00.0, compute capability: 7.5
2023-01-18 13:18:35.791743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 9687 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:88:00.0, compute capability: 7.5
2023-01-18 13:18:35.792678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 9687 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2023-01-18 13:18:35.804893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:0 with 9687 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3d:00.0, compute capability: 7.5
2023-01-18 13:18:35.805620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:1 with 9687 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3e:00.0, compute capability: 7.5
2023-01-18 13:18:35.806333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:2 with 9687 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:88:00.0, compute capability: 7.5
2023-01-18 13:18:35.807029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:3 with 9687 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2023-01-18 13:18:35.810512: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> g01:37672}
2023-01-18 13:18:35.810736: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://g01:37672
/usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/keras/optimizer_v2/optimizer_v2.py:355: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
warnings.warn(
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
2023-01-18 13:18:42.547198: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_DOUBLE
type: DT_DOUBLE
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 15
}
}
shape {
dim {
size: 13
}
}
}
}
}
2023-01-18 13:18:42.740015: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[g01:44037:0:44313] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 44313) ====
0 0x000000000002137e ucs_debug_print_backtrace() /umbc/ebuild-soft/cascade-lake/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x000000000382045b tensorflow::NcclCommunicator::Enqueue() collective_communicator.cc:0
2 0x0000000005c9f88a tensorflow::NcclReducer::Run() ???:0
3 0x00000000009086dc tensorflow::BaseCollectiveExecutor::ExecuteAsync(tensorflow::OpKernelContext*, tensorflow::CollectiveParams const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (tensorflow::Status const&)>)::{lambda()#3}::operator()() base_collective_executor.cc:0
4 0x0000000000b99403 tensorflow::UnboundedWorkQueue::PooledThreadFunc() ???:0
5 0x0000000000b9f6b1 tensorflow::(anonymous namespace)::PThread::ThreadFn() env.cc:0
6 0x0000000000007ea5 start_thread() pthread_create.c:0
7 0x00000000000feb0d __clone() ???:0
=================================
[g01:44037] *** Process received signal ***
[g01:44037] Signal: Segmentation fault (11)
[g01:44037] Signal code: (-6)
[g01:44037] Failing at address: 0x2ecf70000ac05
[g01:44037] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaab7e6630]
[g01:44037] [ 1] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x382045b)[0x2aaab68fc45b]
[g01:44037] [ 2] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow11NcclReducer3RunESt8functionIFvRKNS_6StatusEEE+0x1ca)[0x2aaab8d7b88a]
[g01:44037] [ 3] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x9086dc)[0x2aaadc7556dc]
[g01:44037] [ 4] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow18UnboundedWorkQueue16PooledThreadFuncEv+0x1b3)[0x2aaadc9e6403]
[g01:44037] [ 5] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0xb9f6b1)[0x2aaadc9ec6b1]
[g01:44037] [ 6] /lib64/libpthread.so.0(+0x7ea5)[0x2aaaab7deea5]
[g01:44037] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2aaaac468b0d]
[g01:44037] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 44037 on node g01 exited on signal 11 (Segmentation fault).
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# Load in the parameter files
from json import load as loadf
with open("params.json", 'r') as inFile:
params = loadf(inFile)
# Get data files and prep them for the generator
from tensorflow import distribute as D
callbacks = []
devices = getDevices()
print(devices)
set_tf_config_mpi()
strat = D.experimental.MultiWorkerMirroredStrategy(
communication=D.experimental.CollectiveCommunication.NCCL)
# Create network
from sys import argv
resume_training = False
print(argv)
if "resume_latest" in argv:
resume_training = True
with strat.scope():
# Scheduler
if isinstance(params["learning_rate"], str):
# Get the string for the importable function
lr = params["learning_rate"]
from tensorflow.keras.callbacks import LearningRateScheduler
# Use a dummy learning rate
params["learning_rate"] = 0.1
# model = create_model(**params)
# Get the importable function
lr = lr.split(".")
baseImport = __import__(lr[0], globals(), locals(), [lr[1]], 0)
lr = getattr(baseImport, lr[1])
# Make a schedule
lr = LearningRateScheduler(lr)
callbacks.append(lr)
# Resume Model?
model_name = None
if resume_training:
initial_epoch, model_name = getInitialEpochsAndModelName(rank)
if model_name is None:
initial_epoch=0
model = create_model(**params)
resume_training = False
else:
from tensorflow.keras.models import load_model
model = load_model(model_name)
# Load data from disk
import numpy
if "root" in params.keys():
root = params['root']
else:
root = "./"
if "filename" in params.keys():
filename = params["filename"]
else:
filename = "dataset_timeseries.csv"
restricted = [
'euc1', 'e1', 'x1', 'y1', 'z1',
'euc2', 'e2', 'x2', 'y2', 'z2',
'euc3', 'e3', 'x3', 'y3', 'z3',
]
x, y = getOneHot("{}/{}".format(root, filename), restricted=restricted, **params)
# val_x, val_y = getOneHot("{}/{}".format(root, val_filename), restricted=restricted)
val_x, val_y = None, None
params["gbatch_size"] = params['batch_size'] * len(devices)
print("x.shape =", x.shape)
print("y.shape =", y.shape)
print("epochs =", params['epochs'], type(params['epochs']))
print("batch =", params['batch_size'], type(params['batch_size']))
print("gbatch =", params['gbatch_size'], type(params['gbatch_size']))
# Load data into a distributed dataset
# Dataset object does nothing in place:
# https://stackoverflow.com/questions/55645953/shape-of-tensorflow-dataset-data-in-keras-tensorflow-2-0-is-wrong-after-conver
from tensorflow.data import Dataset
data = Dataset.from_tensor_slices((x, y))
# Create validation set
v = params['validation']
if val_x is not None:
vrecord = val_x.shape[0]
val = Dataset.from_tensor_slices((val_x, val_y))
validation = val # data.take(vrecord)
else:
vrecord = int(x.shape[0]*v)
validation = data.take(vrecord)
validation = validation.batch(params['gbatch_size'])
validation = validation.repeat(params['epochs'])
# Validation -- need to do kfold one day
# This set should NOT be distributed
vsteps = vrecord // params['gbatch_size']
if vrecord % params['gbatch_size'] != 0:
vsteps += 1
# Shuffle the data during preprocessing or suffer...
# Parallel randomness == nightmare
# data = data.shuffle(x.shape[0])
# Ordering these two things is very important!
# Consider 3 elements, batch size 2 repeat 2
# [1 2 3] -> [[1 2] [3]] -> [[1 2] [3] [1 2] [3]] (correct) batch -> repeat
# [1 2 3] -> [1 2 3 1 2 3] -> [[1 2] [3 1] [2 3]] (incorrect) repeat -> batch
# data = data.skip(vrecord)
data = data.batch(params['gbatch_size'])
data = data.repeat(params['epochs'])
records = x.shape[0] # - vrecord
steps = records // params['gbatch_size']
if records % params['gbatch_size']:
steps += 1
print("steps =", steps)
# Note that if we are resuming that the number of _remaining_ epochs has
# changed!
# The number of epochs * steps is the numbers of samples to drop
print("initial cardinality = ", data.cardinality())
print("initial v cardinality = ", data.cardinality())
data = data.skip(initial_epoch*steps)
validation = validation.skip(initial_epoch*vsteps)
print("final cardinality = ", data.cardinality())
print("final v cardinality = ", data.cardinality())
# data = strat.experimental_distribute_dataset(data)
# Split into validation and training
callbacks = createCallbacks(params, callbacks, rank, resume_training)
print(callbacks)
history = model.fit(data, epochs=params['epochs'],
batch_size=params['gbatch_size'],
steps_per_epoch=steps,
verbose=0,
initial_epoch=initial_epoch,
validation_data=validation,
validation_steps=vsteps,
callbacks=callbacks)
if rank == 0:
model.save("model-final")
else:
model.save("checkpoints/model-tmp")
Version I used:
python 3.6.5
mxnet 1.5.0
cuda 9.2 (I also installed CUDA 11.4 and cuDNN 8.2.4 because, when I checked in cmd, my NVIDIA driver reported that version)
cudnn 7.6.5
Windows 10 64-bit
Question:
I used MXNet and GluonCV for image segmentation, and a GPU problem occurred consistently.
I installed and uninstalled almost every CUDA version (and the matching cuDNNs), but it didn't help.
Also, I'm a little confused about whether I should use mxnet-cu92 or something else.
When I first installed CUDA 11.4, I installed mxnet-cu101 (mxnet-cu112 didn't work for me),
but I found that cu92 is the one for using the GPU, so I installed it again with CUDA 9.2,
and it is still not working.
Here is my code:
ctx = mx.gpu(0)
model = gluoncv.model_zoo.get_model('fcn_resnet50_ade', pretrained=True, ctx=ctx) #deeplab_resnet101_ade #fcn_resnet50_ade
total_df = pd.DataFrame(columns=ADE20KSegmentation.CLASSES)
start = time.time()
Moly = []
Fences = {}
for i in range(len(image_file)):
if i%100==0:
print(i)
print(time.time()-start)
start = time.time()
img = mx.image.imread(image_file[i])
image = test_transform(mx.img.imresize(img, 1200, 1200), ctx)
output_array = model.predict(image)
predict_index = mx.nd.argmax(output_array,1).asnumpy()
holy = find_fence(predict_index)
Moly.append(holy)
flat = predict_index.flatten()
output_dict = {}
for index, cls in enumerate(ADE20KSegmentation.CLASSES):
num_pixel = len(np.where(flat==index)[0])
output_dict[cls] = round(num_pixel/1440000, 4)
total_df = total_df.append(output_dict, ignore_index=True)
for names, holy in zip(image_names, Moly):
Fences[names] = holy
And I got the error "MXNetError: C:\Jenkins\workspace\mxnet-tag\mxnet\src\ndarray\ndarray.cc:1285: GPU is not enabled" on this line:
model = gluoncv.model_zoo.get_model('fcn_resnet50_ade', pretrained=True, ctx=ctx)
What should I do now?
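Before anything else, it may help to check whether the installed MXNet build can see the GPU at all; a minimal sketch, separate from the pipeline above:
import mxnet as mx

print(mx.__version__)
print(mx.context.num_gpus())   # 0 means this build/driver combination sees no GPU

# allocating a tiny array on the GPU raises "GPU is not enabled" when the
# CPU-only `mxnet` package is installed instead of a CUDA build such as mxnet-cu92
x = mx.nd.zeros((1,), ctx=mx.gpu(0))
print(x)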
def create_hparams():
return trainer_lib.create_hparams(
FLAGS.hparams_set,
FLAGS.hparams,
data_dir=os.path.expanduser(FLAGS.data_dir),
problem_name=FLAGS.problem)
def create_decode_hparams():
decode_hp = decoding.decode_hparams(FLAGS.decode_hparams)
decode_hp.shards = FLAGS.decode_shards
decode_hp.shard_id = FLAGS.worker_id
decode_in_memory = FLAGS.decode_in_memory or decode_hp.decode_in_memory
decode_hp.decode_in_memory = decode_in_memory
decode_hp.decode_to_file = FLAGS.decode_to_file
decode_hp.decode_reference = FLAGS.decode_reference
return decode_hp
hp = create_hparams()
decode_hp = create_decode_hparams()
run_conf = t2t_trainer.create_run_config(hp)
estimator = trainer_lib.create_estimator(
FLAGS.model,
hp,
run_conf,
decode_hparams=decode_hp,
use_tpu=FLAGS.use_tpu)
print(run_conf.session_config)
def input_fn():
inputs = tf.placeholder(tf.int32, shape=(1, None, 1, 1), name="inputs")
input_tensor = {'inputs': inputs }
return tf.estimator.export.ServingInputReceiver(input_tensor, input_tensor)
predictor=tf.contrib.predictor.from_estimator(estimator, input_fn)
I got the following output:
InvalidArgumentError: Cannot assign a device for operation transformer/body/parallel_0/body/encoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention: Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info: Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='' supported_device_types_=[CPU] possible_devices_=[])
ImageSummary: CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
transformer/body/parallel_0/body/encoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention (ImageSummary) /device:GPU:0
Op: ImageSummary
Node attrs: max_images=1, T=DT_FLOAT, bad_color=Tensor
Registered kernels: device='CPU'
When I print run_conf.session_config, I see allow_soft_placement: true. Many people have said this solves the InvalidArgumentError, but it does not seem to work for me.
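For reference, this is the behaviour I expected from allow_soft_placement, shown on a plain TF1 session rather than on the tensor2tensor predictor path (a minimal sketch; ImageSummary is exactly the CPU-only op from the error above):
import tensorflow as tf

with tf.device('/device:GPU:0'):
    img = tf.zeros([1, 4, 4, 3], dtype=tf.float32)
    summ = tf.summary.image("probe", img, max_outputs=1)  # ImageSummary has no GPU kernel

# with soft placement the op silently falls back to the CPU instead of raising
# InvalidArgumentError; log_device_placement shows where it actually ran
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(summ)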
I have trained a gradient boosted classifier with the TF example code from
https://www.tensorflow.org/tutorials/estimators/boosted_trees_model_understanding
but the TF estimator gradient boosted classifier suddenly stopped while training.
It takes several steps at the beginning and then suddenly stops, without printing any exception.
How can I find out why Python crashed? It is hard to work out the reason why it stopped.
Environment:
lib : TF-gpu 1.13.1
cuda : 10.0
cudnn : 7.5
logs :
2019-04-15 16:40:26.175889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845 pciBusID: 0000:07:00.0 totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-04-15 16:40:26.182620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-15 16:40:26.832040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-15 16:40:26.835620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-15 16:40:26.836840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-15 16:40:26.838276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4716 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:07:00.0, compute capability: 6.1)
WARNING:tensorflow:From D:\python\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix.
WARNING:tensorflow:From D:\python\lib\site-packages\tensorflow\python\training\saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes.
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. '_Resource' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. '_Resource' object has no attribute 'name'
D:\py> (just finished on training)
trn = pd.read_csv('data/santander-customer-transaction-prediction/train.csv')
tst = pd.read_csv('data/santander-customer-transaction-prediction/test.csv')
#trn = upsample(trn[trn.target==0], trn[trn.target==1])
# trn = downsample(trn[trn.target==0], trn[trn.target==1])
features = trn.columns.values[2:202]
target_name = trn.columns.values[1]
train=trn[features]
target=trn[target_name]
NUM_EXAMPLES = len (target)
print (NUM_EXAMPLES)
feat1 = train.corrwith(target).sort_values().head(20).index
feat2 = train.corrwith(target).sort_values().tail(20).index
featonly = feat1.append(feat2)
feat = featonly.append(pd.Index(['target']))
train_origin, tt = train_test_split(trn, test_size=0.2)
train = train_origin[featonly]
target = train_origin[target_name]
test = tst[featonly]
target_name_tst = tst.columns.values[1]
target_tst=tst[target_name_tst]
val_origin=tt
val_train = tt[featonly]
val_target = tt[target_name]
# Training and evaluation input functions.
train_input_fn = make_input_fn(train, target)
val_input_fn = make_input_fn(val_train, val_target)
ttt=tf.estimator.inputs.pandas_input_fn(x=test,num_epochs=1,shuffle=False)
del train,target,val_train,train_origin,trn,tst
fc = tf.feature_column
feature_columns = []
for feature_name in featonly:
feature_columns.append(fc.numeric_column(feature_name,dtype=tf.float32))
#feature_columns
#5
#tf.logging.set_verbosity(tf.logging.INFO)
#logging_hook = tf.train.LoggingTensorHook({"loss" : loss, "accuracy" : accuracy}, every_n_iter=10)
params = {
'n_trees': 50,
'max_depth': 3,
'n_batches_per_layer': 1,
# You must enable center_bias = True to get DFCs. This will force the model to
# make an initial prediction before using any features (e.g. use the mean of
# the training labels for regression or log odds for classification when
# using cross entropy loss).
'center_bias': True
}
# config = tf.estimator.RunConfig().replace(keep_checkpoint_max = 1,
# log_step_count_steps=20, save_checkpoints_steps=20)
est = tf.estimator.BoostedTreesClassifier(feature_columns, **params,model_dir='d:\py/model/')
est.train(train_input_fn, max_steps=50)
-------------------------------------------stopped
metrics = est.evaluate(input_fn=val_input_fn,steps=1)
results = est.predict(input_fn=ttt )
result_list = list(results)
classi = list(map(lambda x : x['classes'][0].decode("utf-8"), result_list))
num = list(range(0,len(classi)))
numi = list(map(lambda x : 'test_' + str(x),num))
#df1 = pd.DataFrame(columns=('ID_code','target'))
df_result = pd.DataFrame({'ID_code' : numi, 'target' : classi})
df_result.to_csv('result/submission03.csv',index=False)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
def input_fn():
NUM_EXAMPLES = len(y)
dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
# dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
#if shuffle:
# dataset = dataset.shuffle(NUM_EXAMPLES)
# For training, cycle thru dataset as many times as need (n_epochs=None).
dataset = (dataset.repeat(n_epochs).batch(NUM_EXAMPLES))
return dataset
return input_fn
The evaluation result should be shown.
I think the problem is caused by GPU memory overflow.
You can try increasing the value of 'n_batches_per_layer' according to your GPU memory size.
I worked with a 6 GB GPU, and a value of 16 worked for me.
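A minimal sketch of the change (treat 16 as a starting point that happened to fit my 6 GB card, and tune it for yours):
params = {
    'n_trees': 50,
    'max_depth': 3,
    'n_batches_per_layer': 16,  # build each layer from more, smaller batches
    'center_bias': True
}
est = tf.estimator.BoostedTreesClassifier(feature_columns, **params, model_dir='d:/py/model/')
est.train(train_input_fn, max_steps=50)
Note that make_input_fn in your code batches the whole training set into a single batch; for n_batches_per_layer to actually reduce the per-batch memory footprint, the batch size there would also need to shrink accordingly (e.g. roughly NUM_EXAMPLES // 16).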
I am observing that on my machine tf.matmul in TensorFlow is running significantly slower than the dot product in numpy. I have a GTX 1080 GPU, and I expect tf.matmul to be at least as fast as running the code on the CPU (numpy).
Environment Info
Operating System
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.10
Release: 16.10
Codename: yakkety
Installed version of CUDA and cuDNN:
ls -l /usr/local/cuda-8.0/lib64/libcud*
-rw-r--r-- 1 root root 556000 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root 19 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rwxr-xr-x 1 root root 415432 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root root 775162 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart_static.a
lrwxrwxrwx 1 voldemaro users 13 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 voldemaro users 18 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 voldemaro users 84163560 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 voldemaro users 70364814 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn_static.a
TensorFlow Setup
python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0
Code:
'''
Created on Sep 28, 2017
#author: voldemaro
Running on I7/GTX 1080
no MKL
('TF version: ', 'v1.0.0-rc2-15-g47bba63-dirty')
('TF url: ', 'https://github.com/tensorflow/tensorflow/commit/47bba63')
Timing in ms for 2048 x 2048 SVD of type <type 'numpy.float32'> and matmul for 16920 x 2048 of type <type 'numpy.float32'>
numpy default SVD min: 3956.20, median: 4127.75, mean: 4264.41
TF CPU SVD min: 5926.43, median: 5951.70, mean: 5961.43
TF GPU SVD min: 5917.10, median: 6015.87, mean: 6039.63
numpy default .dot product min: 5816.97, median: 5933.43, mean: 5965.22
TF CPU matmul min: 21939.19, median: 22485.99, mean: 22374.69
TF GPU matmul min: 22026.52, median: 22109.97, mean: 22199.43
'''
from scipy import linalg; # for svd
import numpy as np;
import os;
import sys;
import time;
os.environ["TF_CPP_MIN_LOG_LEVEL"]="2" # nospam
import tensorflow as tf;
import gc; gc.disable();
NUM_RUNS = 5;
dtype = np.float32;
N=2048;
M = 16920;
def get_tensorflow_version_url():
import tensorflow as tf
version=tf.__version__
commit = tf.__git_version__
# commit looks like this
# 'v1.0.0-65-g4763edf-dirty'
commit = commit.replace("'","")
if commit.endswith('-dirty'):
dirty = True
commit = commit[:-len('-dirty')]
commit=commit.rsplit('-g', 1)[1]
url = 'https://github.com/tensorflow/tensorflow/commit/'+commit
return url
def get_mkl_version():
import ctypes
import numpy as np
ver = np.zeros(199, dtype=np.uint8)
mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
mkl.MKL_Get_Version_String(ver.ctypes.data_as(ctypes.c_char_p), 198)
return ver[ver != 0].tostring()
timeline_counter = 0
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE);
def benchmark(message, func):
time_list = []
for i in range(NUM_RUNS):
start_time = time.time();
func();
time_list.append(time.time()-start_time);
time_list = 1000*np.array(time_list); # get seconds, convert to ms
if len(time_list)>0:
min = np.min(time_list);
median = np.median(time_list);
formatted = ["%.2f"%(d,) for d in time_list[:10]];
result = "min: %8.2f, median: %8.2f, mean: %8.2f"%(min, median, np.mean(time_list))
else:
result = "empty"
print("%-20s %s"%(message, result))
if np.__config__.get_info("lapack_mkl_info"):
print("MKL version", get_mkl_version())
else:
print("no MKL")
print("TF version: ", tf.__git_version__)
print("TF url: ", get_tensorflow_version_url())
svd_array = np.random.random_sample((N,N)).astype(dtype);
another_array = np.random.random_sample((M,N)).astype(dtype);
init_OP = tf.global_variables_initializer();
with tf.device("/gpu:0"):
init_holder_gpu = tf.placeholder(dtype, shape=(M,M));
specVarGPU = tf.random_uniform((N,N), dtype=dtype);
S_gpu = tf.random_uniform((M,N), dtype=dtype);
V_gpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_gpu))), specVarGPU, ), tf.transpose(S_gpu));
[D2_gpu, E1_gpu, E2_gpu] = tf.svd(specVarGPU);
with tf.device("/cpu:0"):
init_holder_cpu = tf.placeholder(dtype, shape=(M,M));
specVarCPU = tf.random_uniform((N,N), dtype=dtype);
S_cpu = tf.random_uniform((M,N), dtype=dtype);
V_cpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_cpu))), specVarCPU, ), tf.transpose(S_cpu));
[D2_cpu, E1_cpu, E2_cpu] = tf.svd(specVarCPU);
V_cpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_cpu))), E1_cpu), tf.transpose(S_cpu));
print("Timing in ms for %d x %d SVD of type %s and matmul for %d x %d of type %s"%(N, N, dtype, M, N, dtype));
def func(): linalg.svd(svd_array)
benchmark("numpy default SVD", func)
config = tf.ConfigProto(allow_soft_placement = True, graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)));
sess = tf.Session(config = config);
sess.run(init_OP);
def func2(): sess.run([D2_cpu.op, E1_cpu.op, E2_cpu.op]);
benchmark("TF CPU SVD", func2);
def func3(): sess.run([D2_gpu.op, E1_gpu.op, E2_gpu.op]);
benchmark("TF GPU SVD", func3);
def func1(): np.transpose(np.asmatrix(another_array)).getH().dot(svd_array).dot(np.transpose(another_array));
benchmark("numpy default .dot product", func1)
def func4(): sess.run([V_cpu]);
benchmark("TF CPU matmul", func4)
def func5(): sess.run([V_gpu])
benchmark("TF GPU matmul", func4)
Apparently TensorFlow does not optimize "nested" operations, so
tf.matmul(tf.transpose(tf.conj(a)), x) takes significantly longer than b = tf.conj(a); c = tf.transpose(b); d = tf.matmul(c, x).
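Spelled out with placeholder tensors (a sketch of the pattern described above, not your exact graph; the adjoint_a variant at the end is an aside, not something from the original code):
import tensorflow as tf

a = tf.random_uniform((2048, 2048))
x = tf.random_uniform((2048, 2048))

# nested form, as a single expression
y_nested = tf.matmul(tf.transpose(tf.conj(a)), x)

# split form: same result, with each intermediate bound to its own name
b = tf.conj(a)
c = tf.transpose(b)
y_split = tf.matmul(c, x)

# tf.matmul can also apply the conjugate transpose itself, avoiding the explicit ops
y_adjoint = tf.matmul(a, x, adjoint_a=True)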
For SVD, the problem is that there is no GPU Kernel for SVD yet. See here: https://github.com/tensorflow/tensorflow/issues/11588
This means that SVD has to be computed on the CPU, even if the tensors are instantiated on the GPU. For this reason, there's an overhead for transferring data from the GPU to the CPU for computation, then back to the GPU for storing results.
For matmul on the GPU the problem is in the last line of your benchmarking code: you are not calling func5 but func4 again, so you are benchmarking the TF CPU matmul.
Aside from this, there are a few other things you may want to check in your code:
there is no need for the init_holder_cpu and init_holder_gpu vars, as you don't use them
there is no need to run the global_variables_initializer, as there are no variables
you are redefining V_cpu, using one of the outputs from SVD, so you are effectively running both SVD and the matmul in your test
A slightly cleaned up version of the code looks like:
# ... above is the same
print("TF version: ", tf.__git_version__)
print("TF url: ", get_tensorflow_version_url())
svd_array = np.random.random_sample((N,N)).astype(dtype)
another_array = np.random.random_sample((M,N)).astype(dtype)
with tf.device("/gpu:0"):
specVarGPU = tf.random_uniform((N, N), dtype=dtype)
S_gpu = tf.random_uniform((M, N), dtype=dtype)
V_gpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_gpu))), specVarGPU, ), tf.transpose(S_gpu))
D2_gpu, E1_gpu, E2_gpu = tf.svd(specVarGPU)
with tf.device("/cpu:0"):
specVarCPU = tf.random_uniform((N,N), dtype=dtype)
S_cpu = tf.random_uniform((M,N), dtype=dtype)
V_cpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_cpu))), specVarCPU, ), tf.transpose(S_cpu))
D2_cpu, E1_cpu, E2_cpu = tf.svd(specVarCPU)
config = tf.ConfigProto(allow_soft_placement = True, graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
def V_numpy():
np.matmul(np.matmul(np.transpose(np.transpose(np.conj(another_array))), svd_array, ), np.transpose(another_array))
with tf.Session(config = config) as sess:
print("Timing in ms for %d x %d SVD of type %s and matmul for %d x %d of type %s"%(N, N, dtype, M, N, dtype))
benchmark("numpy default SVD", lambda: linalg.svd(svd_array))
benchmark("TF CPU SVD", lambda: sess.run([D2_cpu.op, E1_cpu.op, E2_cpu.op]))
benchmark("TF GPU SVD", lambda: sess.run([D2_gpu.op, E1_gpu.op, E2_gpu.op]))
benchmark("numpy MKL matmul", V_numpy)
benchmark("TF CPU matmul", lambda: sess.run([V_cpu.op]))
benchmark("TF GPU matmul", lambda: sess.run([V_gpu.op]))
And the outputs (on an i7 and a GTX 1070):
MKL version b'Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications'
TF version: v1.4.0-rc1-11-g130a514
TF url: https://github.com/tensorflow/tensorflow/commit/130a514
Timing in ms for 2048 x 2048 SVD of type <class 'numpy.float32'> and matmul for 16920 x 2048 of type <class 'numpy.float32'>
numpy default SVD min: 3318.42, median: 3320.40, mean: 3320.40
TF CPU SVD min: 4576.71, median: 4577.02, mean: 4577.02
TF GPU SVD min: 14022.59, median: 14172.69, mean: 14172.69
numpy MKL matmul min: 4500.33, median: 4628.01, mean: 4628.01
TF CPU matmul min: 15420.19, median: 15664.84, mean: 15664.84
TF GPU matmul min: 277.80, median: 282.54, mean: 282.54
You can see that the GPU version of matmul is much faster than any CPU implementation, as expected.