How to load PluggableDevice into tensorflow to use Mac M1 GPU? - tensorflow

I have successfully built tensorflow v2.9 C library by following this gist on Mac M1.
When I check the devices in Rust code, only CPU device is available. bundle.session.device_list() only returns CPU device:
let bundle = SavedModelBundle::load(
&SessionOptions::new(),
&["serve"],
&mut graph,
export_dir
).expect("Unable to load model from disk");
println!("{:?}", bundle.session.device_list() )
Device { name: "/job:localhost/replica:0/task:0/device:CPU:0", device_type: "CPU", memory_bytes: 268435456, incarnation: 10072007419359857694 }]
The Rust code are using bindings to tensorflow's C API (e.g. device_list = TF_DeviceList). Apple M1's GPU is supported by MPS plugin (https://developer.apple.com/metal/tensorflow-plugin/). From my test, it does work in python. The MPS is implemented as a PluggableDevice to tensorflow, that is to say, it can be loaded without modification to tensorflow source code.
How can I use the deivce plugin in tensorflow?

There is an experimental API called TF_LoadPluggableDeviceLibrary
So first ensure Apple's metal plugin is installed, then find the location of file libmetal_plugin.dylib. Either copy this file into PATH, or add its directory into DYLD_LIBRARY_PATH environment variable.
On application startup, before loading the model, load this library first:
pub fn load_plugable_device(library_filename: &str) -> Result<(), Status> {
use std::ffi::{ CString };
let c_filename = CString::new(library_filename)?;
let raw_lib = unsafe {
let raw_status: *mut tf::TF_Status = tf::TF_NewStatus();
let raw_lib = tf::TF_LoadPluggableDeviceLibrary(c_filename.as_ptr(), raw_status);
if !raw_status.is_null() {
tf::TF_DeleteStatus(raw_status)
}
raw_lib
};
if raw_lib.is_null() {
Err(Status::new())
} else {
Ok(())
}
}
I am using Rust bindings and this experimental API is missing in tensorflow-sys crate. Hence I add it there. You may check my github repository for source code.
Then the lovely GPU is enabled.
2022-08-17 08:16:29.397578: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel
2022-08-17 08:16:29.401477: I tensorflow/cc/saved_model/reader.cc:81] Reading meta graph with tags { serve }
2022-08-17 08:16:29.401500: I tensorflow/cc/saved_model/reader.cc:122] Reading SavedModel debug info (if present)
Metal device set to: Apple M1
systemMemory: 16.00 GB
maxCacheSize: 5.33 GB
2022-08-17 08:16:29.411250: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-08-17 08:16:29.411398: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-08-17 08:16:29.422914: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-08-17 08:16:29.427004: I tensorflow/cc/saved_model/loader.cc:227] Restoring SavedModel bundle.
2022-08-17 08:16:29.429435: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2022-08-17 08:16:29.440099: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.

Related

How to Enable Mixed precision training

i'm trying to train a deep learning model on vs code so i would like to use the GPU for that. I have cuda 11.6 , nvidia GeForce GTX 1650, TensorFlow-gpu==2.5.0 and pip version 21.2.3 for windows 10. The problem is whenever i run this part of code i've got this error : Mixed precision training with AMP or APEX (--fp16 or --bf16) and half precision evaluation (--fp16_full_eval or --bf16_full_eval) can only be used on CUDA devices.
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir=new_output_models_dir,
#output_dir="dev/",
group_by_length=True,
per_device_train_batch_size=16,
gradient_accumulation_steps=2,
#dataloader_num_workers = 1,
dataloader_num_workers = 0,
evaluation_strategy="steps",
num_train_epochs=40,
fp16=True,
save_steps=400,
eval_steps=400,
logging_steps=400,
learning_rate=1e-4,
warmup_steps=500,
save_total_limit=2,
)
I've also tested whether tensorflow can access a gpu and whether tensorflow was built with cuda gpu support using tf.config.list_physical_devices('GPU') and tf.test.is_built_with_cuda() and both of them return TRUE . How to slove this issue ? and why i'm getting this error ? Any ideas !
The above error suggests that it does not accept fp16=True/bf16=True in non-GPU mode. Perhaps Cuda 11.6 might be an issue here which has stability issues.
Test with Cuda 11.2 and CudNN 8.1 . If that does not work you can go with fp16=False parametre.
Ref - https://www.tensorflow.org/install/source#gpu

Tensorflow CUDA fails with error "failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED"

Here is some of my console output. I am unsure what is the actually problem. When this is displayed I get a windows prompt stating Python.exe has stop working with the cause being ucrtbase.dll, but I've tried updating that and it still happens so I think that is the result of the real problem. Also I am notified by a taskbar message that my Nvidia Kernal Driver crashed, but recovered.
2017-11-04 17:48:17.363024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-04 17:48:17.375024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:955
Found device 0 with properties:
name: Quadro K1100M
major: 3 minor: 0 memoryClockRate (GHz) 0.7055
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.93GiB
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-11-04 17:48:20.018175: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0)
2017-11-04 17:49:35.796510: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.93GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:54] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:59] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:2045] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
If you're still looking for the answer, try reducing the batch size. I'm not entirely sure what is happening with the error (there's no explanation on github either), but reducing the batch size helped me

TensorFlow Norm (LRN) doesn't support GPU

I am running following code on Google Cloud ML using BASIC GPU (Tesla K80)
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10
LRN is taking the most amount of time and its running on CPU. I am wondering if following stats quoted in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py were obtained by running on CPU because I don't see thats the case.
System | Step Time (sec/batch) | Accuracy
1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours)
If I force it to run it with GPU it throws following error:
Cannot assign a device to node 'norm1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. [[Node: norm1 = LRNT=DT_HALF, alpha=0.00011111111, beta=0.75, bias=1, depth_radius=4, _device="/device:GPU:0"]

tensorflow distributed training w/ estimator + experiment framework

Hi I have a wield situation when trying to use estimator + experiment class for distributed training.
Here's an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829
This is a simple case that uses
DNNClassifier from TF official tutorial
Experiment framework
1 worker and 1 ps on the same host with different ports.
What happens is
1) when I start ps job, it looks good:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
2) when I start worker job, the job silently exits, leaving no log at all.
Eagerly seeking help.
I have same problem and I finally get the solution.
The problem is in config._environment
config = {"cluster": {'ps': ['127.0.0.1:9000'],
'worker': ['127.0.0.1:9001']}}
if args.type == "worker":
config["task"] = {'type': 'worker', 'index': 0}
else:
config["task"] = {'type': 'ps', 'index': 0}
os.environ['TF_CONFIG'] = json.dumps(config)
config = run_config.RunConfig()
config._environment = run_config.Environment.CLOUD
Set config._environment as Environment.CLOUD.
Then you can have distributed training system.
I hope it makes you happy :)
I have the same issue, it's due to some internal tensorflow code I guess, I've opened a question on SO already for this: TensorFlow: minimalist program fails on distributed mode.
I also opened a pull request: https://github.com/tensorflow/tensorflow/issues/8796.
There are two options to solve your issue. As this is due to your ClusterSpec having implicit local environment, you could try set another one (either google or cloud), but I cannot assure you that the rest of your work won't be impacted. So I prefered to have a glance at the code and try fix it myself for local mode, which is why I explain bellow.
You'll see explanations of why it fails in those posts more precisely, the fact is Google has been pretty silent so far so what I did is that I patched their source code (in tensorflow/contrib/learn/python/learn/experiment.py):
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
config.environment != run_config.Environment.GOOGLE and
config.cluster_spec and config.master):
self._start_server()
(this part prevents server from starting in local mode, which is yours if you set none in your cluster spec, so you should simply comment config.environment != run_config.Environment.LOCAL and and that should work).

Using multiple gpus on windows using theano,keras

I am a beginner in deep learning/theano/keras.I'm trying to figure out how to use multiple gpus on windows 7. I've had success installing Theano,keras(as described in this post How do I install Keras and Theano in Anaconda Python on Windows?) and using one gpu. I want to use both my gpus
Following are the details of configs and versions
Python - 2.7(Anaconda-4.3.14,Windows-64bit)
,CUDA - 7.5.17
,Theano - 0.9.0rc3
,keras - 1.2.2
,pycuda - 2016.1.2+cuda7518
,gpu - Geforce GTX 480(2 of them)
Theano configuration is as below
.theanorc.txt
[global]
floatX = float32
device = gpu
[nvcc]
flags=-LC:\ProgramData\Anaconda2\libs
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
[lib]
cnmem=0.8
Currently I'm able to use only one GPU and I am getting memory error as below when I try to fit the model
MemoryError: ('Error allocating 411041792 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).', "you might consider using 'theano.shared(..., borrow=True)'")
Does using 2 gpus solve the problem(if yes, how do I enable the second one?)
or is my model too big ?
Thank You