TensorFlow Local Response Normalization (LRN) doesn't support GPU - tensorflow

I am running the following code on Google Cloud ML using the BASIC_GPU scale tier (Tesla K80):
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10
LRN (local response normalization) is taking the most time, and it's running on the CPU. I am wondering whether the following stats quoted in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py were obtained by running on the CPU, because what I'm seeing doesn't match them.
System       | Step Time (sec/batch) | Accuracy
1 Tesla K20m | 0.35-0.60             | ~86% at 60K steps (5 hours)
If I force it to run on the GPU, it throws the following error:
Cannot assign a device to node 'norm1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. [[Node: norm1 = LRN[T=DT_HALF, alpha=0.00011111111, beta=0.75, bias=1, depth_radius=4, _device="/device:GPU:0"]]]
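The T=DT_HALF in the node attributes suggests the GPU LRN kernel is simply not registered for fp16. Below is a minimal sketch of two workarounds, assuming TF 1.x graph mode as in the tutorial (lrn_fp32 is a hypothetical helper name, not part of the tutorial code):

import tensorflow as tf  # TF 1.x, as used by the CIFAR-10 tutorial

def lrn_fp32(x):
    # Run LRN in float32 even when the rest of the model uses fp16,
    # since the error indicates no GPU kernel is registered for DT_HALF.
    y = tf.nn.lrn(tf.cast(x, tf.float32), depth_radius=4, bias=1.0,
                  alpha=0.001 / 9.0, beta=0.75, name='norm1')
    return tf.cast(y, x.dtype)

# Or let unsupported ops fall back to the CPU instead of failing placement:
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
# sess = tf.Session(config=config)

Casting keeps norm1 on the GPU; soft placement keeps the fp16 graph but leaves LRN on the CPU, which matches the slow behaviour observed above.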

Related

GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`

model_nbeats = NBEATSModel(
    input_chunk_length=30,
    output_chunk_length=7,
    generic_architecture=True,
    num_stacks=10,
    num_blocks=1,
    num_layers=4,
    layer_widths=512,
    n_epochs=100,
    nr_epochs_val_period=1,
    batch_size=800,
    model_name="nbeats_run",
)
model_nbeats.fit(train, val_series=val, verbose=True)
Hi,
While running the code above, "PossibleUserWarning: GPU available but not used. Set accelerator and devices using Trainer(accelerator='gpu', devices=1)." appears in the output cell of Google Colab. How can I fix it?
I have tried darts.nbeats to fit the training data for 100 epochs and selected the GPU to accelerate training. However, training was not accelerated and the aforementioned warning appeared.
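A likely fix, sketched under the assumption that your darts version forwards pl_trainer_kwargs to the underlying PyTorch Lightning Trainer (recent darts releases do), is to request the GPU when constructing the model:

from darts.models import NBEATSModel

# Assumption: darts passes pl_trainer_kwargs straight to the Lightning Trainer.
model_nbeats = NBEATSModel(
    input_chunk_length=30,
    output_chunk_length=7,
    n_epochs=100,
    batch_size=800,
    model_name="nbeats_run",
    pl_trainer_kwargs={"accelerator": "gpu", "devices": 1},  # use the Colab GPU
)
model_nbeats.fit(train, val_series=val, verbose=True)

With this in place the warning should disappear, and nvidia-smi in Colab should show the training process on the GPU.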

AWS GPU OOM issue with ONNX Runtime and CUDA

I am running predictions on an AWS GPU instance, g4dn.4xlarge (16 GB GPU memory, 64 GB CPU memory), deployed with Kubernetes and Docker.
Tested with (CUDA 10.1 + onnxruntime-gpu==1.4.0) and (CUDA 10.2 + onnxruntime-gpu==1.6.0); same error in both cases. The models are customised for our purposes, so I can't point to the weights.
The problem:
I am getting a CUDA OOM (out of memory) error:
Error: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'Conv_16' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:298 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 33554432
On some backtracking:
Using nvidia-smi and GPU memory profiling, I found that from the first prediction onwards a constant amount of GPU memory stays blocked: at least ~1.8 GB, and ~3 GB for some models (I think it is reserved per process). Releasing the memory doesn't help, because the next prediction blocks the same amount again.
My understanding:
At peak we scale up to 22 pods, and each pod initialises its own model load, so every pod blocks 1.8-3 GB of memory on the single 16 GB GPU instance. With 22 pods, OOM is therefore expected.
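One mitigation, sketched under the assumption of onnxruntime-gpu >= 1.8 (where the CUDA execution provider options are exposed in Python; the model path below is hypothetical), is to cap the arena each session may reserve so the pods' combined reservations fit into the 16 GB:

import onnxruntime as ort

# Assumption: onnxruntime-gpu >= 1.8 exposes CUDAExecutionProvider options in Python.
providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "gpu_mem_limit": 2 * 1024 ** 3,               # cap the BFC arena at ~2 GB per session
        "arena_extend_strategy": "kSameAsRequested",  # grow only as much as actually requested
    }),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)  # hypothetical model path

This limits how much each pod grabs up front, but it does not remove the per-process CUDA context overhead, so the number of pods sharing one GPU still has to be bounded.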
What is confusing:
The CUDA message above reports OOM, but GPU profiling shows memory utilisation never going above 50%, although SM (streaming multiprocessor) utilisation is 100% at peak (when the pods scale to 22). Image attached for reference.
From my research I understood that SM utilisation has nothing to do with OOM and that CUDA handles the SMs efficiently. So why am I getting a CUDA OOM error when only 50% of the memory is utilised?
Ruled out:
I ruled out a memory leak in the model, as it runs without OOM errors when the load is low.
Why GPU and not CPU for prediction:
I want faster predictions. The same workload ran on CPU without any error, even under high load.
What I am looking for:
A way to scale the AWS GPU instances on the basis of GPU memory. If OOM is the cause, scaling on GPU memory should solve the problem, but I can't find how to do it.
An explanation of the CUDA message: why OOM when memory is still available?
And, very hypothetically: a way to create a singleton model-load object, by design or via Kubernetes, so that the scaled-up pods reuse it for prediction rather than each creating a new server. But that would defeat the purpose of using k8s for availability and scalability.

How to effectively use the TFRC program with GCP AI Platform Jobs

I'm trying to run a hyperparameter tuning job on GCP's AI Platform Jobs service. The TensorFlow Research Cloud (TFRC) program approved for me:
100 preemptible Cloud TPU v2-8 device(s) in zone us-central1-f
20 on-demand Cloud TPU v2-8 device(s) in zone us-central1-f
5 on-demand Cloud TPU v3-8 device(s) in zone europe-west4-a
I have already built a custom model on TensorFlow 2, and I want to run the job in a specific zone so I can take advantage of the TFRC program together with the AI Platform Jobs service. Right now I have a YAML config file that looks like this:
trainingInput:
  scaleTier: basic-tpu
  region: us-central1
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: val_accuracy
    maxTrials: 100
    maxParallelTrials: 16
    maxFailedTrials: 30
    enableTrialEarlyStopping: True
In theory, running 16 parallel trials, each on a separate TPU instance, should work. Instead, the request exceeds the TPU_V2 quota and returns an error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project ###################. The request for 128 TPU_V2 accelerators for 16 parallel runs exceeds the allowed maximum of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 accelerators.
Then I reduced maxParallelTrials to 2 and it worked, which, given the error message, confirms that the quota is counted per TPU core (16 parallel trials × 8 cores per v2-8 device = 128 TPU_V2 cores, well above the allowed 16), not per TPU instance.
So I thought that maybe I had completely misunderstood the approved TFRC quota, and I went on to check whether the job was using the us-central1-f zone. It turns out it is using an unwanted zone:
-tpu_node={"project": "p091c8a0a31894754-tp", "zone": "us-central1-c", "tpu_node_name": "cmle-training-1597710560117985038-tpu"}"
That behaviour prevents me from using the approved free quota effectively, and if I understand correctly, a job running in us-central1-c consumes my account's credits instead of the free resources. So I wonder whether there is a way to set the zone for an AI Platform job, and whether it is possible to pass a flag to use preemptible TPUs.
Unfortunately the two can't be combined.

Order of CUDA devices [duplicate]

This question already has answers here:
How does CUDA assign device IDs to GPUs?
(4 answers)
Closed 4 years ago.
I saw this solution, but it doesn't quite answer my question; it's also quite old so I'm not sure how relevant it is.
I keep getting conflicting outputs for the order of the GPUs. There are two of them: a Tesla K40m and an NVS 315 (a legacy device that is never used). When I run deviceQuery, I get
Device 0: "Tesla K40m"
...
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device 1: "NVS 315"
...
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
On the other hand, nvidia-smi produces a different order:
0 NVS 315
1 Tesla K40m
Which I find very confusing. The solution I found for TensorFlow (and a similar one for PyTorch) is to use
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
The PCI bus ID is 4 for the Tesla and 3 for the NVS, so this should select the NVS (bus 3), is that right?
In PyTorch, setting
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print(torch.cuda.get_device_name(0))
gives Tesla K40m, whereas setting
os.environ['CUDA_VISIBLE_DEVICES']='1'
device = torch.cuda.device(1)
print(torch.cuda.get_device_name(0))
gives
UserWarning:
Found GPU0 NVS 315 which is of cuda capability 2.1.
PyTorch no longer supports this GPU because it is too old.
warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
NVS 315
So I'm quite confused: what's the true order of GPU devices that tf and pytorch use?
By default, CUDA orders the GPUs by computing power. GPU:0 will be the fastest GPU on your host, in your case the K40m.
If you set CUDA_DEVICE_ORDER='PCI_BUS_ID', CUDA instead orders your GPUs by PCI bus ID, i.e. by how they are physically installed in the machine, so GPU:0 will be the card on the lowest bus ID (your first PCI-E slot).
Both TensorFlow and PyTorch use the CUDA GPU order. That is consistent with what you showed:
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print(torch.cuda.get_device_name(0))
Default order, so GPU:0 is the K40m, since it is the most powerful card on your host.
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print(torch.cuda.get_device_name(0))
PCI-E bus order, so GPU:0 is the card with the lowest bus ID, in your case the NVS.
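A quick way to verify, as a minimal sketch (the environment variables must be set before CUDA is initialised, i.e. before the first call that touches the GPU):

import os
# Must be set before CUDA is initialised, otherwise the ordering has no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
# Print the visible devices in the order PyTorch will use them.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

Run it once with and once without the CUDA_DEVICE_ORDER line to see the two orderings side by side.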

Is TensorFlow (GPU version) compatible with laptops using Nvidia Quadro P1000, P2000 and 1050Ti?

I am going to buy a laptop to do some TF work. Is the GPU version of TF able to take advantage of Nvidia Quadro P1000 and P2000? Will it run faster on these two GPUs than on the mobile version of 1050Ti? Thanks
If I am correct, TensorFlow can run on any Nvidia device that supports CUDA (with a high enough compute capability).
Check this website for their compute capabilities:
https://developer.nvidia.com/cuda-gpus
There you can see the compute capability of each Nvidia GPU card.
As for the three cards you ask about (P1000, P2000, GeForce 1050 Ti): they all have the same compute capability, 6.1, so in terms of supported GPU features they won't differ.
But from their datasheet (P2000, P1000, 1050ti):
------------------------------------------------------------
|        | Memory     | Memory Interface | Memory Bandwidth |
------------------------------------------------------------
| P1000  | 4 GB GDDR5 | 128-bit          | 82 GB/s          |
| P2000  | 5 GB GDDR5 | 160-bit          | 140 GB/s         |
| 1050Ti | 4 GB GDDR5 | 128-bit          | 112 GB/s         |
------------------------------------------------------------
I would say, P2000 > 1050Ti > P1000
BTW, what does that 6.1 number mean? Basically, it indicates which operations and features the hardware supports. You can find the details on the CUDA GPUs page linked above and in similar discussions elsewhere.
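If you want to confirm what TensorFlow actually sees on a given machine, here is a minimal sketch (assuming TF 2.4+, where get_device_details reports the compute capability):

import tensorflow as tf

# List the GPUs TensorFlow can use and report each one's compute capability.
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)
for gpu in gpus:
    details = tf.config.experimental.get_device_details(gpu)
    print(details.get("device_name"), "- compute capability:", details.get("compute_capability"))

On any of the three cards above this should report (6, 1).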