Kubernetes and GPU node cluster implementation practices - tensorflow

I am trying to understand K8s GPU practices better by implementing a small K8s GPU cluster that is supposed to work as described below.
This is going to be a slightly long explanation, but I hope it helps to have many questions in one place for understanding GPU practices in Kubernetes better.
Application Requirement
I want to create an autoscaling K8s cluster.
Pods run a model, say a TensorFlow-based deep learning program.
Each pod waits for a message to appear in a Pub/Sub queue and proceeds with its execution once it receives one.
Now, a message is queued in the Pub/Sub queue.
As soon as the message is available, a pod reads it and executes the deep learning program.
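To make the intended worker behaviour concrete, here is a minimal sketch of the loop each pod could run; this is only an assumption of how the consumer might look, the subscription name train-requests and the script path /app/mnist.py are hypothetical placeholders, and polling with gcloud is just one possible way to consume the queue:
#!/bin/bash
# Hypothetical worker loop: wait for a Pub/Sub message, then run the training program once.
while true; do
  MSG=$(gcloud pubsub subscriptions pull train-requests --auto-ack --limit=1 --format="value(message.data)")
  if [ -n "$MSG" ]; then
    python /app/mnist.py      # the TensorFlow program baked into the image
  else
    sleep 10                  # nothing queued; poll again after a short wait
  fi
done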
Cluster requirement
If no message is present in the queue and none of the GPU-based pods are executing a program (i.e., no GPU is in use), then the GPU node pool should scale down to 0.
Design 1
Create a GPU node pool. Each node contains N GPUs, where N >= 1.
Assign a model-trainer pod to each GPU, i.e., a 1:1 mapping of pods to GPUs.
I tried assigning 2 pods to a 2-GPU machine, where each pod is supposed to run an MNIST program.
What I noticed:
One pod got allocated, executed the program, and later went into a crash loop. Maybe I am making some mistake, as my Docker image is supposed to run the program only once; I was just doing a feasibility test of running 2 pods simultaneously on the 2 GPUs of the same node. Below is the error:
Message | Reason | First Seen | Last Seen | Count
Back-off restarting failed container | BackOff | Jun 21, 2018, 3:18:15 PM | Jun 21, 2018, 4:16:42 PM | 143
pulling image "nkumar15/mnist" | Pulling | Jun 21, 2018, 3:11:33 PM | Jun 21, 2018, 3:24:52 PM | 5
Successfully pulled image "nkumar15/mnist" | Pulled | Jun 21, 2018, 3:12:46 PM | Jun 21, 2018, 3:24:52 PM | 5
Created container | Created | Jun 21, 2018, 3:12:46 PM | Jun 21, 2018, 3:24:52 PM | 5
Started container | Started | Jun 21, 2018, 3:12:46 PM | Jun 21, 2018, 3:24:52 PM | 5
The other pod did not get assigned to a GPU at all. Below is the message from the pod events:
0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
Design 2
Have multiple GPU machines in the GPU node pool, with each node having only 1 GPU.
K8s will assign each pod to an available GPU in a node, and hopefully there won't be any issue. I have yet to try this.
Questions
Is there any suggested practice for designing the above type of system in Kubernetes as of version 1.10?
Is the Design 1 approach not feasible as of the 1.10 release? For example, if I have a 2-GPU node with 24 GB of GPU memory, can K8s assign 1 pod to each GPU, with each pod executing its own workload under a 12 GB memory limit?
How do I scale the GPU node pool down to size 0 through the autoscaler?
In Design 2, what if I run out of GPU memory? Currently in GCP, a node with 1 GPU doesn't have more than 16 GB of GPU memory.
Again, apologies for such a long question, but I hope it will help others as well.
Updates
For question 2:
I created a new cluster to reproduce the same issue I had faced multiple times before. I am not sure what changed this time, but the 2nd pod was successfully allocated a GPU. With this result I can confirm that a 1-GPU-to-1-pod mapping is allowed on a single multi-GPU node.
However, restricting memory per GPU process is not feasible as of 1.10.

Both designs are supported in 1.10. I view Design 2 as a special case of Design 1: you don't necessarily need to have 1 GPU per node. If your pod needs more GPUs and memory, you can have multiple GPUs per node, as you mentioned in question (4). I'd go with Design 1 unless there's a reason not to.
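For the 1:1 pod-to-GPU mapping itself, each pod just requests one nvidia.com/gpu in its resource limits. A minimal sketch (the pod name and restart policy are illustrative; the image is the one from the question):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mnist-trainer-0            # illustrative name
spec:
  restartPolicy: OnFailure         # a run-once training container exits when done, so don't restart it forever
  containers:
  - name: mnist
    image: nkumar15/mnist
    resources:
      limits:
        nvidia.com/gpu: 1          # the scheduler places the pod on a node with a free GPU
EOF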
I think the best practice would be to create a new cluster with no GPUs (a cluster has a default node pool), and then create a GPU node pool and attach it to the cluster. Your non-GPU workload can run in the default pool, and the GPU workload can run in the GPU pool. To support scaling down to 0 GPU nodes, you need to set --num-nodes and --min-nodes to 0 when creating the GPU node pool.
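A rough sketch of the gcloud commands (cluster and pool names, zone, accelerator type, and node counts are all illustrative; see the docs below for the exact flags):
# Cluster with a default, non-GPU node pool
gcloud container clusters create my-cluster --zone us-central1-a --num-nodes 1
# GPU node pool that can scale down to 0 when idle
gcloud container node-pools create gpu-pool \
  --cluster my-cluster --zone us-central1-a \
  --accelerator type=nvidia-tesla-k80,count=2 \
  --num-nodes 0 \
  --enable-autoscaling --min-nodes 0 --max-nodes 4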
Docs:
Create a cluster with no GPUs: https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-cluster#creating_a_cluster
Create a GPU node pool for an existing cluster: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#gpu_pool

Related

Get mapping between /dev/nvidia* and nvidia-smi gpu list

A server with 4 GPUs is used for deep learning.
It often happens that GPU memory is not freed after the training process has been terminated (killed). The results shown by nvidia-smi are:
[image: nvidia-smi results]
CUDA device 2 is in use (possibly by a process launched with CUDA_VISIBLE_DEVICES=2).
Some sub-processes are still alive and thus occupy the memory.
One brute-force solution is to kill all processes created by Python using:
pkill -u user_name python
This should be helpful if there is only one process to be cleaned up.
Another solution is proposed in the official PyTorch thread "My GPU memory isn't freed properly". One may find the offending processes via:
ps -elf | grep python
However, if multiple processes are running and we only want to kill the ones related to a certain GPU, we can group processes by the GPU device file (nvidia0, nvidia1, ...) using:
fuser -v /dev/nvidia*
[image: fuser -v results]
As we can see, /dev/nvidia3 is used by some Python threads, so /dev/nvidia3 corresponds to CUDA device 2.
The problem is: I want to kill certain processes launched with CUDA_VISIBLE_DEVICES=2, but I do not know the corresponding device file (/dev/nvidia0, /dev/nvidia1, ...).
How do I find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}?
If you set the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the device order should be consistent between CUDA and nvidia-smi.
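For example (train.py is just a placeholder for the training script):
# With PCI_BUS_ID ordering, CUDA device 2 is the same physical card that nvidia-smi
# lists as GPU 2 (and, typically, the one /dev/nvidia2 refers to).
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=2 python train.py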
There is also another option (if you are sure that the restriction to a specific GPU is done via the CUDA_VISIBLE_DEVICES env var). Every process's environment can be examined in /proc/${PID}/environ. The format is partially binary, but grepping through the output usually works (if you force grep to treat the file as a text file). This might require root privileges.
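A small sketch combining the two approaches to find which PID was started with CUDA_VISIBLE_DEVICES=2 (replace ${PID} with a PID reported by fuser):
# List the PIDs holding each device file
fuser -v /dev/nvidia*
# Print a candidate PID's NUL-separated environment as text and look for the variable
tr '\0' '\n' < /proc/${PID}/environ | grep CUDA_VISIBLE_DEVICES
# or, equivalently, force grep to treat the file as text:
grep -a CUDA_VISIBLE_DEVICES /proc/${PID}/environ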

Flink job on EMR runs only on one TaskManager

I am running EMR cluster with 3 m5.xlarge nodes (1 master, 2 core) and Flink 1.8 installed (emr-5.24.1).
On master node I start a Flink session within YARN cluster using the following command:
flink-yarn-session -s 4 -jm 12288m -tm 12288m
That is the maximum memory and number of slots per TaskManager that YARN lets me set up based on the selected instance types.
During startup there is a log:
org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=12288, taskManagerMemoryMB=12288, numberTaskManagers=1, slotsPerTaskManager=4}
This shows that there is only one TaskManager. Also, when looking at the YARN NodeManager, I see that there is only one container running on one of the core nodes. The YARN ResourceManager shows that the application is using only 50% of the cluster.
With the current setup I would assume that I can run a Flink job with parallelism set to 8 (2 TaskManagers * 4 slots), but if the submitted job has its parallelism set to more than 4, it fails after a while because it cannot get the desired resources.
If the job parallelism is set to 4 (or less), the job runs as it should. Looking at CPU and memory utilisation with Ganglia shows that only one node is utilised while the other stays flat.
Why does the application run on only one node, and how can I utilise the other node as well? Do I need to set up something in YARN so that it provisions Flink on the other node as well?
In previous versions of Flink there was a startup option, -n, which was used to specify the number of TaskManagers. That option is now obsolete.
When you start a 'Session Cluster', you should see only one container, which is used for the Flink JobManager. This is probably what you see in the YARN ResourceManager. Additional containers will automatically be allocated for TaskManagers once you submit a job.
How many cores do you see available in the ResourceManager UI?
Don't forget that the JobManager also uses cores out of the available 8.
You need to do a little math here.
For example, if you had set the number of slots to 2 per TM and used less memory per TM, then a job submitted with a parallelism of 6 should have worked with 3 TMs.
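In other words, something along these lines might spread the work across both core nodes (the memory values are illustrative and would need tuning to what YARN actually offers per container):
# 2 slots per TaskManager and smaller TM containers, so YARN can fit several of them
flink-yarn-session -s 2 -jm 2048m -tm 5120m -d
# A job submitted with parallelism 6 then needs 3 TaskManagers (3 x 2 slots)
flink run -p 6 my-job.jar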

reset memory usage of a single GPU

I have access to 4 GPUs (I am not a root user). One of the GPUs (no. 2) behaves weirdly: some of its memory is blocked, but the power consumption and temperature are very low (as if nothing is running on it). See the details from nvidia-smi in the image below:
How can I reset the GPU 2 without disturbing the processes running on the other GPUs?
PS: I am not a root user, but I think I can get hold of a root user if needed.
Resetting the GPU can resolve your problem; however, depending on your GPU configuration it may not be possible:
nvidia-smi --gpu-reset -i "gpu ID"
For example, if you have NVLink enabled between GPUs, the reset does not always go through. It also seems that, in your case, nvidia-smi is unable to find the process running on your GPU. The solution then is to find and kill the process associated with that GPU by running the following commands; fill in the PID with the one you find via fuser:
fuser -v /dev/nvidia*
kill -9 "PID"
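Putting the two together, a minimal sketch that targets only the processes holding GPU 2's device file (double-check the listed PIDs before killing anything; this may require root, and /dev/nvidia2 may not map to nvidia-smi's GPU 2 on every system):
# List the PIDs currently using the device file for GPU 2
fuser -v /dev/nvidia2
# Kill only those PIDs, leaving processes on the other GPUs untouched
for pid in $(fuser /dev/nvidia2 2>/dev/null); do kill -9 "$pid"; done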

Why is Spark2 running on only one node?

I am running Spark2 from Zeppelin (0.7 in HDP 2.6) and doing an IDF transformation which crashes after many hours. It runs on a cluster with a master and 3 data nodes: s1, s2 and s3. All nodes have a Spark2 client, and each has 8 cores and 16 GB of RAM.
I just noticed it is only running on one node, s3, with 5 executors.
In zeppelin-env.sh I have set zeppelin.executor.instances to 32 and zeppelin.executor.mem to 12g and it has the line:
export MASTER=yarn-client
I have set yarn.resourcemanager.scheduler.class to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
I also set spark.executor.instances to 32 in the Spark2 interpreter.
Anyone have any ideas what else I can try to get the other nodes doing their share?
The answer is that it was my own mistake: only s3 had the DataNode and NodeManager installed. Hopefully this might help someone.
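For anyone hitting the same symptom, a quick sanity check (run on the cluster) is to confirm that every worker node actually registered with YARN and HDFS:
# Should list one NodeManager per worker node (s1, s2 and s3)
yarn node -list -all
# Should report one live DataNode per worker node
hdfs dfsadmin -report | grep -i 'live datanodes'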

Synchronous distributed tensorflow training runs asynchronously

System Information:
Debian 4.5.5
TF installed from binary (pip3 install tensorflow-gpu==1.0.1 --user)
TF version: v1.0.0-65-g4763edf-dirty 1.0.1
Bazel version: N.A.
CUDA 8.0 cuDNN v5.1
Steps to reproduce
Make a directory and download the following files into it:
training.py run.sh
Run the command ./run.sh to reproduce this issue.
Detailed descriptions for the bug
Recently, I tried to deploy synchronous distributed TensorFlow training on the cluster. I followed the tutorial and the Inception example to write my own program. The training.py is from another user's implementation, which follows the same API usage as the official example. I modified it so it can run on a single machine with multiple GPUs, by making the processes communicate through localhost and mapping each worker to see only one GPU.
The run.sh launches three processes. One of them is the parameter server and the other two are workers implemented with between-graph replication. I created the training supervisor with tf.train.Supervisor() to manage the multiple sessions in the distributed training for initialization and synchronization.
I expected these two workers to synchronize on each batch and work within the same epoch. However, worker 0, which is launched before worker 1, completed the whole training set without waiting for worker 1. After that, the worker 0 process finished training and exited normally, while worker 1 behaved as if it had fallen into a deadlock, keeping near 0% CPU and GPU utilization for several hours.
Based on my observations, I suspect these two workers didn't communicate and synchronize at all on the data they processed. I am reporting this problem as a bug because I created the optimizer with tf.train.SyncReplicasOptimizer, as suggested by the official website and the Inception example. However, the synchronization behavior, if any, is very strange, and the program cannot exit normally.
Source code / logs
Two files:
training.py: This file contains the source code for the parameter server and workers created to use synchronous distributed optimizers (tf.train.SyncReplicasOptimizer).
run.sh: This file launches the parameter server and the workers (a rough sketch of what such a script might look like is shown below).
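For context, a hypothetical sketch of such a run.sh, assuming training.py accepts the flag names used in the official distributed-TensorFlow tutorial (--ps_hosts, --worker_hosts, --job_name, --task_index); the actual script may differ:
#!/bin/bash
PS_HOSTS=localhost:2222
WORKER_HOSTS=localhost:2223,localhost:2224
# Parameter server on the CPU, each worker pinned to a single GPU via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES= python training.py --ps_hosts=$PS_HOSTS --worker_hosts=$WORKER_HOSTS --job_name=ps --task_index=0 &
CUDA_VISIBLE_DEVICES=0 python training.py --ps_hosts=$PS_HOSTS --worker_hosts=$WORKER_HOSTS --job_name=worker --task_index=0 > worker_0_log 2>&1 &
CUDA_VISIBLE_DEVICES=1 python training.py --ps_hosts=$PS_HOSTS --worker_hosts=$WORKER_HOSTS --job_name=worker --task_index=1 > worker_1_log 2>&1 &
wait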
Log:
Please reproduce according to the steps above and look at worker_0_log and worker_1_log.