Running on tensorflow, the moment that I try and run two jobs in parallel with any gpu related jobs, it flags and gives me an error regarding cuDNN module being not available. Is there any method where I can run two jobs in parallel (maybe at the cost of increased time length)?
Related
I'm running machine learning (ML) jobs that make use of very little GPU memory.
Thus, I could run multiple ML jobs on a single GPU.
To achieve that, I would like to add multiple lines in the gres.conf file that specify the same device.
However, it seems the slurm deamon doesn't accept this, the service returning:
fatal: Gres GPU plugin failed to load configuration
Is there any option I'm missing to make this work?
Or maybe a different way to achieve that with SLURM?
It is kind of smiliar to this one, but this one seems specific to some CUDA code with compilation enabled. Something which seems way more specific than my general case (or at least as far as I understand).
How to run multiple jobs on a GPU grid with CUDA using SLURM
I don't think you can oversubscribe GPUs, so I see two options:
You can configure the CUDA Multi-Process Service or
pack multiple calculations into a single job that has one GPU and run them in parallel.
Besides nVidia MPS mentioned by #Marcus Boden, which is relevant for V100 types of cards, there also is Multi-Instance GPU which is relevant for A100 types of cards.
I am new to Spark SQL queries and trying to understand it's working under the hood.
I have come across the term "Core" in the Spark vocabulary but still struggling to get a hold on the same.
I know that - 1 core = 1 task.
My questions -
Can anyone please explain what exactly does a core mean ?
Does Spark UI show the number of cores currently allocated for my job ? If yes,
then where can I see it ?
If I find in the Spark UI that the number of tasks running is less, is
there a way to increase the number of cores allocated for my job, so
that Spark can submit more tasks and make my job run faster ?
Please advise.
Yes, you are right in a way.
In spark task are distributed across executors, on each executor number of task running is equal to the number of cores on that executors. So basically core is something that is going to execute your task. The task here is the most granular work that needs to be carried out.
JOB=>STAGE=>TASK
Yes, spark UI shows you the number of the task currently running on your every executor. You can check them under the Executors tab. This tab shows you a very detailed view of your task allocation against the number of cores available and a lot of other details.
Yes, you can increase the number of cores. You can do that by passing the argument in the spark-submit command.
--executor-cores n
Here n is the number of cores you want. For optimum usage, it should be 5.
It is not necessary that more than the number of cores faster your job will run.
Your task needs to be distributed equally across all the cores available to run faster.
If you provide more cores than required they will remain idle most of the time.
As far as I know,
GPU can simultaneously run thousands of threads that belong to one GPU kernel process, and that is the reason why it called SIMT.
However, can GPGPU simultaneously run multiple threads that belong to different GPU kernel processes? If possible, does it mean that those threads can run on one of GPU cores, and multiple cores can run simultaneously? Or does it mean that even one GPU core can run multiple threads belongs to different kernels?
Or simply, is it only possible to run threads that belong to the same GPU kernel process at a time on the entire GPU cores?
Is there a way to oversubscribe GPUs on Slurm, i.e. run multiple jobs/job steps that share one GPU? We've only found ways to oversubscribe CPUs and memory, but not GPUs.
We want to run multiple job steps on the same GPU in parallel and optionally specify the GPU memory used for each step.
The easiest way of doing that is to have the GPU defined as a feature rather than as a gres so Slurm will not manage the GPUs, just make sure that job that need one land on nodes that offer one.
By default, Bazel runs tests in a parallel fashion to speed things up. However, I have a resource (GPU) that can't handle parallel jobs due to the GPU memory limit. Is there a way to force Bazel to run tests in a serial, i.e., non-parallel way?
Thanks.
--jobs 1 will limit the number of parallel jobs Bazel runs to 1.
You can also modify the test targets and add tags = ["exclusive"] to prevent specific test to run in parallel (see http://bazel.io/docs/test-encyclopedia.html).
Use --local_test_jobs=1 to only run a single test job at a time locally.
The max number of local test jobs to run concurrently. Takes an integer, or a keyword ("auto", "HOST_CPUS", "HOST_RAM"), optionally followed by an operation ([-|]) eg. "auto", "HOST_CPUS.5". 0 means local resources will limit the number of local test jobs to run concurrently instead. Setting this greater than the value for --jobs is ineffectual
tags = ["exclusive"] has other complications to consider with respect to caching.
--jobs will serialize the entire build process, not just testing, so it's less than ideal.
There are 2 resources Bazel will respect limitations upon: RAM and CPU. You may hijack one (Probably RAM) to represent GPU(s) as they're available to a run and required by a test. (I've stopped short of doing this for a limited hardware resource because it feels to inelegant, but I can't think of a reason it shouldn't work.)
Future releases of Bazel should support extra resources like GPUs
and releases that contain that change should support extra resource tags like "resources:GPU:1" when --local_extra_resources=gpu=1 is set. This should enable GPU tests to be bound by a limited quantity of GPUs, and for them to run non-exclusively and without limiting the total number of --jobs or "test_jobs"