How to set threads based on number of cores available? - snakemake

I have a workflow that runs on different machines with different numbers of CPUs, and I'd like to be able to set up a rule that uses "all but N" cores. I.e. I'd like to be able to do:
threads: lambda cores: max(2, cores-4)
But I cannot find any way to access cores (i.e. the value passed to or inferred by -j/--cores on the command line) in my rules. Is there a way to do the above?

As @bli said in the comments, this should work:
from multiprocessing import cpu_count

rule test:
    output:
        'example.txt'
    threads:
        # the callable's argument is the wildcards object (unused here);
        # cpu_count() is the machine's CPU count, not the value of --cores
        lambda wildcards: max(2, cpu_count() - 4)
    shell:
        'cmd -{threads}'
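Note that cpu_count() reports the machine's CPU count rather than the value given to --cores. If your version of Snakemake exposes workflow.cores (the value passed to -j/--cores), a sketch closer to the original "all but N of the allotted cores" intent would be:
rule test:
    output:
        'example.txt'
    threads:
        # workflow.cores is the value given to --cores/-j (assumes a
        # Snakemake version that provides it)
        lambda wildcards: max(2, workflow.cores - 4)
    shell:
        'cmd -{threads}'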

Related

How to efficiently combine N files, two at a time

Based on the discussion in another question.
Some tools will only accept two input files at a time, but the final merged output requires combining N files. Examples include paste and some bed or vcf tools. Assume a list of samples is present and that the binary operation is associative: (a+b)+c == a+(b+c). The required merged output must be generated by repeatedly combining input and intermediate files. How can you efficiently merge the files?
The two solutions I will present are to sequentially combine input files and to recursively build intermediate files as a binary tree. For each, consider pasting together a few hundred samples with the following start of a snakefile:
ids = list('abcdefghijklmnopqrstuvwxyz')
samples = expand('{id1}{id2}', id1=ids, id2=ids)  # 676 samples: aa, ab, ac, ..., zz; need not be numbers

rule all:
    input: 'merged.txt'

rule generate_data:
    output: 'sample_{sample}.txt'
    shell:
        'echo {wildcards.sample} > {output}'
Sequential Solution
The sequential solution is fairly easy to remember and understand. You combine files 1 and 2 into a temporary file, then combine the temporary file with file 3, and so on up to file N. You can do this with a run directive and shell commands, but I will present it as just a shell directive:
rule merge:
    input:
        first_files=expand('sample_{sample}.txt', sample=samples[:2]),
        rest_files=expand('sample_{sample}.txt', sample=samples[2:])
    output: 'merged.txt'
    shell:
        'paste {input.first_files} > {output} \n'
        'for file in {input.rest_files} ; do '
        'paste {output} $file > {output}_tmp \n'
        'mv {output}_tmp {output} \n'
        'done '
Recursive Solution
The general idea behind the recursive solution is to combine files 1 and 2, 3 and 4, 5 and 6, ... in the first step, then combine those intermediate files until one merged file is left. The difficulty is that snakemake evaluates the DAG from the top down and the number of files may not be evenly divisible by 2.
rule merge:
    """Request final output from merged files 0 to N-1."""
    input:
        f'temp_0_{len(samples)-1}'
    output: 'merged.txt'
    shell:
        'cp {input} {output}'

def merge_intermediate_input(wildcards):
    """From start and end indices, request input files. Raises ValueError when indices are equal."""
    start, end = int(wildcards.start), int(wildcards.end)
    if start == end:  # perform link instead
        raise ValueError
    if start + 1 == end:  # base case
        return expand('sample_{sample}.txt',
                      sample=(samples[start], samples[end]))
    # default
    return [f'temp_{start}_{(start+end)//2}', f'temp_{(start+end)//2+1}_{end}']

rule merge_intermediate:
    """Solve subproblem, producing start to end."""
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    shell:
        'paste {input} > {output}'

def merge_base_input(wildcards):
    """Get input sample from index in list."""
    index = int(wildcards.start)
    return f'sample_{samples[index]}.txt'

rule merge_base:
    """Create temporary symbolic link for input file with start == end."""
    input: merge_base_input
    output: temp('temp_{start}_{start}')
    shell:
        'ln -sr {input} {output}'
merge_intermediate solves the subproblem of producing the merged file for indices start through end from the two merged files covering each half of the range. When start == end, the "merged" file is just a symbolic link to the corresponding input. When start + 1 == end, the base case merges the two input files at those indices directly. The recursive solution is clearly more code and more complex, but it can be more efficient for long-running or complex merge operations.
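To see the shape of the recursion, here is a small standalone helper (hypothetical, not part of the Snakefile) that prints the intermediate targets the recursive rules would request for five samples with indices 0..4:
def show_tree(start, end, depth=0):
    """Print the temp_{start}_{end} targets the recursive rules would build."""
    print('  ' * depth + f'temp_{start}_{end}')
    if end - start <= 1:               # base case: raw sample files are merged directly
        return
    mid = (start + end) // 2
    show_tree(start, mid, depth + 1)       # left half
    show_tree(mid + 1, end, depth + 1)     # right half

show_tree(0, 4)
# temp_0_4
#   temp_0_2
#     temp_0_1
#     temp_2_2   (start == end: produced by merge_base as a symlink)
#   temp_3_4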
Runtime Complexity, Performance
Let each of the N files have k lines, and let the runtime complexity of a single merge operation be O(f(n)) for inputs totalling n lines. In the sequential solution, the temporary file is rewritten N-1 times and its length grows as 2k, 3k, ..., Nk, for a total of k*N*(N+1)/2 - k lines written, i.e. roughly O(f(k N^2)).
For the recursive solution, in the first layer each pair of files is joined; each operation costs O(f(2k)) and there are N/2 such operations. Next, each pair of the resulting files is merged at a cost of O(f(4k)), with N/4 operations. Overall, about log2(N) layers of merges are required to produce the final output, again with N-1 merge operations in total. The complexity of the entire operation is O(f(k N log N)).
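As a rough back-of-the-envelope check (not a benchmark) for the 676-sample example above, counting the approximate number of lines written by each strategy:
import math

N, k = 676, 1                                    # 676 sample files, 1 line each
sequential = k * N * (N + 1) // 2 - k            # lines written: 228,825
layers = math.ceil(math.log2(N))                 # ~10 layers of pairwise merges
recursive = sum((N / 2**d) * (k * 2**d)          # ~N/2**d merges, each writing ~k*2**d lines
                for d in range(1, layers + 1))   # roughly k * N * log2(N)
print(sequential, int(recursive))                # 228825 vs roughly 6760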
In terms of overhead, the recursive solution launches N-1 snakemake jobs with any associated calls to the scheduler, activating environments, etc. The sequential version launches a single job and runs everything in a single shell process.
The recursive solution can also run with more parallelism: the jobs within each 'level' of the recursion are independent, allowing up to N/2 jobs to run at once, whereas the sequential solution needs the result of each previous step. There is an additional challenge with resource estimation for the recursive solution: the first merges process O(2k) lines while the last processes O(k N). The resources could be estimated dynamically or, if the merge step doesn't increase the resulting file size (e.g. intersecting regions), kept similar across jobs.
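As a sketch of the dynamic option (Snakemake accepts callables for resources; the mem_mb formula here is made up and would need tuning for real data), merge_intermediate could be extended like this:
rule merge_intermediate:
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    resources:
        # made-up linear model: a base amount plus a per-sample increment
        mem_mb=lambda wildcards: 100 + 2 * (int(wildcards.end) - int(wildcards.start) + 1)
    shell:
        'paste {input} > {output}'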
Conclusion
While the recursive solution offers better asymptotic runtime complexity, it introduces more snakemake jobs, temporary files, and more complex logic. The sequential solution is straightforward and contained in a single job, though it can be roughly N/log(N) times slower. Quick merge operations can be performed perfectly well with the sequential solution, and the runtime won't be much worse until N is quite large. However, if merging takes tens of minutes or longer, depends on the input file sizes, and produces outputs larger than the inputs (e.g. cat, paste, and similar), the recursive solution may offer better performance and a significantly shorter wall-clock time.

How to get the current consumed CPU % of a VMHost in vCenter using PowerShell

How can I get the current consumed CPU % of a VMHost in vCenter using a PowerShell script?
The command below doesn't give output similar to what we see when checking manually.
Get-Stat -Entity $command1 -Stat cpu.usagemhz.average -Realtime -MaxSamples 1
Get-Stat -Entity $myHost -Stat cpu.usage.average -Realtime -MaxSamples 1 -Instance ""
From VMware's doc on this cpu usage perf counter:
Actively used CPU, as a percentage of the total available CPU, for
each physical CPU on the host. Active CPU is approximately equal to
the ratio of the used CPU to the available CPU.
Available CPU = # of physical CPUs × clock rate.
100% represents all CPUs on the host. For example, if a four-CPU host
is running a virtual machine with two CPUs, and the usage is 50%, the
host is using two CPUs completely.
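To make that concrete with made-up numbers: a host with 16 physical CPUs at 2.6 GHz has 16 × 2,600 MHz = 41,600 MHz available, so a cpu.usagemhz.average reading of about 10,400 MHz corresponds to a cpu.usage.average of roughly 25%.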
Explanations from Luc Dekens around the -Instance filter...
If the ESX/ESXi server is equipped with a quadcore CPU, there will be
four instances: 0, 1, 2 and 3. In this case the instance corresponds
with the numeric position within the CPU core
And there will be a so-called aggregate, which is the metric averaged
over all the instances.
These instances each get their own identifier which will be part of
the returned statistical data. The aggregate instance is always
represented by a blank identifier.
...and -MaxSamples
Although I asked for 1 sample (-MaxSamples 1) the cmdlet returned 9
values. The -MaxSamples parameter apparently only looks at the
Timestamp. It doesn’t count the number of returned values

How to get multiple GPUs of the same type on Slurm?

How can I create a job with multiple GPUs of the same type, without specifying that type directly? My experiment requires that all GPUs are of the same type, but that type can be any of the available ones.
Currently I am only able to create a job with multiple GPUs by saying exactly which type I want:
--gres=gpu:gres_type:amount
If I don't specify gres_type, then I sometimes get a mixed pack of GPUs (say, 2x Titan V and 2x Titan X).
If you are fortunate enough that the cluster is consistent in the types of nodes that host the GPUs, and the features of the nodes are properly specified and allow distinguishing between the nodes that host the different GPU types, you can use the --constraint parameter.
For the sake of argument, let's assume that the nodes that host the Titan V GPUs have Haswell CPUs, those that host the Titan X GPUs have Skylake CPUs, and that these are defined as the features haswell and skylake. Then you can request:
--gres=gpu:2
--constraint=[haswell|skylake]
If the above does not apply to your use case, you can submit two jobs and keep only the one that starts the earliest. For that, give your jobs an identical name, and use the singleton dependency.
Write a submission script like this one
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=gpujob
# Other options
scancel --state=PENDING --jobname=gpujob
# etc.
and submit it twice with
$ sbatch --gres=gpu:titanX:2 submit.sh
$ sbatch --gres=gpu:titanV:2 submit.sh
Each job will be assigned only one type of GPU, and the first one that starts will cancel the other. This approach scales to more than two GPU types.

How to run TensorFlow on multiple nodes with several CPUs each

I want to run linear regression with TensorFlow on very large datasets. I have a cluster with 9 nodes of 36 CPUs each. What is the best way to distribute the computations across all the available resources?
According to this course https://www.coursera.org/learn/intro-tensorflow, the best way to use TensorFlow in a distributed setting is to use Estimators. So I wrote my code as suggested there and followed the instructions at https://www.tensorflow.org/deploy/distributed for the parallelisation. I then tried to run my script my_code.py (on a "small" dataset with 120 million data points and 2 feature columns, to test the code) on nodes 2 and 3 as follows:
python my_code.py \
    --ps_hosts=node1:2222 \
    --worker_hosts=node2:2222,node3:2222 \
    --job_name=worker \
    --task_index="i-2"
where i is the number of the node (either 2 or 3), so --task_index is 0 or 1; on node 1 I do the same but with --job_name=ps and --task_index=0. However, this way it seems that only one CPU per node is used. Do I need to specify each CPU individually?
Thank you in advance.
As far as I understand, the best thing to do is to use all the CPUs on a node together as a single worker, in order to make the most of the shared memory. So, for example, in the case above one would manually specify only 9 workers and make sure that each of them corresponds to one node on which all 36 CPUs are used. The exact commands to do this depend on the specific cluster used.
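For example (a sketch only, using the TF 1.x Estimator API from the linked guide; the thread counts and feature column names are placeholders), each worker process can be pointed at all of a node's cores through the session configuration:
import tensorflow as tf

# Placeholder feature columns: the question mentions 2 feature columns.
feature_columns = [tf.feature_column.numeric_column(name) for name in ('x1', 'x2')]

session_config = tf.ConfigProto(
    intra_op_parallelism_threads=36,   # threads used within a single op (match cores per node)
    inter_op_parallelism_threads=2)    # threads used across independent ops

run_config = tf.estimator.RunConfig(session_config=session_config)

estimator = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    config=run_config)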

Parallelism in (I)Python with large blocks of data

I've been toiling with threads and processes for a while now to try to speed up my very parallel job in IPython. I'm not sure how much detail about the function I'm calling is useful, so here's an attempt at a summary, but ask if you need more.
My function's call signature looks like
def intersplit_array(ob,er,nl,m,mi,t,ti,dmax,n0=6,steps=50):
Basically, ob, er and nl are parameters for observed values, and m, mi, t, ti and dmax are parameters that represent the models against which the observations will be compared. (n0 and steps are fixed numerical parameters for the function.) The function loops through all the models in m and, using associated information in mi, t, ti and dmax, calculates a probability that each model matches. Note that m is quite big: it's a list of about 700,000 22x3 NumPy arrays. mi and dmax are of similar sizes. If relevant, my normal IPython instance uses about 25% of system memory in top: 4GB of my 16GB of RAM.
I've tried to parallelize this in two ways. First, I tried to use the parallel_map function given over at the SciPy Cookbook. I made the call
P = parallel_map(lambda i: intersplit_array(ob, er, nl, m[i+1], mi[i:i+2], t[i+1], ti[i:i+2], dmax[i+1]), range(1, len(m)-1))
which runs, and provides the correct answer. Without the parallel_ part, this is just the result of applying the function one by one to each element. But this is slower than using a single core. I guess this is related to the Global Interpreter Lock?
Second, I tried to use a Pool from multiprocessing. I initialized a pool with
p = multiprocessing.Pool(6)
and then tried to call my function with
P = p.map(lambda i: intersplit_array(ob, er, nl, m[i+1], mi[i:i+2], t[i+1], ti[i:i+2], dmax[i+1]), range(1, len(m)-1))
First, I get an error:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Having a look in top, I then see all the extra ipython processes, each of which is apparently taking up 25% of RAM (which can't be right, because I've still got 4GB free) and using 0% CPU. I presume they aren't doing anything. I can't use IPython, either. I tried Ctrl-C for a while, but gave up once I got past the 300th pool worker.
Does it work when run non-interactively?
multiprocessing doesn't play well with interactive use, because of the way it spawns processes. This is also why you had trouble killing it: it spawned so many processes. You would have to keep track of the master process to cancel it.
From the documentation:
Note
Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter.
...
If you try this it will actually output full tracebacks interleaved in a semi-random fashion, and then you may have to stop the master process somehow.
The best solution is probably to just run it as a script from the command line. Alternatively, IPython has its own system for parallel computing, but I've never used it.
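For completeness, a minimal sketch of the script-based approach (assuming intersplit_array and the observation/model arrays are defined or loaded at module level, earlier in the same script): the key points are the if __name__ == '__main__' guard and a module-level function, which, unlike a lambda, can be pickled and sent to the workers.
import multiprocessing

# intersplit_array and ob, er, nl, m, mi, t, ti, dmax are assumed to be
# defined/loaded at module level, before the pool is created.

def compute_one(i):
    # module-level functions (unlike lambdas) can be pickled for the workers
    return intersplit_array(ob, er, nl, m[i+1], mi[i:i+2],
                            t[i+1], ti[i:i+2], dmax[i+1])

if __name__ == '__main__':
    pool = multiprocessing.Pool(6)
    P = pool.map(compute_one, range(1, len(m) - 1))
    pool.close()
    pool.join()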