Is there a way to set parameters for the Java VM when using a Snakemake wrapper? - snakemake

When using tools like picard or fgbio through snakemake wrappers, I keep running into out-of-memory issues. At the moment I resort to direct shell calls, which allow me to set the VM's memory. I would prefer to pass these parameters to the wrapped tools. Is there a way, maybe through the resources directive, to pass something like mem_mb=10000? I tried, but have not gotten it to work yet.

I have never used the wrapper directive, but looking for example at markduplicates/wrapper.py, the shell command is picard MarkDuplicates {snakemake.params} .... So maybe using the params slot works?
rule markdups:
    input:
        'in.bam',
    output:
        bam= 'out.bam',
        metrics= 'metrics.tmp',
    params:
        mem= "-Xmx4g",
    wrapper:
        "0.31.0/bio/picard/markduplicates"
picard should understand that -Xmx... is a Java parameter.

According to the wrapper source (https://bitbucket.org/snakemake/snakemake-wrappers/src/bd3178f4b82b1856370bb48c8bdbb1932ace6a19/bio/picard/markduplicates/wrapper.py?at=master&fileviewer=file-view-default), it uses this command line:
from snakemake.shell import shell

shell("picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
      "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
      "&> {snakemake.log}")
So you could pass any options via the params: section.
If you check the picard executable script source:
cat `which picard`
You will find:
...
pass_args=""
for arg in "$@"; do
    case $arg in
        '-D'*)
            jvm_prop_opts="$jvm_prop_opts $arg"
            ;;
        '-XX'*)
            jvm_prop_opts="$jvm_prop_opts $arg"
            ;;
        '-Xm'*)
            jvm_mem_opts="$jvm_mem_opts $arg"
            ;;
        *)
            if [[ ${pass_args} == '' ]] #needed to avoid preceding space on first arg e.g. ' MarkDuplicates'
            then
                pass_args="$arg"
            else
                pass_args="$pass_args \"$arg\"" #quotes later arguments to avoid problem with ()s in MarkDuplicates regex arg
            fi
            ;;
    esac
done
...
So I assume this should work:
rule markdups:
    input:
        "in.bam",
    output:
        bam = "out.bam",
        metrics = "metrics.tmp",
    params:
        "-Xmx10000m"
    wrapper:
        "0.31.0/bio/picard/markduplicates"

Related

Nextflow: publishDir, output channels, and output subdirectories

I've been trying to learn how to use Nextflow and came across an issue with adding output to a channel, as I need the processes to run in a specific order. I want to pass output files from one of the output subdirectories created by the tool (ONT-Guppy) into a channel, but can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {
    publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"

    output:
    path "*.bam" into bams_ch

    script:
    """
    module load cuda/11.4.2
    singularity exec --nv $params.Gup_container \
        guppy_basecaller --config $params.P1_gupConf \
        --device "cuda:0" \
        --bam_out \
        --recursive \
        --compress \
        --align_ref $params.refGen \
        -i $params.P1_inDir \
        -s $params.P1_outDir \
        --gpu_runners_per_device $params.P1_GPU_runners \
        --num_callers $params.P1_callers
    """
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the bam files in $params.P1_outDir/pass/, hence trying to use pattern: "pass/*.bam", but I've tried a few other patterns to no avail.
The output syntax was chosen because, once this process is done, the following channel works:
// Channel
//     .fromPath("${params.P1_outDir}/pass/*.bam")
//     .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
//     .set { bams_ch }
But the problem is that if I don't pass the files into the output channel of the first process, the processes run in parallel. I could simply be missing something in the extensive documentation on how to order processes, which would be an alternative solution.
Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line input and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'

input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )

process GuppyBasecaller {

    publishDir(
        path: "${params.publishDir}/GuppyBasecaller",
        mode: 'copy',
        saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
    )

    beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'

    container '/path/to/guppy_basecaller.img'

    input:
    path input_dir
    path guppy_config
    path ref_genome

    output:
    path "outdir/pass/*.bam" into bams_ch

    script:
    """
    mkdir outdir
    guppy_basecaller \\
        --config "${guppy_config}" \\
        --device "cuda:0" \\
        --bam_out \\
        --recursive \\
        --compress \\
        --align_ref "${ref_genome}" \\
        -i "${input_dir}" \\
        -s outdir \\
        --gpu_runners_per_device "${params.guppy_gpu_runners}" \\
        --num_callers "${params.guppy_callers}"
    """
}

Snakemake-Wildcards in input files cannot be determined from output files

I'm kinda new at snakemake and I'm trying to understand how it works.
I tried to put together a simple Snakefile:
from snakemake.utils import min_version

min_version("5.3.0")

max_reads: 250000
sra_id: ["SRR1187735"]

rule all:
    input:
        "DATA/{sra_id}.fastq.gz"

rule prefetch:
    output:
        "DATA/{sra_id}.fastq.gz"
    params:
        max_reads = "max_reads"
    version: "1.0"
    shell:
        "conda activate sra-tools-2.10.1 "
        "&& "
        "fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
        --dumpbase --skip-technical --gzip -Z > {output} "
        "&& "
        "conda deactivate "
But I'm getting this error:
WildcardError in line 5 of /save_home/skastalli/test_rule/Snakefile:
Wildcards in input files cannot be determined from output files:
'sra_id'
Can someone help me, please?
Most of your rules should be generalizable and include wildcards, so prefetch is correct. At some point you have to ask for a specific file for snakemake to try to generate. This can happen in several spots of the workflow, but at least the rule all needs to ask for specific files. Based on your preamble, I think the input of your rule all should be:
rule all:
    input:
        expand("DATA/{sra_id}.fastq.gz", sra_id=sra_id)
Also, because snakemake runs its shell commands in strict mode (any failing command aborts the job), you don't need the && in the shell directive and can just use newlines instead. Personal preference though.
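For example, a minimal sketch of the prefetch rule without the &&s (keeping the question's commands; max_reads is hard-coded here for illustration, and whether conda activate works inside a non-interactive snakemake shell is a separate question):

rule prefetch:
    output:
        "DATA/{sra_id}.fastq.gz"
    params:
        max_reads = 250000
    shell:
        """
        conda activate sra-tools-2.10.1
        fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
            --dumpbase --skip-technical --gzip -Z > {output}
        conda deactivate
        """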

snakemake unpack with shell & conda

I have the basic "input can be single end or paired end reads" problem for my snakemake pipeline. I'd like to use unpack if possible, since it seems designed for this situation (as illustrated in the answer for this issue), but I also want to use conda:, which requires shell:. I believe that shell: will die if I have {input.read2} but it's not provided by unpack(). Is there any good way of getting around this besides either 1) creating 2 nearly identical rules or 2) making an empty read2 (if single-end) and then adding an if-else in the shell to check whether read2 is empty? Neither is ideal.
Try to combine your input function with a params function to generate the flags for either paired or single end. Using the bowtie example from your link:
def bowtie2_inputs(wildcards):
    if (seq_type == "pe"):
        return expand("{reads}_{strand}.fastq", strand=["R1", "R2"], reads=wildcards.reads)
    elif (seq_type == "se"):
        return expand("{reads}.fastq", reads=wildcards.reads)

def bowtie2_params(wildcards, input):
    if (seq_type == "pe"):
        return f'-1 {input.reads[0]} -2 {input.reads[1]}'
    else:
        return f'-U {input.reads}'

rule bowtie2:
    input:
        reads=bowtie2_inputs,
        index=bowtie2_index
    output:
        sam="{reads}_bowtie2.sam"
    params:
        file_args=bowtie2_params
    conda: <env>
    shell:
        "bowtie2 -x {input.index} {params.file_args} -S {output.sam}"
Not sure it's any better than the shell option. I would use two rules with a ruleorder preferring the paired ends, e.g. as sketched below. That would be easier to modify if you wanted, say, a different aligner or to change parameters for each case. As is, this requires a bit of jumping around to actually see what the rule does.
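A minimal sketch of the two-rule alternative with ruleorder (file patterns and the bowtie2_index reference mirror the example above; the conda directive is omitted for brevity). If the paired-end inputs cannot be found or generated, snakemake should fall back to the single-end rule; otherwise the ruleorder prefers the paired-end one:

ruleorder: bowtie2_pe > bowtie2_se

rule bowtie2_pe:
    input:
        reads=expand("{{reads}}_{strand}.fastq", strand=["R1", "R2"]),
        index=bowtie2_index
    output:
        sam="{reads}_bowtie2.sam"
    shell:
        "bowtie2 -x {input.index} -1 {input.reads[0]} -2 {input.reads[1]} -S {output.sam}"

rule bowtie2_se:
    input:
        reads="{reads}.fastq",
        index=bowtie2_index
    output:
        sam="{reads}_bowtie2.sam"
    shell:
        "bowtie2 -x {input.index} -U {input.reads} -S {output.sam}"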

How to stop snakemake from adding non file endings to wildcards when using expand function? (.g.vcf fails, .vcf works)

Adding .g.vcf instead of .vcf after the variable in the expand rule somehow adds the .g to a wildcard in another module.
I have tried the following in the all rule:
{stuff}.g.vcf
{stuff}"+"g.vcf"
{stuff}_var"+".g.vcf"
{stuff}.t.vcf
All of these fail, but {stuff}.gvcf or {stuff}.vcf work.
Error:
InputFunctionException in line 21 of snake_modules/mark_duplicates.snakefile:
KeyError: 'Mother.g'
Wildcards:
lane=Mother.g
Code:
LANES = config["list2"].split()

rule all:
    input:
        expand(projectDir+"results/alignments/variants/{stuff}.g.vcf", stuff=LANES)

rule mark_duplicates:
    """ this will mark duplicates for bam files from the same sample and library """
    input:
        get_lanes
    output:
        projectDir+"results/alignments/markdups/{lane}.markdup.bam"
    log:
        projectDir+"logs/"+stamp+"_{lane}_markdup.log"
    shell:
        " input=$(echo '{input}' |sed -e s'/ / I=/g') && java -jar /home/apps/pipelines/picard-tools/CURRENT MarkDuplicates I=$input O={projectDir}results/alignments/markdups/{wildcards.lane}.markdup.bam M={projectDir}results/alignments/markdups/{wildcards.lane}.markdup_metrics.txt &> {log}"
I want my final output to have the {stuff}.g.vcf notation. Please note this output is created in another snakemake module, but the error appears in mark_duplicates, which runs before the other module.
I have tried multiple changes, but it is the .g.vcf in the all rule that causes the issue.
My guess is that {lane} is matched as a regular expression and is capturing more than it should. Try adding this before rule all:
import re

wildcard_constraints:
    stuff= '|'.join([re.escape(x) for x in LANES]),
    lane= '|'.join([re.escape(x) for x in LANES])
(See also this thread https://groups.google.com/forum/#!topic/snakemake/wVlJW9X-9EU)
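To see why the default pattern can swallow the .g, here is a minimal illustration in plain Python (the lane names and the {lane}.vcf output pattern are assumptions for the sake of the example; snakemake's default wildcard regex is .+):

import re

# The default wildcard pattern ".+" greedily matches across the dot:
m = re.fullmatch(r"(?P<lane>.+)\.vcf", "Mother.g.vcf")
print(m.group("lane"))  # -> Mother.g

# Constrained to the known lane names, "Mother.g" can no longer be captured,
# so that rule is no longer considered a producer of Mother.g.vcf:
print(re.fullmatch(r"(?P<lane>Mother|Father)\.vcf", "Mother.g.vcf"))  # -> None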

Snakemake - input function exception

I am trying to run snakemake code using a .json file as input. While checking the dry run I got the following error:
InputFunctionException in line 172 of /home/Snakefile_ChIPseq_pe:
KeyError: '130241_1'
Wildcards:
library=130241_1
This is the relevant part of the snakemake code:
rule findPeaks:
    input:
        sample = os.path.join(HOMERTAG_DIR, "{library}"),
        input = lambda wildcards: os.path.join(HOMERTAG_DIR, config['lib_input'][wildcards.library])
    output:
        os.path.join(HOMERPEAK_DIR, "{library}.all.hpeaks")
    params:
        config['homer_findPeaks_params']
    shell:
        "findPeaks {input.sample} -i {input.input} {params} -o {output}"
There is a single quote around the input sample which is missing in the 'lib_input' part. How can I add that single quote ahead of the variable?
Also, library names are like 12345_1, 12345_2, etc. I never had this problem before, but this is the first time I have libraries with an underscore in the names.
Snakemake will first try to interpret the given value as a number; only if that fails will it interpret the value as a string. Here, it does not fail, because the underscore _ is interpreted as a thousands separator.
My guess is that in your json file the library IDs are not quoted. E.g. you have this:
{
    "lib_input": {1234_1: "input.txt"}
}
Instead of:
{
    "lib_input": {"1234_1": "input.txt"}
}
Or maybe library 130241_1 is not in the json at all?
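As a quick illustration of the thousands-separator effect in plain Python (this assumes the unquoted ID is at some point evaluated as a numeric literal):

# An underscore inside a numeric literal is just a digit separator,
# so an unquoted 130241_1 collapses to an ordinary integer:
print(130241_1)               # -> 1302411
print(130241_1 == 1302411)    # -> True

# Quoting keeps the ID intact and usable as a dictionary key:
lib_input = {"130241_1": "input.txt"}
print(lib_input["130241_1"])  # -> input.txt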