How to pass a function under the snakemake run directive

I am building a workflow in Snakemake and would like to reuse one of the rules for two different input sources. The input sources could be either source1 or source1+source2, and depending on the input, the output directory would also vary. Since this was quite complicated to do in the same rule, and I didn't want to create a copy of the full rule, I would like to create two rules with different input/output that run the same command.
Is it possible to make this work? The DAG resolves correctly, but the jobs don't go through on the cluster (ERROR: bamcov_cmd not defined).
An example below (both rules use the same command at the end):
This is the command:
def bamcov_cmd():
    return( (deepTools_path+"bamCoverage " +
             "-b {input.bam} " +
             "-o {output} " +
             "--binSize {params.bw_binsize} " +
             "-p {threads} " +
             "--normalizeTo1x {params.genome_size} " +
             "{params.read_extension} " +
             "&> {log}") )
This is the rule:
rule bamCoverage:
    input:
        bam = file1+"/{sample}.bam",
        bai = file1+"/{sample}.bam.bai"
    output:
        "bamCoverage/{sample}.filter.bw"
    params:
        bw_binsize = bw_binsize,
        genome_size = int(genome_size),
        read_extension = "--extendReads"
    log:
        "bamCoverage/logs/bamCoverage.{sample}.log"
    benchmark:
        "bamCoverage/.benchmark/bamCoverage.{sample}.benchmark"
    threads: 16
    run:
        bamcov_cmd()
This is the optional second rule:
rule bamCoverage2:
    input:
        bam = file2+"/{sample}.filter.bam",
        bai = file2+"/{sample}.filter.bam.bai"
    output:
        "bamCoverage/{sample}.filter.bw"
    params:
        bw_binsize = bw_binsize,
        genome_size = int(genome_size),
        read_extension = "--extendReads"
    log:
        "bamCoverage/logs/bamCoverage.{sample}.log"
    benchmark:
        "bamCoverage/.benchmark/bamCoverage.{sample}.benchmark"
    threads: 16
    run:
        bamcov_cmd()

What you ask is possible in Python.
It depends on whether the file contains just Python code, or Python and Snakemake.
I will answer that first, and then follow up with an alternative, because I would suggest setting it up differently so you don't have to do it this way.
Just Python:
from fileContainingMyBamCovCmdFunction import bamcov_cmd

rule bamCoverage:
    ...
    run:
        bamcov_cmd()
For a visual example, see how I do it in this file to reference buildHeader and buildSample. These files are called by a Snakefile; it should work the same for you.
https://github.com/LCR-BCCRC/workflow_exploration/blob/master/Snakemake/modules/py_buildFile/buildFile.py
EDIT 2017-07-23 - Updating code segment below to reflect user comment
Snakemake and Python:
include: "fileContainingMyBamCovCmdFunction.suffix"
rule bamCoverage:
...
run:
shell(bamcov_cmd())
EDIT END
If the function is truly specific to the bamCoverage call, you can put it back in the rule if you prefer. That implies it isn't being called elsewhere, which may well be true.
Be careful when annotating files using '.' notation; I use '_' as I find it easier to avoid creating cyclical dependencies this way.
Also, if you do end up keeping the two rules separate, you will likely run into ambiguity errors, since both rules produce the same output file.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules
When possible, it's best practice to have rules generating unique outputs.
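If you do keep both rules as they are, one way to resolve that ambiguity (a minimal sketch, using the rule names from your example) is an explicit ruleorder declaration near the top of the Snakefile:

    # tell Snakemake which rule wins when both could produce bamCoverage/{sample}.filter.bw
    ruleorder: bamCoverage > bamCoverage2

Distinct output names per rule, as in the alternative below, is still the cleaner option.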
As for alternatives, consider setting up the code like this:
from subprocess import call

rule all:
    input:
        "path/to/file/mySample.bw"
        # OR
        # "path/to/file/mySample_filtered.bw"

rule bamCoverage:
    input:
        bam = file1+"/{sample}.bam",
        bai = file1+"/{sample}.bam.bai"
    output:
        "bamCoverage/{sample}.bw"
    params:
        bw_binsize = bw_binsize,
        genome_size = int(genome_size),
        read_extension = "--extendReads"
    log:
        "bamCoverage/logs/bamCoverage.{sample}.log"
    benchmark:
        "bamCoverage/.benchmark/bamCoverage.{sample}.benchmark"
    threads: 16
    run:
        callString = deepTools_path + "bamCoverage " \
            + "-b " + str(input.bam) \
            + " -o " + str(output) \
            + " --binSize " + str(params.bw_binsize) \
            + " -p " + str(threads) \
            + " --normalizeTo1x " + str(params.genome_size) \
            + " " + str(params.read_extension) \
            + " &> " + str(log)
        call(callString, shell=True)

rule filterBam:
    input:
        "{pathFB}/{sample}.bam"
    output:
        "{pathFB}/{sample}_filtered.bam"
    run:
        callString = "samtools view -bh -F 512 " + str(input) \
            + " > " + str(output)
        call(callString, shell=True)
Thoughts?

Related

Nextflow: input and output a tuple with keys

I am processing files using Nextflow that have a sample ID, and I would like to carry this sample ID across processes, so I'm using tuples. The relevant snippet of the code is here:
process 'rsem_quant' {

    input:
    val genome from params.genome
    tuple val(sampleId), file(read1), file(read2) from samples_ch

    output:
    tuple sampleId, path "${sampleId}.genes.results" into rsem_ce

    script:
    """
    module load RSEM
    rsem-calculate-expression --star --keep-intermediate-files \
        --sort-bam-by-coordinate --star-output-genome-bam --strandedness reverse \
        --star-gzipped-read-file --paired-end $genome \
        $read1 $read2 $sampleId
    """
}
The problem is that when using a tuple as an output, I get the following error:
No such variable: sampleId
If I remove the tuple and just output either part (sampleId, or the path), it works fine. Any help is appreciated.
I was unable to reproduce the error with the code supplied. I suspect your output block needs to define the output type val for the 'sampleId' variable:
output:
tuple val(sampleId) , path("${sampleId}.genes.results") into rsem_ce
A minimal example to run RSEM on paired-end reads (using Conda) might look like:
nextflow.enable.dsl=2

params.ref_name = 'GRCh38_GENCODE_v31'
params.ref_fasta = 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz'
params.ref_gtf = 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.primary_assembly.annotation.gtf.gz'
params.strandedness = 'reverse'

include { gunzip as gunzip_fasta } from './gzip.nf'
include { gunzip as gunzip_gtf } from './gzip.nf'

process 'rsem_prepare_ref' {

    conda 'rsem star samtools'

    input:
    val ref_name
    path ref_fasta
    path ref_gtf

    output:
    path "${ref_name}"

    """
    mkdir "${ref_name}"
    rsem-prepare-reference \\
        --gtf "${ref_gtf}" \\
        --star \\
        "${ref_fasta}" \\
        "${ref_name}/${ref_name}"
    """
}

process 'rsem_calculate_expression' {

    tag { sample }

    conda 'rsem star samtools'

    input:
    tuple val(sample), path(reads)
    path ref_name

    output:
    tuple val(sample), path("${sample}.genes.results")

    script:
    def (read1, read2) = reads

    """
    rsem-calculate-expression \\
        --star \\
        --sort-bam-by-coordinate \\
        --star-output-genome-bam \\
        --strandedness "${params.strandedness}" \\
        --star-gzipped-read-file \\
        --paired-end \\
        "${read1}" \\
        "${read2}" \\
        "${ref_name}/${ref_name}" \\
        "${sample}"
    """
}

workflow {
    reads = Channel.fromFilePairs( './data/*_{1,2}.fastq.gz' )
    ref_fasta = gunzip_fasta( params.ref_fasta )
    ref_gtf = gunzip_gtf( params.ref_gtf )
    rsem_prepare_ref( params.ref_name, ref_fasta, ref_gtf )
    rsem_calculate_expression( reads, rsem_prepare_ref.out )
}
Contents of gzip.nf:
process gunzip {

    tag { gzfile.name }

    input:
    path gzfile

    output:
    path "${gzfile.getBaseName()}"

    when:
    gzfile.getExtension() == "gz"

    """
    gzip -dc "${gzfile}" > "${gzfile.getBaseName()}"
    """
}
Run using:
nextflow run test.nf -resume -ansi-log false
Results:
N E X T F L O W ~ version 21.04.3
Launching `main.nf` [awesome_poincare] - revision: 51040c89cc
[cf/ffec1a] Cached process > gunzip_fasta (GRCh38.primary_assembly.genome.fa.gz)
[ce/b7a04b] Cached process > gunzip_gtf (gencode.v38.primary_assembly.annotation.gtf.gz)
[f1/bcb8e3] Cached process > rsem_prepare_ref
[de/f7906e] Submitted process > rsem_calculate_expression (HBR_Rep2)
[1e/3984da] Submitted process > rsem_calculate_expression (UHR_Rep1)
[59/907f56] Submitted process > rsem_calculate_expression (UHR_Rep3)
[26/41db23] Submitted process > rsem_calculate_expression (HBR_Rep1)
[e8/2c98fe] Submitted process > rsem_calculate_expression (UHR_Rep2)
[03/bbb42b] Submitted process > rsem_calculate_expression (HBR_Rep3)

Nextflow: adding a def function into a script

I get errors like:
.command.sh: line 2: syntax error near unexpected token `('
/*
 * Step 3
 */
chr_length = file(params.chr_length)

process create_bedgraph_and_bigwig {

    publishDir "${params.outdir}/bedgraphandbigwig", mode: 'copy'

    input:
    set val(sample_id), file(vector_log) from vector_log_ch
    set val(sample_id), file(target_query_bam) from target_query_bam_ch
    file chr_length

    output:
    set val(sample_id), file("${sample_id}.bedgraph.log.txt") into bed_log_ch
    set val(sample_id), file("${sample_id}.bed") into bed_ch
    set val(sample_id), file("${sample_id}.clean.bed") into clean_bed_ch
    set val(sample_id), file("${sample_id}.fragments.bed") into fragments_bed_ch
    set val(sample_id), file("${sample_id}.sorted.fragments.bed") into sorted_fragments_bed_ch

    shell:
    '''
    def fp = file(${vector_log})
    def lines = fp.readLines()
    def line3 = lines[3].split(' ')[4].toInteger()
    def line4 = lines[4].split(' ')[4].toInteger()
    def aln_sum = (10000/(line3 + line4)).toString()
    bedtools bamtobed -bedpe -i !{target_query_bam} > !{sample_id}.bed 2>!{sample_id}.bedgraph.log.txt
    awk '$1==$4 && $6-$2 < 1000 {{print $0}}' !{sample_id}.bed > !{sample_id}.clean.bed 2>!{sample_id}.bedgraph.log.txt
    cut -f 1,2,6 !{sample_id}.clean.bed > !{sample_id}.fragments.bed 2>!{sample_id}.bedgraph.log.txt
    sort -k 1,1 !{sample_id}.fragments.bed > !{sample_id}.sorted.fragments.bed
    '''
}
The simple answer is to avoid using 'def' if the variable needs to be used in a shell definition or template. I couldn't actually find this after a quick search of the documentation, but I did find this note from the author:
Using groovy native string interpolation that would work, but when using the !{..} syntax scripts variable cannot be declared locally using the def keyword.
To summarise:
script/shell variables should be defensively declared in the local scope using the def keyword
do not use def when:
i. the variable needs to be referenced as an output value
ii. the variable needs to be used in a shell template
https://github.com/nextflow-io/nextflow/issues/678#issuecomment-386206123
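As a small illustration of that summary (a sketch only; the process name and the echoed text are invented, though it borrows sample_id and vector_log_ch from the question above), a variable that has to be interpolated with !{...} in a shell block is assigned without def:

process def_free_demo {

    input:
    set val(sample_id), file(vector_log) from vector_log_ch

    shell:
    // assigned WITHOUT def so it stays visible to the !{...} template below
    greeting = "processing sample " + sample_id
    '''
    echo "!{greeting}" > !{sample_id}.demo.log.txt
    '''
}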

Catching snakemake runtime errors with onerror

I'm working on a bioinformatics pipeline, and one of the rules performs genome assembly using SPAdes (https://github.com/ablab/spades):
rule perform_de_novo_assembly_using_spades:
    input:
        bam = input.bam
    output:
        spades_contigs = directory(_RESULTS_DIR + "assembled_spades/")
    threads:
        _NBR_CPUS
    run:
        out_dir = _RESULTS_DIR + "assembled_spades/"
        command = "spades.py -t " + str(threads) + " --12 " + input.bam + " -o " + out_dir
        shell(command + " || true")

... More rules

onsuccess: print("Pipeline completed successfully!")
onerror: print("One or more errors occurred. Please refer to log file.")
The problem is sometimes for some problematic input files SPAdes can fail, resulting in my workflow being terminated. Therefore I added " || true " to the shell command to run SPAdes (according to this post: What would be an elegant way of preventing snakemake from failing upon shell/R error?) so that workflow will continue despite SPAdes failing. However right now my pipeline will run and still gives the "Pipeline completed successfully!" onsuccess message at the end. Ideally I want to it to print the onerror message "One or more errors occurred. Please refer to log file." Is there a way to make my workflow continue to the end despite SPAdes giving a runtime error and snakemake catching the error so that it displays the onerror message at the end?

Custom Linux distro - mono runtime not found

I am trying to add mono to core-image-minimal for P202RDB custom Linux distro. Here is my bblayers.conf file:
# LAYER_CONF_VERSION is increased each time build/conf/bblayers.conf
# changes incompatibly
LCONF_VERSION = "6"
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta-yocto \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta-yocto-bsp \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-freescale \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-freescale-internal \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-freescale-extra \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-mono \
"
BBLAYERS_NON_REMOVABLE ?= " \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta-yocto \
"
Now, when I try to build the image using bitbake core-image-minimal, I get the following output:
Loading cache: 100% |##############################################################################################################| ETA: 00:00:00
Loaded 1496 entries from dependency cache.
NOTE: Resolving any missing task queue dependencies
Build Configuration:
BB_VERSION = "1.26.0"
BUILD_SYS = "x86_64-linux"
NATIVELSBSTRING = "Debian-8.6"
TARGET_SYS = "powerpc-fsl-linux-gnuspe"
MACHINE = "p2020rdb"
DISTRO = "fsl-qoriq"
DISTRO_VERSION = "1.9"
TUNE_FEATURES = "m32 spe ppce500v2"
TARGET_FPU = "ppc-efd"
meta
meta-yocto
meta-yocto-bsp = "(detachedfromb74ea96):ddf114933ccfc6e3ce51a10e8e8f95e514b73578"
meta-freescale = "(detachedfrom7fb32a2):7fb32a20983a0ebd5503eb42e851550b0deb8679"
meta-freescale-internal = "(detachedfrom220bff8):220bff8b2030e5af7393b5870d74c6f0af0d76d1"
meta-freescale-extra = "(nobranch):ced26c806cb566b1400a2f4f26a94d8d44d13233"
meta-mono = "daisy:f01b4f7a98d07abcf4c1f845c057199e112fb7d6"
NOTE: Preparing RunQueue
NOTE: Executing SetScene Tasks
NOTE: Executing RunQueue Tasks
NOTE: Tasks Summary: Attempted 1248 tasks of which 1248 didn't need to be rerun and all succeeded.
It seems the mono repository is found. I then prepare an SD card using this image and it boots without problems on the target board; however, the mono command is not available. What am I missing?
Add
IMAGE_INSTALL_append = " mono"
to your local.conf. Just adding a layer doesn't add any package to your image.
Even better, create your own image, and add mono to IMAGE_INSTALL in that recipe.
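A minimal custom image recipe along those lines might look like this sketch (the recipe file name is made up; it simply extends core-image-minimal):

# my-mono-image.bb -- hypothetical image recipe extending core-image-minimal
require recipes-core/images/core-image-minimal.bb

IMAGE_INSTALL += "mono"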

Turn absolute file paths and line numbers in the tool output into hyperlinks

This is an example output:
/usr/local/bin/node /usr/local/bin/elm-make src/elm/Main.elm --output=builds/main.js
-- TYPE MISMATCH ---------------------------------------------- src/elm/Main.elm
The type annotation for `init` does not match its definition.
35| init : Maybe Route.Location -> ( Model, Cmd Msg )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The type annotation is saying:
Maybe Route.Location -> ( { route : Maybe Route.Location }, Cmd Msg )
But I am inferring that the definition has this type:
Maybe Route.Location
-> ( { route : Maybe Route.Location -> Route.Model }, Cmd a )
Detected errors in 1 module.
Process finished with exit code 1
This is the regex that I came up with:
http://regexr.com/3egqu
However, creating an output filter out of it like this doesn't work.
Thus far, I only know that the following works: ------ ($FILE_PATH$)
And it turns the file path into a link:
Help me find a way to include the line numbers into the links.
Here's what I've come up with.
First,
elm-make --report json
outputs the build errors in structured JSON:
$ elm-make --report json src/main.elm
[{"tag":"unused import","overview":"Module `Bootstrap.CDN` is unused.","details":"Best to remove it. Don't save code quality for later!","region":{"start":{"line":3,"column":1},"end":{"line":3,"column":28}},"type":"warning","file":"src/main.elm"}]
Now you can pipe that output through jq (see here) to reformat it:
elm make src/main.elm --report json --output ./public/app.js | \
jq '.[] | { type: .type, file: .file, line: .region.start.line|tostring, tag: .tag, column: .region.start.column|tostring, details: .details }' | \
jq --raw-output '. | "[" + (.type|ascii_upcase) + "] " + .file + ":" + .line + ":" + .column + " " + .tag + " -- " + .details + "\n"'
That gives you the reformatted output:
[WARNING] src/main.elm:9:1 unused import -- Best to remove it. Don't save code quality for later!
[WARNING] src/main.elm:17:1 missing type annotation -- I inferred the type annotation so you can copy it into your code:
main : Program Never Model Main.Msg
You then pick this up in IntelliJ using the format:
$FILE_PATH$:$LINE$:$COLUMN$ $MESSAGE$
Clicking an error message then jumps to the file, with the error text shown in a tooltip.