Snakemake script access to stdin / stdout for stream processing

For a Snakemake workflow, I need to manipulate tags in many BAM files, and would like to process these by piping them through a script (using the Snakemake script: directive). The specific way I'm doing this is with pysam stream processing.
import pysam

# Stream BAM from stdin, rewrite the RG tag, and stream BAM to stdout
infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "wb", template=infile)
for s in infile:
    (flowcell, lane) = s.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    s.set_tag('RG', rg_id, 'Z')
    outfile.write(s)
This script works well standalone, but I haven't been able to figure out how to integrate it via the Snakemake script: directive.
I prefer this approach because it minimizes IO and RAM usage.
Edit: I resorted to opening the files directly to fix the RG tag.
import pysam

# Parameters passed from Snakemake
bam_file = snakemake.input[0]
fixed_bam_file = snakemake.output[0]

bamfile = pysam.AlignmentFile(bam_file, "rb")
fixed_bamfile = pysam.AlignmentFile(fixed_bam_file, "wb", template=bamfile)

for i, read in enumerate(bamfile.fetch()):
    (flowcell, lane) = read.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    read.set_tag('RG', rg_id, 'Z')
    fixed_bamfile.write(read)
    if not (i % 100000):
        print("Updated the read group for {} reads to {}".format(i, rg_id))

bamfile.close()
fixed_bamfile.close()
EDIT: Snakemake's run: and shell: directives respect the workdir: directory, while the script: directive resolves the script path relative to the location of the Snakefile (keeping everything nice and tidy). Hence the problem of putting a stream processor under script:.
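For reference, the rule side of this direct-loading version could look like the sketch below; the rule and script names are placeholders. The path given to script: is resolved relative to the Snakefile containing the rule, and the script receives the snakemake object used above:

rule fix_read_groups:
    input:
        "{sample}.bam"
    output:
        "{sample}_fixed.bam"
    script:
        # resolved relative to the Snakefile's directory, not the working directory
        "scripts/fix_rg.py"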

Using the shell directive instead of the script directive:
rule all:
    input:
        expand('{sample}_edited.bam', sample=['a', 'b', 'c'])

rule somename:
    input:
        '{sample}.bam'
    output:
        '{sample}_edited.bam'
    shell:
        '''
        cat {input} | python edit_bam.py > {output}
        '''

@Krischan it seems you found a solution already; if so, it may be good to post it as an answer.
Alternatively, you can use the {workflow} object to get the directory of the Snakefile and from there construct the path to your Python script. If your directory structure is:
./
├── Snakefile
├── data
│   └── sample.bam
└── scripts
    └── edit_bam.py
The Snakefile may look like:
rule all:
    input:
        'test.tmp',

rule one:
    input:
        'sample.bam',
    output:
        'test.tmp',
    shell:
        r"""
        cat {input} \
        | {workflow.basedir}/scripts/edit_bam.py > {output}
        """
Executed with snakemake -d data ...
The workflow object does not seem to be documented, but see this thread: Any way to get the full path of the Snakefile within the Snakefile?
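Note that the shell command above calls edit_bam.py directly rather than via python, so this assumes the script is executable (chmod +x scripts/edit_bam.py) and starts with a shebang line. A minimal sketch of such a filter, reusing the pysam logic from the question:

#!/usr/bin/env python3
import pysam

# Read BAM records from stdin, rewrite the RG tag, and write BAM to stdout
infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "wb", template=infile)
for s in infile:
    flowcell, lane = s.query_name.split(':')[0:2]
    s.set_tag('RG', ".".join([flowcell, lane]), 'Z')
    outfile.write(s)
outfile.close()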

gzip command block returns "Too many levels of symbolic links"

I'm trying to run a fairly simple gzip command across my fastq files, but a strange error is returned.
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

params.gzip = "sequences/sequences_split/sequences_trimmed/trimmed*fastq"

workflow {
    gzip_ch = Channel.fromPath(params.gzip)
    GZIP(gzip_ch)
    GZIP.out.view()
}

process GZIP {

    input:
    path read

    output:
    stdout

    script:
    """
    gzip ${read}
    """
}
Error:
Command error:
gzip: trimmed_SRR19573319_R2.fastq: Too many levels of symbolic links
I tried running a loop in the script instead, and running gzip on individual files works, but I would rather use the Nextflow syntax.
By default, Nextflow will try to stage process input files using symbolic links. The problem is that gzip actually ignores symbolic links. From the GZIP(1) man page:
The gzip command will only attempt to compress regular files. In particular, it will ignore symbolic links.
If the objective is to create a reproducible workflow, it's usually best to avoid modifying the workflow inputs directly anyway. One option is to use the stageInMode directive to change how the input files are staged. For example:
process GZIP {

    stageInMode 'copy'

    input:
    path fastq

    output:
    path "${fastq}.gz"

    """
    gzip "${fastq}"
    """
}
Or, preferably, just modify the command to redirect stdout to a file:
process GZIP {

    input:
    path fastq

    output:
    path "${fastq}.gz"

    """
    gzip -c "${fastq}" > "${fastq}.gz"
    """
}
Michael!
I can't reproduce your issue. I created the folders in my current directory as you described and put three files in the innermost one, as you can see below:
➜ ~ tree sequences/
sequences/
└── sequences_split
    └── sequences_trimmed
        ├── trimmed_a_fastq
        ├── trimmed_b_fastq
        └── trimmed_c_fastq
Then I copy-pasted your Nextflow script file; the only change I made was to use gzip -f ${read} instead of the version without the -f option. With that, everything worked fine. The reason you need -f is that Nextflow contains every task in a subfolder within work. This means your input files are symbolically linked, and gzip will complain that they're not regular files (that's what happened here, on macOS Ventura) or something similar (it may depend on the OS, I'm not sure). The -f option works around this.
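For reference, here is that single change in context (a sketch; everything else is exactly as in your script):

process GZIP {

    input:
    path read

    output:
    stdout

    script:
    """
    gzip -f ${read}
    """
}

With that change, the run completes: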
N E X T F L O W ~ version 22.10.1
Launching `ex2.nf` [golden_goldstine] DSL2 - revision: 70559e4bcb
executor > local (3)
[ad/447348] process > GZIP (1) [100%] 3 of 3 ✔
➜ ~ tree work
work
├── 0c
│   └── ded66d5f2e56cfa38d85d9c86e4e87
│       └── trimmed_a_fastq.gz
├── 67
│   └── 949c28cce5ed578e9baae7be2d8cb7
│       └── trimmed_c_fastq.gz
└── ad
    └── 44734845950f28f658226852ca4200
        └── trimmed_b_fastq.gz
They're gzip-compressed files (even though they may look just like text files, depending on the demo content). I decided to reply with an answer because it allows me to use markdown to show you how I did it. Feel free to comment on this answer if you want to discuss this topic.
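As a side note, if you want to verify that the outputs really are gzip data rather than plain text, a quick check could be (the globs below just match the work layout shown above):

file work/*/*/trimmed_*_fastq.gz              # should report "gzip compressed data"
gunzip -c work/*/*/trimmed_a_fastq.gz | head  # prints the first lines of the original content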

Snakemake - MissingInputException

My snakemake pipeline, containing 31 rules, is driving me crazy. It's a mapping and SNP calling pipeline that uses BWA and HaplotypeCaller, among others. I have created a conda environment for each rule, depending on the program used. My code is quite long and can be seen if needed at this address: https://github.com/ltalignani/SHAVE1
Concretely, when I want to build the DAG, Snakemake tells me that the haplotype_caller rule doesn't have the reference genome as input, even though it is declared in the rule. Here is the code concerned:
rule haplotype_caller_gvcf:
    # Aim: Call germline SNPs and indels via local re-assembly of haplotypes
    # Use: gatk --java-options '-Xmx{MEM_GB}g' HaplotypeCaller \
    #        -R Homo_sapiens_assembly38.fasta \
    #        -I input.bam \
    #        -O output.g.vcf.gz \
    #        -ERC GVCF  # Essential to GenotypeGVCFs: produce genotype likelihoods
    message:
        "HaplotypeCaller calling SNVs and Indels for {wildcards.sample} sample ({wildcards.aligner}-{wildcards.mincov})"
    conda:
        GATK4
    input:
        refpath = REFPATH,
        reference = REFERENCE,
        bam = "results/04_Variants/{sample}_{aligner}_{mincov}X_indel-qual.bam"
    output:
        gvcf = "results/04_Variants/haplotypecaller/{sample}_{aligner}_{mincov}X_variant-call.g.vcf"
    log:
        "results/11_Reports/haplotypecaller/{sample}_{aligner}_{mincov}X_variant-call.log"  # optional
    resources:
        mem_gb = MEM_GB,
    shell:
        "gatk HaplotypeCaller "  # --java-options '-Xmx{resources.mem_gb}g'
        "-R {input.refpath}{input.reference} "
        "-I {input.bam} "
        "-O {output.gvcf} "
        "-ERC GVCF"  # Essential to GenotypeGVCFs: produce genotype likelihoods
With the REFPATH and REFERENCE variables defined as follows in the snakefile header:
REFPATH = config["consensus"]["path"] # Path to reference genome
REFERENCE = config["consensus"]["reference"] # Genome reference sequence, in fasta format
And the config file in .yaml is like this:
consensus:
  reference: "GCA_018104305.1_AalbF3_genomic.fasta"
  path: "resources/genomes/"  # Path to genome reference
When I ask for the DAG:
snakemake -s workflow/rules/shave.smk --dag | dot -Tpng > test.png
I get this error:
MissingInputException in line 247 of /Users/loic/snakemake/short-read-alignment-vector-pipeline/workflow/rules/shave.smk:
Missing input files for rule haplotype_caller_gvcf:
GCA_018104305.1_AalbF3_genomic.fasta
Here is the structure of the Snakemake project: [screenshot of the project directory structure]
I also tried snakemake --lint, but the output was OK.
I've looked at your GitHub, and the folder resources/genomes/ only contains a file GCA_018104305.1_AalbF3_genomic.fasta.fai. Have you tried renaming this file to the expected input name GCA_018104305.1_AalbF3_genomic.fasta, i.e. getting rid of the .fai extension?
Thanks for your answer.
The fasta is 1.47 GB; that's why it's not in the resources/genomes folder.
The .fai is the fasta index, necessary for some programs like GATK.

Snakemake: catch output file whose name cannot be changed

As part of a Snakemake pipeline that I'm building, I have to use a program that does not allow me to specify the file path or name of an output file.
E.g. when running the program in the working directory workdir/ it produces the following output:
workdir/output.txt
My snakemake rule looks something like this:
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell: "somecommand {input} {output}"
So every time the rule NAME runs, I get an additional file output.txt in the snakemake working directory, which is then overwritten if the rule NAME runs multiple times or in parallel.
I'm aware of shadow rules, and adding shadow: "full" allows me to simply ignore the output.txt file. However, I'd like to keep output.txt and save it in the same directory as the outputfile. Is there a way of achieving this, either with the shadow directive or otherwise?
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
How about simply moving it directly afterwards in the shell part (provided somecommand completes successfully)?
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    params:
        output_dir = "path/to/output_dir",
    shell: "somecommand {input} {output} && mv output.txt {params.output_dir}/output.txt"
EDIT: for multiple executions of NAME in parallel, combining with shadow: "full" could work:
rule NAME:
    input: "path/to/inputfile"
    output:
        output_file = "path/to/outputfile",
        output_txt = "path/to/output_dir/output.txt"
    shadow: "full"
    shell: "somecommand {input} {output.output_file} && mv output.txt {output.output_txt}"
That should run each execution of the rule in its own temporary dir, and by specifying the moved output.txt as an output Snakemake should move it to the real output dir once the rule is done running.
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
I think you are on the right track here. Each shell block is run in a separate process with the working directory inherited from the snakemake process (specified with the --directory argument on the command line). Accordingly, cd commands in one shell block will not affect other jobs from the same rule or other downstream/upstream jobs.
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell:
        """
        input_file=$(realpath "{input}")  # get the absolute path, before the `cd`
        base_dir=$(dirname "{output}")
        cd "$base_dir"
        somecommand ...
        """

Does snakemake support non-file input?

I get a MissingInputException when I run the following rule:
import os

configfile: "Configs.yaml"

rule download_data_from_ZFIN:
    input:
        anatomy_item = config["ZFIN_url"]["anatomy_item"],
        xpat_stage_anatomy = config["ZFIN_url"]["xpat_stage_anatomy"],
        xpat_fish = config["ZFIN_url"]["xpat_fish"],
        anatomy_synonyms = config["ZFIN_url"]["anatomy_synonyms"]
    output:
        anatomy_item = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "anatomy_item.tsv"),
        xpat_stage_anatomy = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "xpat_stage_anatomy.tsv"),
        xpat_fish = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "xpat_fish.tsv"),
        anatomy_synonyms = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "anatomy_synonyms.tsv")
    shell:
        "wget -O {output.anatomy_item} {input.anatomy_item};"
        "wget -O {output.anatomy_synonyms} {input.anatomy_synonyms};"
        "wget -O {output.xpat_stage_anatomy} {input.xpat_stage_anatomy};"
        "wget -O {output.xpat_fish} {input.xpat_fish};"
And this is the content of my configs.yaml file:
ZFIN_url:
  # Zebrafish Anatomy Term
  anatomy_item: "https://zfin.org/downloads/file/anatomy_item.txt"
  # Zebrafish Gene Expression by Stage and Anatomy Term
  xpat_stage_anatomy: "https://zfin.org/downloads/file/xpat_stage_anatomy.txt"
  # ZFIN Genes with Expression Assay Records
  xpat_fish: "https://zfin.org/downloads/file/xpat_fish.txt"
  # Zebrafish Anatomy Term Synonyms
  anatomy_synonyms: "https://zfin.org/downloads/file/anatomy_synonyms.txt"

download_data_from_ZFIN:
  dir: ZFIN_data
The error message is:
Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NGS/coevolution/snakemake/coevolution.rule:
Missing input files for rule download_data_from_ZFIN:
https://zfin.org/downloads/file/anatomy_item.txt
I want to make sure: is this exception caused by the non-file input to the rule?
Note that you can also use remote files as input so you may avoid rule download_data_from_ZFIN altogether. E.g.:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

rule all:
    input:
        'output.txt',

rule one:
    input:
        # Some file from the web
        x = HTTP.remote('https://plasmodb.org/common/downloads/release-49/PbergheiANKA/txt/PlasmoDB-49_PbergheiANKA_CodonUsage.txt', keep_local=True)
    output:
        'output.txt',
    shell:
        r"""
        # Do something with the remote file
        head {input.x} > {output}
        """
The remote file will be downloaded and stored locally under plasmodb.org/common/.../PlasmoDB-49_PbergheiANKA_CodonUsage.txt
Many thanks @dariober, I tried the following code and it worked:
import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

configfile: "Configs.yaml"

HTTP = HTTPRemoteProvider()

rule all:
    input:
        expand(os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "{item}.tsv"),
               item=list(config["ZFIN_url"].keys()))

rule download_data_from_ZFIN:
    input:
        lambda wildcards: HTTP.remote(config["ZFIN_url"][wildcards.item], keep_local=True)
    output:
        os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "{item}.tsv")
    threads:
        1
    shell:
        "mv {input} {output}"
Such code is more Snakemake-like, but I have two further questions:
Is there a way to specify the output file name for the download? Right now I use the mv command to achieve that.
Does this remote-file feature support parallel downloads? I tried the above code with --cores 6, but it still downloads the files one by one.

ChildIOException when creating a subdirectory of a subdirectory

I'm looking for help, or some understanding, regarding an error.
I get a ChildIOException when I create a subdirectory inside a directory that was itself created as a subdirectory by a previous rule.
Basically, I have a rule that creates a directory with a couple of subdirectories and files through a first script. Then my second rule takes one particular subdirectory and creates another one inside the parent directory of that subdirectory, through another script. And my third rule takes that new subdirectory and creates yet another one inside it (with other files).
I don't understand why my rule 2 works while the third doesn't.
My workflow is as follows:
configfile: "config.yaml"

dirname = config["dirname"].values()
script_dir = config["script_dir"]

rule all:
    # Contains all outputs
    input:
        expand(["{dirname}/GFF/", "{dirname}/GFF/final_gffs/", "{dirname}/GFF/roary_results/",
                "{dirname}/GFF/roary_results/pangenome_multifastas/"], dirname=dirname)

rule prepa_gff:
    # Transform gbff files to gff through prepare_to_roary
    input:
        expand("{dirname}/GenBank/", dirname=dirname)
    output:
        gff_dir = directory(expand("{dirname}/GFF/", dirname=dirname)),
        gff_fin = directory(expand("{dirname}/GFF/final_gffs/", dirname=dirname))
    params:
        script_dir = script_dir
    message:
        "Converting gbff files into gff files."
    run:
        for dir in dirname:
            shell("cd {script_dir} && python3 prepare_to_roary.py -i {dir}/GenBank -o {dir}/GFF")

rule roary:
    # Launch roary, with the script itself launching the cluster for operating
    input:
        rules.prepa_gff.output.gff_fin
    output:
        dir = directory(expand("{dirname}/GFF/roary_results/", dirname=dirname)),
    params:
        script_dir = script_dir
    message:
        "Launching roary."
    run:
        for i in input:
            shell("cd {script_dir} && python3 roary_launcher.py -i {i}")

rule cluster_fasta:
    # Launch the script for creating multi-fasta files corresponding to each identified cluster
    input:
        rules.roary.output.dir
    output:
        directory(expand("{dirname}/GFF/roary_results/pangenome_multifastas/", dirname=dirname))
    params:
        script_dir = script_dir
    message:
        "Clustering in multi-fasta format."
    run:
        for i in input:
            shell("cd {script_dir} && python3 pan_genome_maker_T.py -i {i}")
ChildIOException:
File/directory is a child to another output:
('../Sero3/GFF/roary_results', roary)
('../Sero3/GFF/roary_results/pangenome_multifastas', cluster_fasta)
There is no strict order of execution if the rules have no dependencies. Your rule all: specifies several directories as targets, but they are nested, so effectively only the deepest ones are needed.
From the point of view of Snakemake, the goal is to create the directory "{dirname}/GFF/roary_results/pangenome_multifastas/", and the rest is irrelevant. What does the prepare_to_roary.py script do? I don't know, and neither does Snakemake.
Try to rethink your task in terms of the files that your pipeline produces, and disambiguate your intention.
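To make that concrete, one possible way to restructure the last two rules is to have each rule declare a touch() marker file outside the directories its script populates, so that no declared output is a child of another rule's output directory; rule all then requests the marker files instead of the nested directories, and prepa_gff's broad directory outputs could be narrowed the same way. A sketch only, with hypothetical .done file names:

rule roary:
    input:
        rules.prepa_gff.output.gff_fin
    output:
        # marker file next to the tree the script populates, instead of the directory itself
        touch(expand("{dirname}/roary.done", dirname=dirname))
    run:
        for i in input:
            shell("cd {script_dir} && python3 roary_launcher.py -i {i}")

rule cluster_fasta:
    input:
        rules.roary.output  # ordering dependency on roary via its marker files
    output:
        touch(expand("{dirname}/cluster_fasta.done", dirname=dirname))
    run:
        # iterate over the configured directories, as prepa_gff does, since the input is now a marker file
        for dir in dirname:
            shell("cd {script_dir} && python3 pan_genome_maker_T.py -i {dir}/GFF/roary_results")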