Snakemake: strip off path from input - snakemake

I want to write a pipeline in snakemake that takes an input file from config.yaml, runs a command, and writes the output to the current directory under the original filename + new suffix.
Snakefile
configfile: "config.yaml"
rule target:
input:
config["reads"]+".fasta.gz",
rule raw_convert:
input:
config["reads"]
output:
config["reads"]+".fasta.gz" # old path specified here
shell:
"sed -n '1~4s/^#/>/p;2~4p' {input} | gzip > {output}"
config.yaml
reads: /path/to/dir/myreads.fq.gz
Using bash, I would write something like to get the file myreads.fq.gz.fasta.gz:
sed -n '1~4s/^#/>/p;2~4p' ${input} | gzip >$(basename ${input}).fasta.gz

In this solution, I pair read basenames to their full path in a dict, and then use it in rules. This would fail if basenames are not unique though.
import os
d = {}
for read in config["reads"]:
basename = os.path.basename(read)
d[basename] = read
rule all:
input:
expand('{read_basename}.fasta.gz', read_basename=list(d.keys()))
rule xxx:
input:
lambda wildcards: d[wildcards.read_basename]
output:
"{read_basename}.fasta.gz"
shell:
'soemthing'
You may want to replace .fq.gz with .fasta.gz instead of appending them. For readability purposes.

I finally came up with some code that seems to do the trick:
configfile: "config.yaml"
import os
basenamereads = os.path.basename(config["reads"])
rule target:
input: expand("{myoutput}.fasta.gz", myoutput=basenamereads)
rule xxx:
input:
config["reads"]
output:
os.path.basename(config["reads"])+".fasta.gz"
shell:
"cat {input} >{output}"

Related

Using * to glob within an input file, or using multiple wildcards in input, then using only one wildcard for output?

Is there a way to write a rule so that I don't need to use the wildcards for all inputs/outputs or can I use a "*" to glob rather than using wildcards?? I want to symlink a file that is autocreated in subfolders, to the main directory.
This is the error I get when trying to run the Snakemake:
WildcardError in line 42 of snakemake_guppy_basecall/Snakefile:
Wildcards in input files cannot be determined from output files:
'failpass'
import glob
configfile: "config.yaml"
inputdirectory=config["directory"]
SAMPLES, = glob_wildcards(inputdirectory+"/{sample}.fast5", followlinks=True)
print(SAMPLES)
wildcard_constraints:
sample="\w+\d+_\w+_\w+\d+_.+_\d"
##### target rules #####
rule all:
input:
#expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
"qc/multiqc.html"
rule make_indvidual_samplefiles:
input:
inputdirectory+"/{sample}.fast5",
output:
"lists/{sample}.txt",
shell:
"basename {input} > {output}"
rule guppy_basecall_persample:
input:
directory=directory(inputdirectory),
samplelist="lists/{sample}.txt",
output:
summary="basecall/{sample}/sequencing_summary.txt",
directory=directory("basecall/{sample}/"),
params:
config["basealgo"]
shell:
"guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} -s {output.directory} -c {params} --compress_fastq -x \"auto\" --gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"
rule guppy_linkfastq:
input:
#glob_wildcards("basecall/{sample}/*/*.fastq.gz"),
"basecall/{sample}/{failpass}/{runid}.fastq.gz",
output:
"basecall/{sample}.fastq.gz",
shell:
"ln -s {input} {output}"
rule fastqc_pretrim:
input:
#"basecall/{sample}/{failpass}/{runid}.fastq.gz",
"basecall/{sample}.fastq.gz"
output:
html="qc/fastqc_pretrim/{sample}.html",
zip="qc/fastqc_pretrim/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: ""
log:
"logs/fastqc_pretrim/{sample}.log"
threads: 1
wrapper:
"v0.75.0/bio/fastqc"
rule multiqc:
input:
#expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
output:
"qc/multiqc.html"
params:
"" # Optional: extra parameters for multiqc.
log:
"logs/multiqc.log"
wrapper:
"0.77.0/bio/multiqc"
I am trying to create a pipeline that does: Get Nanopore f5 sequence files -> run guppy basecaller GPU mode -> use resulting fastq files to run FASTQC -> run multiQC for everything

snakemake - define input for aggregate rule without wildcards

I am writing a snakemake to produce Sars-Cov-2 variants from Nanopore sequencing. The pipeline that I am writing is based on the artic network, so I am using artic guppyplex and artic minion.
The snakemake that I wrote has the following steps:
zip all the fastq files for all barcodes (rule zipFq)
perform read filtering with guppyplex (rule guppyplex)
call the artic minion pipeline (rule minion)
move the stderr and stdout from qsub to a folder under the working directory (rule mvQsubLogs)
Below is the snakemake that I wrote so far, which works
barcodes = ['barcode49', 'barcode50', 'barcode51']
rule all:
input:
expand([
# zip fq
"zipFastq/{barcode}/{barcode}.zip",
# guppyplex
"guppyplex/{barcode}/{barcode}.fastq",
# nanopolish
"nanopolish/{barcode}",
# directory where the logs will be moved to
"logs/{barcode}"
], barcode = barcodes)
rule zipFq:
input:
FQ = f"{FASTQ_PATH}/{{barcode}}"
output:
"zipFastq/{barcode}/{barcode}.zip"
shell:
"zip {output} {input.FQ}/*"
rule guppyplex:
input:
FQ = f"{FASTQ_PATH}/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
output:
"guppyplex/{barcode}/{barcode}.fastq"
shell:
"/home/ngs/miniconda3/envs/artic-ncov2019/bin/artic guppyplex --skip-quality-check --min-length {MINLENGTHGUPPY} --max-length {MAXLENGTHGUPPY} --directory {input.FQ} --prefix {wildcards.barcode} --output {output}" # variables in CAPITALS are parsed from config.yaml
rule minion:
input:
INFQ = rules.guppyplex.output,
FAST5 = f"{FAST5_PATH}/{{barcode}}"
params:
OUTDIR = "nanopolish/{barcode}"
output:
directory("nanopolish/{barcode}")
shell:
"""
mkdir {params.OUTDIR};
cd {params.OUTDIR};
export PATH=/home/ngs/miniconda3/envs/artic-ncov2019/bin:$PATH;
artic minion --normalise {NANOPOLISH_NORMALISE} --threads {THREADS} --scheme-directory {PRIMERSDIR} --read-file ../../{input.INFQ} --sequencing-summary {Seq_Sum} --fast5-directory {input.FAST5} nCoV-2019/{PRIMERVERSION} {wildcards.barcode} # variables in CAPITALS are parsed from config.yaml
"""
rule mvQsubLogs:
input:
# zipFQ
rules.zipFq.output,
# guppyplex
rules.guppyplex.output,
# nanopolish
rules.minion.output
output:
directory("logs/{barcode}")
shell:
"mkdir -p {output} \n"
"mv {LOGDIR}/{wildcards.barcode}* {output}/"
The above snakemake works and now I am trying to add another rule, but the difference here is that this rule is an aggregate function i.e. it should not be called for every barcode, but only once after all the rules are called for all barcodes
The rule that I am trying to incorporate (catFasta) would cat all {barcode}.consensus.fasta (generated by rule minion) into in a single file, as shown below (incorporated into the snakemake above):
barcodes = ['barcode49', 'barcode50', 'barcode51']
rule all:
input:
expand([
# zip fq
"zipFastq/{barcode}/{barcode}.zip",
# guppyplex
"guppyplex/{barcode}/{barcode}.fastq",
# nanopolish
"nanopolish/{barcode}",
# catFasta
"catFasta/cat_consensus.fasta",
# directory where the logs will be moved to
"logs/{barcode}"
], barcode = barcodes)
rule zipFq:
input:
FQ = f"{FASTQ_PATH}/{{barcode}}"
output:
"zipFastq/{barcode}/{barcode}.zip"
shell:
"zip {output} {input.FQ}/*"
rule guppyplex:
input:
FQ = f"{FASTQ_PATH}/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
output:
"guppyplex/{barcode}/{barcode}.fastq"
shell:
"/home/ngs/miniconda3/envs/artic-ncov2019/bin/artic guppyplex --skip-quality-check --min-length {MINLENGTHGUPPY} --max-length {MAXLENGTHGUPPY} --directory {input.FQ} --prefix {wildcards.barcode} --output {output}" # variables in CAPITALS are parsed from config.yaml
rule minion:
input:
INFQ = rules.guppyplex.output,
FAST5 = f"{FAST5_PATH}/{{barcode}}"
params:
OUTDIR = "nanopolish/{barcode}"
output:
directory("nanopolish/{barcode}")
shell:
"""
mkdir {params.OUTDIR};
cd {params.OUTDIR};
export PATH=/home/ngs/miniconda3/envs/artic-ncov2019/bin:$PATH;
artic minion --normalise {NANOPOLISH_NORMALISE} --threads {THREADS} --scheme-directory {PRIMERSDIR} --read-file ../../{input.INFQ} --sequencing-summary {Seq_Sum} --fast5-directory {input.FAST5} nCoV-2019/{PRIMERVERSION} {wildcards.barcode} # variables in CAPITALS are parsed from config.yaml
"""
rule catFasta:
input:
expand("nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
output:
"catFasta/cat_consensus.fasta"
shell:
"cat {input} > {output}"
rule mvQsubLogs:
input:
# zipFQ
rules.zipFq.output,
# guppyplex
rules.guppyplex.output,
# nanopolish
rules.minion.output,
# catFasta
rules.catFasta.output
output:
directory("logs/{barcode}")
shell:
"mkdir -p {output} \n"
"mv {LOGDIR}/{wildcards.barcode}* {output}/"
However, when I call snakemake with
(artic-ncov2019) ngs#bngs05b:/nexusb/SC2/ONT/scripts/SnakeMake> snakemake -np -s Snakefile_v2 --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" all # LOGDIR parsed from config.yaml
I get:
Building DAG of jobs...
MissingInputException in line 178 of /nexusb/SC2/ONT/scripts/SnakeMake/Snakefile_v2:
Missing input files for rule guppyplex:
/nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/barcode49/barcode49.consensus.fasta
Which I don't find easy to understand: snakemake is complaining about /nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/barcode49/barcode49.consensus.fasta whereas /nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/ is FASTQ_PATH and I am not defining f"{FASTQ_PATH}/{{barcode}}.consensus.fasta" anywhere
A very same problem is described here, though the strategy in the accepted answer (the input for rule catFasta would be expand("nanopolish/{{barcode}}/{{barcode}}.consensus.fasta")) does not work for me.
Does anyone know how I can circumvent this?
The rule that fails is rule guppyplex, which looks for an input in the form of {FASTQ_PATH}/{{barcode}}.
Looks like the wildcard {barcode} is filled with barcode49/barcode49.consensus.fasta, which happened because of two reasons I think:
First (and most important): The workflow does not find a better way to produce the final output. In rule catFasta, you give an input file which is never described as an output in your workflow. The rule minion has the directory as an output, but not the file, and it is not perfectly clear for the workflow where to produce this input file.
It therefore infers that the {barcode} wildcard somehow has to contain this .consensus.fasta that it has never seen before. This wildcard is then handed over to the top, where the workflow crashes since it cannot find a matching input file.
Second: This initialisation of the wildcard with sth. you don't want is only possible since you did not constrain the wildcard properly. You can for example forbid the wildcard to contain a . (see wildcard_constraints here)
However, the main problem is that catFasta does not find the desired input. I'd suggest changing the output of minion to "nanopolish/{barcode}/{barcode}.consensus.fasta", since the you already take the OUTDIR from the params, that should not hurt your rule here.
Edit: Dummy test example:
barcodes = ['barcode49', 'barcode50', 'barcode51']
rule all:
input:
expand([
# guppyplex
"guppyplex/{barcode}/{barcode}.fastq",
# catFasta
"catFasta/cat_consensus.fasta",
], barcode = barcodes)
rule guppyplex:
input:
FQ = f"fastq/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
output:
"guppyplex/{barcode}/{barcode}.fastq"
shell:
"touch {output}" # variables in CAPITALS are parsed from config.yaml
rule minion:
input:
INFQ = rules.guppyplex.output,
FAST5 = f"fasta/{{barcode}}"
params:
OUTDIR = "nanopolish/{barcode}"
output:
"nanopolish/{barcode}/{barcode}.consensus.fasta"
shell:
"""
touch {output} && echo {wildcards.barcode} > {output}
"""
rule catFasta:
input:
expand("nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
output:
"catFasta/cat_consensus.fasta"
shell:
"cat {input} > {output}"

Does snakemake support none file Input?

I get a MissingInputException when I run the following rule:
configfile: "Configs.yaml"
rule download_data_from_ZFIN:
input:
anatomy_item = config["ZFIN_url"]["anatomy_item"],
xpat_stage_anatomy = config["ZFIN_url"]["xpat_stage_anatomy"],
xpat_fish = config["ZFIN_url"]["xpat_fish"],
anatomy_synonyms = config["ZFIN_url"]["anatomy_synonyms"]
output:
anatomy_item = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "anatomy_item.tsv"),
xpat_stage_anatomy = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "xpat_stage_anatomy.tsv"),
xpat_fish = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "xpat_fish.tsv"),
anatomy_synonyms = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "anatomy_synonyms.tsv")
shell:
"wget -O {output.anatomy_item} {input.anatomy_item};" \
"wget -O {output.anatomy_synonyms} {input.anatomy_synonyms};" \
"wget -O {output.xpat_stage_anatomy} {input.xpat_stage_anatomy};" \
"wget -O {output.xpat_fish} {input.xpat_fish};"
And this is the content of my configs.yaml file:
ZFIN_url:
# Zebrafish Anatomy Term
anatomy_item: "https://zfin.org/downloads/file/anatomy_item.txt"
# Zebrafish Gene Expression by Stage and Anatomy Term
xpat_stage_anatomy: "https://zfin.org/downloads/file/xpat_stage_anatomy.txt"
# ZFIN Genes with Expression Assay Records
xpat_fish: "https://zfin.org/downloads/file/xpat_fish.txt"
# Zebrafish Anatomy Term Synonyms
anatomy_synonyms: "https://zfin.org/downloads/file/anatomy_synonyms.txt"
download_data_from_ZFIN:
dir: ZFIN_data
The error message is:
Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NGS/coevolution/snakemake/coevolution.rule:
Missing input files for rule download_data_from_ZFIN:
https://zfin.org/downloads/file/anatomy_item.txt
I want to make sure that if this exception is caused by none file input for the input rule?
Note that you can also use remote files as input so you may avoid rule download_data_from_ZFIN altogether. E.g.:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
'output.txt',
rule one:
input:
# Some file from the web
x= HTTP.remote('https://plasmodb.org/common/downloads/release-49/PbergheiANKA/txt/PlasmoDB-49_PbergheiANKA_CodonUsage.txt', keep_local=True)
output:
'output.txt',
shell:
r"""
# Do something with the remote file
head {input.x} > {output}
"""
The remote file will be downloaded and stored locally under plasmodb.org/common/.../PlasmoDB-49_PbergheiANKA_CodonUsage.txt
Many thanks #dariober, I tried the follwing code and it worked,
import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
configfile: "Configs.yaml"
HTTP = HTTPRemoteProvider()
rule all:
input:
expand(os.path.join(os.getcwd(),config["download_data_from_ZFIN"]["dir"],"{item}.tsv"),
item=list(config["ZFIN_url"].keys()))
rule download_data_from_ZFIN:
input:
lambda wildcards: HTTP.remote(config["ZFIN_url"][wildcards.item], keep_local=True)
output:
os.path.join(os.getcwd(),config["download_data_from_ZFIN"]["dir"],"{item}.tsv")
threads:
1
shell:
"mv {input} > {output}"
Such code is more snakemake-like, but I have two further questions:
Is there a way to specify the output file name for the downloading? Now I use the mv command to achieve that.
Does this remote files function support parallel works? I tried the above code together with --cores 6, but it still download the file one by one.

snakemake: copying a file to multiple folders

I'm new to snakemake. I have a rule to copy a file to multiple folders. The folders are made in python.
I must be misunderstanding something about working with multiple targets.
The following code, when run with "Snakemake practice_phased_reversed.vcf"
returns "No rule to produce practice_phased_reversed.vcf"
s=['k_1','k2_10']
fullfs = []
import os
cdir = os.getcwd()
for f in fs:
path = os.path.join(cdir,f)
fullfs.append(path)
try:
os.mkdir(path)
except:
pass
rule r1:
input:
"{basename}_phased_reversed.vcf"
output:
expand("{f}/{{basename}}_phased_reversed.vcf",f=fullfs)
shell:
"cp {input} {output}"
Thanks to Maarten-vd-Sande. Now this works using "snakemake" i.e. without passing a target file name.
dirs=['k_1','k2_10']
rule all:
input:
expand("{f}/practice_phased_reversed.vcf",f=dirs)
rule r1:
input:
"practice_phased_reversed.vcf"
output:
"{f}/practice_phased_reversed.vcf"
shell:
"cp {input} {output}"
But I am still missing something. I need to be able to call this with a target, i.e. "snakemake practice_phased_reversed.vcf" Thanks for any help.

snakemake - output one only file from multiple input files in one rule

I'm using snakemake for the first time in order to build a basic pipeline using cutadapt, bwa and GATK (trimming ; mapping ; calling). I would like to run this pipeline on every fastq file contained in a directory, without having to specify their name or whatever in the snakefile or in the config file. I would like to succeed in doing this.
The first two steps (cutadapt and bwa / trimming and mapping) are running fine, but I'm encountering some problems with GATK.
First, I have to generate g.vcf files from bam files. I'm doing this using these rules:
configfile: "config.yaml"
import os
import glob
rule all:
input:
"merge_calling.g.vcf"
rule cutadapt:
input:
read="data/Raw_reads/{sample}_R1_{run}.fastq.gz",
read2="data/Raw_reads/{sample}_R2_{run}.fastq.gz"
output:
R1=temp("trimmed_reads/{sample}_R1_{run}.fastq.gz"),
R2=temp("trimmed_reads/{sample}_R2_{run}.fastq.gz")
threads:
10
shell:
"cutadapt -q {config[Cutadapt][Quality_value]} -m {config[Cutadapt][min_length]} -a {config[Cutadapt][forward_adapter]} -A {config[Cutadapt][reverse_adapter]} -o {output.R1} -p '{output.R2}' {input.read} {input.read2}"
rule bwa_map:
input:
genome="data/genome.fasta",
read=expand("trimmed_reads/{{sample}}_{pair}_{{run}}.fastq.gz", pair=["R1", "R2"])
output:
temp("mapped_bam/{sample}_{run}.bam")
threads:
10
params:
rg="#RG\\tID:{sample}\\tPL:ILLUMINA\\tSM:{sample}"
shell:
"bwa mem -t 2 -R '{params.rg}' {input.genome} {input.read} | samtools view -Sb - > {output}"
rule picard_sort:
input:
"mapped_bam/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"java -Xmx4g -jar /home/alexandre/picard-tools/picard.jar SortSam I={input} O={output} SO=coordinate VALIDATION_STRINGENCY=SILENT"
rule picard_rmdup:
input:
bam="sorted_reads/{sample}.bam"
output:
"rmduped_reads/{sample}.bam",
"picard_stats/{sample}.bam"
params:
reads="rmduped_reads/{sample}.bam",
stats="picard_stats/{sample}.bam",
shell:
"java -jar -Xmx2g /home/alexandre/picard-tools/picard.jar MarkDuplicates "
"I={input.bam} "
"O='{params.reads}' "
"VALIDATION_STRINGENCY=SILENT "
"MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 "
"REMOVE_DUPLICATES=TRUE "
"M='{params.stats}'"
rule samtools_index:
input:
"rmduped_reads/{sample}.bam"
output:
"rmduped_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule GATK_raw_calling:
input:
bam="rmduped_reads/{sample}.bam",
bai="rmduped_reads/{sample}.bam.bai",
genome="data/genome.fasta"
output:
"Raw_calling/{sample}.g.vcf",
shell:
"java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -ploidy 2 --emitRefConfidence GVCF -T HaplotypeCaller -R {input.genome} -I {input.bam} --genotyping_mode DISCOVERY -o {output}"
These rules work fine. For example, if I have the files :
Cla001d_S281_L001_R1_001.fastq.gz
Cla001d_S281_L001_R2_001.fastq.gz
I can create one bam file (Cla001d_S281_L001_001.bam) and from that bam file create a GVCF file (Cla001d_S281_L001_001.g.vcf). I have a lot of sample like this one, and I need to create one GVCF file for each, and then merge these GVCF files into one file. The problem is that I'm unable to give the list of the file to merge to the following rule:
rule GATK_merge:
input:
???
output:
"merge_calling.g.vcf"
shell:
"java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar "
"-T CombineGVCFs "
"-R data/genome.fasta "
"--variant {input} "
"-o {output}"
I tried several things in order to do that, but cannot succeed. The problem is the link between the two rules (GATK_raw_calling and GATK_merge that is supposed to merge the output of GATK_raw_calling). I can't output one single file if I'm specifying the output of GATK_raw_calling as the input of the following rule (Wildcards in input files cannot be determined from output files), and I'm unable to make a link between the two rules if I'm not specifying these files as an input...
Is there a way to succeed in doing that? The difficulty is that I'm not defining a list of names or whatever, I think.
Thanks you in advance for your help.
You can try to generate a list of sample IDs using glob_wildcards on the initial fastq.gz files:
sample_ids, run_ids = glob_wildcards("data/Raw_reads/{sample}_R1_{run}.fastq.gz")
Then, you can use this to expand the input of GATK_merge:
rule GATK_merge:
input:
expand("Raw_calling/{sample}_{run}.g.vcf",
sample=sample_ids, run=run_ids)
If the same run ID always come with the same sample ID, you will need to zip instead of expanding, in order to avoid non-existing combinations:
rule GATK_merge:
input:
["Raw_calling/{sample}_{run}.g.vcf".format(
sample=sample_id,
run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)]
You can achieve this by using a python function as an input for your rule, as described in the snakemake documentation here.
Could look like this for example:
# Define input files
def gatk_inputs(wildcards):
files = expand("Raw_calling/{sample}.g.vcf", sample=<samples list>)
return files
# Rule
rule gatk:
input: gatk_inputs
output: <output file name>
run: ...
Hope this helps.