Snakemake using the first argument of a list as a wildcard

I am trying to run an analysis in Snakemake in which the proband is always the first BAM file in each family's list, i.e. NUM_194 and NUM_123 below. Is there a way to use the identifier of the first BAM file in the list (taken from d) in the proband line?
d = {"FAM_194": ["path/to/NUM_194/NUM_194.bam", "path/to/NUM_195/NUM_195.bam", "path/to/NUM_196/NUM_196.bam"],
     "FAM_123": ["path/to/NUM_123/NUM_123.bam", "path/to/NUM_126/NUM_126.bam", "path/to/NUM_127/NUM_127.bam"]}
FAMILIES = list(d)
rule all:
    input:
        ...
wildcard_constraints:
    family = "|".join(FAMILIES)
.....some other rules
rule SelectVariants:
    input:
        invcf="{fam}/{fam}.vcf"
    params:
        ref="myref.fasta"
    output:
        out="{fam}/{fam}.proband.vcf",
        out2="{fam}/{fam}.p.avinput"
    shell:
        """
        proband=NUM_194 <--- the first sample of the list(d), for example NUM_194
        gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn "$proband" -O {output.out}
        convert2annovar -format vcf4 --includeinfo {output.out} > {output.out2}
        """

Maybe using a function as input (lambda function here) like this?
rule SelectVariants:
    input:
        invcf="{fam}/{fam}.vcf",
        proband= lambda wc: d[wc.fam][0],
    ...
    shell:
        """
        gatk ... -sn {input.proband} ...
        """

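One caveat with the suggestion above: d[wc.fam][0] is the full path to the BAM file, while -sn expects a sample name. Here is a minimal sketch of deriving the identifier from the path, assuming the BAM file name (without .bam) matches the sample name in the VCF. proband_id is a hypothetical helper, not part of Snakemake.

```python
from pathlib import Path

# Mapping copied from the question: family -> BAM paths, proband listed first.
d = {
    "FAM_194": ["path/to/NUM_194/NUM_194.bam", "path/to/NUM_195/NUM_195.bam", "path/to/NUM_196/NUM_196.bam"],
    "FAM_123": ["path/to/NUM_123/NUM_123.bam", "path/to/NUM_126/NUM_126.bam", "path/to/NUM_127/NUM_127.bam"],
}


def proband_id(family, families=d):
    """Return the identifier of the first BAM listed for `family` (hypothetical helper)."""
    first_bam = families[family][0]
    # Path.stem drops the directories and the final ".bam" suffix.
    return Path(first_bam).stem


print(proband_id("FAM_194"))
print(proband_id("FAM_123"))
```

In the Snakefile this could probably go through params rather than input, e.g. proband=lambda wc: proband_id(wc.fam), used in the command as -sn {params.proband}.
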
Related

Snakemake input two variables and output one variable

I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. My problem is the second variable, S: I do not know how to specify it in the rule, because it does not appear in my output filenames.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
        expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
    input:
        fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
        fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
    output:
        fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
        fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
    shell:
        """
        mv {input.fastq1} {output.fastq1}
        mv {input.fastq2} {output.fastq2}
        """
There are several problems in your code. First, the {dir} in your output and the {dir} in your input are two different things: the {dir} in the output is a wildcard, while the {dir} in the input is a parameter for the expand function (and, as the second problem, you forgot to actually call that function).
The third point is about style rather than correctness. A shell section can run several commands, and you may try mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, but this is not an idiomatic solution. It is much better to write a rule that produces a single file and let Snakemake do the rest of the work.
Finally, the S value depends entirely on the DIR value, so it is a function of {dir}, and that can be handled with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
    input:
        lambda wildcards:
            "fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
    output:
        "NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
    shell:
        """
        mv {input} {output}
        """

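The lookup done by the lambda can be checked outside Snakemake. This is only a sketch of the same idea written as a named function that fills in every wildcard explicitly. rename_input is a hypothetical name, and SimpleNamespace is used here as a stand-in for the wildcards object that Snakemake would pass.

```python
from types import SimpleNamespace

# Sample name -> sequencer sample tag, as in the answer above.
dir2s = {"BOB": "S1", "JOHN": "S2"}


def rename_input(wildcards):
    """Build the original fastq path for one (dir, r) pair (hypothetical helper)."""
    s = dir2s[wildcards.dir]  # S is fully determined by dir
    return f"fastq/NAME-{wildcards.dir}_{s}_L001_{wildcards.r}_001.fastq.gz"


# Stand-in for the wildcards of the output NAME_BOB/reads/NAME_BOB.R1.fastq.gz
wc = SimpleNamespace(dir="BOB", r="R1")
print(rename_input(wc))
```

Written this way, the function could be referenced directly in the rule as input: rename_input.
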
use directories or all files in directories as input in snakemake

I am new to snakemake. I want to use directories or all files in directories as input in snakemake. For example, two directories with different no. of bam files,
--M1
M1-1.bam
M1-2.bam
--M2
M2-3.bam
M2-5.bam
I just want to merge M1-1.bam and M1-2.bam into M1.bam, and M2-3.bam and M2-5.bam into M2.bam. I tried to use wildcards and expand, following this and this, and the code is as follows:
config.yaml
SAMPLES:
    M1:
        - 1
        - 2
    M2:
        - 3
        - 5
rawdata: path/to/rawdata
outpath: path/to/output
reference: path/to/reference
snakemake file
configfile:"config.yaml"
SAMPLES=config["SAMPLES"]
REFERENCE=config["reference"]
RAWDATA=config["rawdata"]
OUTPATH=config["outpath"]
ALL_INPUT = []
for key, values in SAMPLES.items():
    ALL_INPUT.append(f"Map/bwa/merge/{key}.bam")
    ALL_INPUT.append(f"Map/bwa/sort/{key}.sort.bam")
    ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.bam")
    ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.matrix")
    ALL_INPUT.append(f"SNV/Mutect2/result/{key}.vcf.gz")
    ALL_INPUT.append(f"Map/bwa/result/{key}")
    for value in values:
        ALL_INPUT.append(f"Map/bwa/result/{key}/{key}-{value}.bam")
        for num in {1,2}:
            ALL_INPUT.append(f"QC/fastp/{key}/{key}-{value}.R{num}.fastq.gz")
rule all:
    input:
        expand("{outpath}/{all_input}",all_input=ALL_INPUT,outpath=OUTPATH)
rule fastp:
    input:
        r1= RAWDATA + "/{key}-{value}.R1.fastq.gz",
        r2= RAWDATA + "/{key}-{value}.R2.fastq.gz"
    output:
        a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
        a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
    params:
        prefix="{outpath}/QC/fastp/{key}/{key}-{value}"
    shell:
        """
        fastp -i {input.r1} -I {input.r2} -o {output.a1} -O {output.a2} -j {params.prefix}.json -h {params.prefix}.html
        """
rule bwa:
    input:
        a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
        a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
    output:
        o1="{outpath}/Map/bwa/result/{key}/{key}-{value}.bam"
    params:
        mem="4000",
        rg="#RG\\tID:{key}\\tPL:ILLUMINA\\tSM:{key}"
    shell:
        """
        bwa mem -t {threads} -M -R '{params.rg}' {REFERENCE} {input.a1} {input.a2} | samtools view -b -o {output.o1}
        """
## get sample index from raw fastq
key_ids,value_ids = glob_wildcards(RAWDATA + "/{key}-{value}.R1.fastq.gz")
# remove duplicate sample name, and this is useful when there is only one sample input
key_ids = list(set(key_ids))
rule merge:
    input:
        expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
    output:
        "{outpath}/Map/bwa/merge/{key}.bam"
    shell:
        """
        samtools merge {output} {input}
        """
The {input} in the merge command will then be
M1-1.bam M1-2.bam M1-3.bam M1-5.bam M2-1.bam M2-2.bam M2-3.bam M2-5.bam
For sample M1, however, the {input} should be M1-1.bam M1-2.bam, and for M2 it should be M2-3.bam M2-5.bam. I also read this, but I do not see how to apply it when there are many directories, each with a different set of files.
Then I tried to use directories as input for the merge rule:
rule mergebam:
    input:
        "{outpath}/Map/bwa/result/{key}"
    output:
        "{outpath}/Map/bwa/merge/{key}.bam"
    log:
        "{outpath}/Map/bwa/log/{key}.merge.bam.log"
    shell:
        """
        samtools merge {output} `ls {input}/*bam` > {log} 2>&1
        """
But this gives me a MissingInputException:
Missing input files for rule merge:
/{outpath}/Map/bwa/result/M1
Any ideas would be appreciated.
I haven't fully parsed your question but I'll give it a shot anyway... In rule merge you have:
expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
This means that you collect all combinations of outpath, key and value.
Presumably you want all combinations of value within each outpath and key. So use:
expand("{{outpath}}/Map/bwa/result/{{key}}/{{key}}-{value}.bam", value=value_ids)
If you change your config.yaml to the following, could you simplify the implementation by using expand?
SAMPLES:
    M1:
        - M1-1
        - M1-2
    M2:
        - M2-3
        - M2-5
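Note that value_ids from glob_wildcards holds the values of every sample, so expanding over it still pairs M1 with 3 and 5. A minimal sketch of looking the values up per key instead, assuming SAMPLES has the shape of the original config.yaml from the question. merge_inputs is a hypothetical helper name.

```python
# Shape of config["SAMPLES"] and config["outpath"] in the question's config.yaml.
SAMPLES = {"M1": [1, 2], "M2": [3, 5]}
OUTPATH = "path/to/output"


def merge_inputs(key, outpath=OUTPATH, samples=SAMPLES):
    """List only the BAM files that belong to `key` (hypothetical helper)."""
    return [
        f"{outpath}/Map/bwa/result/{key}/{key}-{value}.bam"
        for value in samples[key]
    ]


for key in SAMPLES:
    print(key, merge_inputs(key))
```

In rule merge this could be used as input: lambda wc: merge_inputs(wc.key, wc.outpath), so that each {key} only requests its own files.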

How to read the config.yaml file and feed it in snakemake

I have a ref.fasta file that contains scaffolds. In order to parallelise the variant calling process I grouped scaffolds by chromosomes and created a config.yaml file as below:
samples:
    chr1A: scaffold26096,scaffold40476
    chr1B: scaffold11969,scaffold83281,scaffold43483
    chr1D: scaffold4701,scaffold102360
And a script as below.
configfile: "config.yaml"
rule all:
    input:
        expand("scaffolds/{sample}.vcf", sample=config["samples"])
rule gatk:
    input:
        "/path/to/ref.fasta",
        "/path/to/bam.list",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        outf ="scaffolds/{sample}.vcf"
    shell:
        """
        /Tools/gatk/gatk --java-options "-Xmx16g -XX:ParallelGCThreads=10" HaplotypeCaller -L {input[2]} -R {input[0]} -I {input[1]} -O {output.outf}
        """
I would like to get results as chr1A.vcf, chr1B.vcf and chr1D.vcf.
This is giving me an error:
Missing input files for rule gatk:
scaffold4701,scaffold102360
What is wrong?
I think your yaml file does not contain the data you think it does. I guess you want each chromosome to contain a list of scaffolds, like:
samples:
    chr1A:
        - scaffold26096
        - scaffold40476
    chr1B:
        - scaffold11969
        - scaffold83281
        - scaffold43483
    chr1D:
        - scaffold4701
        - scaffold102360
(I haven't checked the rest of your code)
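A further point, offered as an assumption since the rest of the workflow was not checked: anything returned under input: is treated as a file path, so scaffold names placed there are likely to be reported as missing files whichever YAML layout is used. One option is to build the -L arguments as a string and pass it through params. interval_args is a hypothetical helper, and this sketch accepts either the comma-separated string or the list form of the config.

```python
# Stand-in for the parsed config.yaml (list form, as suggested above).
config = {
    "samples": {
        "chr1A": ["scaffold26096", "scaffold40476"],
        "chr1B": ["scaffold11969", "scaffold83281", "scaffold43483"],
        "chr1D": ["scaffold4701", "scaffold102360"],
    }
}


def interval_args(sample, cfg=config):
    """Return one '-L <scaffold>' argument per scaffold of `sample` (hypothetical helper)."""
    scaffolds = cfg["samples"][sample]
    if isinstance(scaffolds, str):
        # Original layout: a single comma-separated string.
        scaffolds = [name.strip() for name in scaffolds.split(",")]
    return " ".join(f"-L {name}" for name in scaffolds)


print(interval_args("chr1D"))
```

In the rule this would be something like params: intervals=lambda wildcards: interval_args(wildcards.sample), with {params.intervals} replacing -L {input[2]} in the command.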

Snakemake tries to run rule, reason: Missing output files, but files are temporary

I have a series of rules leading into vsearch, with the barebones shown here:
rule vsearch:
    input:
        "{barcode_number}.nanofilt.fastq"
    output:
        sam_output_file = "{barcode_number}.vsearch.txt",
        fasta_input_file = temp("vsearch/{barcode_number}.vsearch.input.fasta")
    params:
        reference_file = config['alignment_reference_file']
    shell:
        "seqkit fq2fa {input} > {output.fasta_input_file}"
        " && "
        "vsearch "
        "--usearch_global "
        "{output.fasta_input_file} "
        "--id 0 "
        "--quiet "
        "--db {params.reference_file} "
        "--samout {output.sam_output_file}"
The rule works as expected, creating temporary files (barcode##.vsearch.input.fasta, where ## are simply numbers), running vsearch on these temp files, and deleting them afterwards. However, when performing a dry run with the workflow and including --reason, snakemake gives the following:
reason: Missing output files: /vsearch/barcode##.vsearch.input.fasta
This happens for every file (about 80 total).
Have I missed something with the temp() flag? Or how can I tell Snakemake that I do not need these output files, since they exist only to convert from .fastq to .fasta?
Thank you for any help.
I would move seqkit fq2fa to its own rule producing the temp file needed by vsearch. Like this (not tested):
rule fq2fa:
    input:
        "{barcode_number}.nanofilt.fastq",
    output:
        temp("vsearch/{barcode_number}.vsearch.input.fasta"),
    shell:
        r"""
        seqkit fq2fa {input} > {output}
        """
rule vsearch:
    input:
        "vsearch/{barcode_number}.vsearch.input.fasta",
    output:
        sam_output_file = "{barcode_number}.vsearch.txt",
    params:
        reference_file = config['alignment_reference_file']
    shell:
        r"""
        vsearch \
            --usearch_global \
            {input} \
            --id 0 \
            --quiet \
            --db {params.reference_file} \
            --samout {output.sam_output_file}
        """
In my opinion this is cleaner.
If you want to put seqkit and vsearch in the same rule you could do:
rule vsearch:
    input:
        "{barcode_number}.nanofilt.fastq"
    output:
        sam_output_file = "{barcode_number}.vsearch.txt",
    params:
        reference_file = config['alignment_reference_file']
    shell:
        # vsearch/ is not an output directory here, so create it explicitly
        "mkdir -p vsearch"
        " && "
        "seqkit fq2fa {input} > vsearch/{wildcards.barcode_number}.vsearch.input.fasta"
        " && "
        "vsearch "
        "--usearch_global "
        "vsearch/{wildcards.barcode_number}.vsearch.input.fasta "
        "--id 0 "
        "--quiet "
        "--db {params.reference_file} "
        "--samout {output.sam_output_file}"
        " && "
        "rm vsearch/{wildcards.barcode_number}.vsearch.input.fasta"
The way you wrote it, Snakemake reruns the vsearch rule because the FASTA file is declared as an output: once it has been deleted, whether by temp() or by some other means, Snakemake sees a missing output and schedules the rule again.
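When a shell command is written as several adjacent string literals, as in the single-rule variant, keep in mind that Python joins them with nothing in between, so every separator (a space or " && ") has to be written out. A small sketch of the effect; the file names are made up for illustration:

```python
# Adjacent string literals are concatenated by Python with no separator.
broken = (
    "seqkit fq2fa in.fastq > tmp.fasta"
    "vsearch --usearch_global tmp.fasta"
)

# An explicit " && " literal keeps the two commands apart.
fixed = (
    "seqkit fq2fa in.fastq > tmp.fasta"
    " && "
    "vsearch --usearch_global tmp.fasta"
)

print(broken)
print(fixed)
```

The first string ends up redirecting into a file called tmp.fastavsearch, which is why the separators matter.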

snakemake input function strange results

I have this rule, which is meant to compare each case against its own control:
CASE1,CONTROL1
CASE2,CONTROL2
CASE3,CONTROL3
rule macs2:
    input: get_files
    output: "ALIGN/result/macs2/{case}_vs_{control}/",
            "ALIGN/result/macs2/{case}_vs_{control}/{case}_peaks.xls",
            "ALIGN/result/macs2/{case}_vs_{control}/{case}_summits.bed"
    log: "log/{case}_vs_{control}.macs2"
    threads: 2
    conda:
        "envs/macs.yaml"
    message: "macs2 comparison"
    params:
        size="hs",
        name="{case}"
    shell:
        """
        macs2 callpeak -t {input[0]} -c {input[1]} -f BAM -g hs -n {params.name} --nomodel -B -q 0.01 --outdir {output[0]} -m 3 50 -g {params.size} --extsize 147 2>{log}
        """
So this is the function:
def get_files(wildcards):
    case = wildcards.case
    control = aCondition[case][0]
    return ["ALIGN/result/{}_filter_dup.bam".format(case), "ALIGN/result/{}_filter_dup.bam".format(control)]
With the following line in rule all, I get the comparison of every case with every control:
expand("ALIGN/result/macs2/{case}_vs_{control}/",case=CASE,control=CONTROL),
Example:
CASE1,CONTROL2
CASE1,CONTROL3
CASE1,CONTROL1
...
I expect to have only
CASE1,CONTROL1
CASE2,CONTROL2
CASE3,CONTROL3
In general: how can I write rule all when the rule uses an input function?
I'm not sure what you are asking. Did you try making a rule that asks for the output files you are interested in?
rule all:
    input:
        expand(
            "ALIGN/result/macs2/{case}_vs_{control}/{case}_peaks.xls",
            case=["case1", "case2", "case3"],
            control=["control1", "control2", "control3"]
        )
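Keep in mind that expand combines its keyword lists as a full product by default, which is what produces CASE1 vs CONTROL2 and so on. The difference between the product and a pairwise zip can be seen in plain Python. This is only an illustration using str.format; the CASE and CONTROL values are assumed from the pairs listed in the question.

```python
from itertools import product

# Values assumed from the pairs in the question.
CASE = ["CASE1", "CASE2", "CASE3"]
CONTROL = ["CONTROL1", "CONTROL2", "CONTROL3"]
pattern = "ALIGN/result/macs2/{case}_vs_{control}/{case}_peaks.xls"

# Every case against every control (what a plain expand asks for).
all_combinations = [
    pattern.format(case=c, control=k) for c, k in product(CASE, CONTROL)
]

# Each case against the control at the same position.
paired = [pattern.format(case=c, control=k) for c, k in zip(CASE, CONTROL)]

print(len(all_combinations), len(paired))
for path in paired:
    print(path)
```

In a Snakefile the pairwise behaviour is requested by passing zip as the second argument, e.g. expand(pattern, zip, case=CASE, control=CONTROL).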