How to read the config.yaml file and feed it in snakemake - snakemake

I have a ref.fasta file that contains scaffolds. In order to parallelise the variant calling process I grouped scaffolds by chromosomes and created a config.yaml file as below:
samples:
chr1A: scaffold26096,scaffold40476
chr1B: scaffold11969,scaffold83281,scaffold43483
chr1D: scaffold4701,scaffold102360
And a script as below.
configfile: "config.yaml"
rule all:
input:
expand("scaffolds/{sample}.vcf", sample=config["samples"])
rule gatk:
input:
"/path/to/ref.fasta",
"/path/to/bam.list",
lambda wildcards: config["samples"][wildcards.sample]
output:
outf ="scaffolds/{sample}.vcf"
shell:
"""
/Tools/gatk/gatk --java-options "-Xmx16g -XX:ParallelGCThreads=10" HaplotypeCaller -L {input[2]} -R {input[0]} -I {input[1]} -O {output.outf}
"""
I would like to get results as chr1A.vcf, chr1B.vcf and chr1D.vcf.
This is giving me an error:
Missing input files for rule gatk:
scaffold4701,scaffold102360
What is wrong?

I think your yaml file does not contain the data you think it does. I guess you want each chromosome to contain a list of scaffolds, like:
samples:
chr1A:
- scaffold26096
- scaffold40476
chr1B:
- scaffold11969
- scaffold83281
- scaffold43483
chr1D:
- scaffold4701
- scaffold102360
(I haven't checked the rest of your code)

Related

Snakemake using the first argument of a list as a wildcard

I am trying to run the analysis in snakemake where as a proband I take always the first bam file present in the list, i.e NUM_194 and NUM_123. Is there a way to use as a wildcards the first IDENTIFIER of the bam file of the list(d) in the proband line?
d = {"FAM_194": ["path/to/NUM_194/NUM_194.bam", "path/to/NUM_195/NUM_195.bam", "path/to/NUM_196/NUM_196.bam"],
"FAM_123": ["path/to/NUM_123/NUM_123.bam", "path/to/NUM_126/NUM_126.bam", "path/to/NUM_127/NUM_127.bam"]}
FAMILIES = list(d)
rule all:
input:
...
wildcard_constraints:
family = "|".join(FAMILIES)
.....some other rules
rule SelectVariants:
input:
invcf="{fam}/{fam}.vcf"
params:
ref="myref.fasta"
output:
out="{fam}/{fam}.proband.vcf",
out2="{fam}/{fam}.p.avinput"
shell:
"""
proband=NUM_194 <--- the first sample of the list(d), for example NUM_194
gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn "$proband" -O {output.out}
convert2annovar -format vcf4 --includeinfo {output.out} > {output.out2}
"""
Maybe using a function as input (lambda function here) like this?
rule SelectVariants:
input:
invcf="{fam}/{fam}.vcf",
proband= lambda wc: d[wc.fam][0],
...
shell:
"""
gatk ... -sn {input.proband} ...
"""

use directories or all files in directories as input in snakemake

I am new to snakemake. I want to use directories or all files in directories as input in snakemake. For example, two directories with different no. of bam files,
--M1
M1-1.bam
M1-2.bam
--M2
M2-3.bam
M2-5.bam
I just want to merge M1-1.bam, M1-2.bam to M1.bam; M2-3.bam, M2-5.bam to M2.bam; I tried to use wildcards and expand followd by this and this, and the code is as follows,
config.yaml
SAMPLES:
M1:
- 1
- 2
M2:
- 3
- 5
rawdata: path/to/rawdata
outpath: path/to/output
reference: path/to/reference
snakemake file
configfile:"config.yaml"
SAMPLES=config["SAMPLES"]
REFERENCE=config["reference"]
RAWDATA=config["rawdata"]
OUTPATH=config["outpath"]
ALL_INPUT = []
for key, values in SAMPLES.items():
ALL_INPUT.append(f"Map/bwa/merge/{key}.bam")
ALL_INPUT.append(f"Map/bwa/sort/{key}.sort.bam")
ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.bam")
ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.matrix")
ALL_INPUT.append(f"SNV/Mutect2/result/{key}.vcf.gz")
ALL_INPUT.append(f"Map/bwa/result/{key}")
for value in values:
ALL_INPUT.append(f"Map/bwa/result/{key}/{key}-{value}.bam")
for num in {1,2}:
ALL_INPUT.append(f"QC/fastp/{key}/{key}-{value}.R{num}.fastq.gz")
rule all:
input:
expand("{outpath}/{all_input}",all_input=ALL_INPUT,outpath=OUTPATH)
rule fastp:
input:
r1= RAWDATA + "/{key}-{value}.R1.fastq.gz",
r2= RAWDATA + "/{key}-{value}.R2.fastq.gz"
output:
a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
params:
prefix="{outpath}/QC/fastp/{key}/{key}-{value}"
shell:
"""
fastp -i {input.r1} -I {input.r2} -o {output.a1} -O {output.a2} -j {params.prefix}.json -h {params.prefix}.html
"""
rule bwa:
input:
a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
output:
o1="{outpath}/Map/bwa/result/{key}/{key}-{value}.bam"
params:
mem="4000",
rg="#RG\\tID:{key}\\tPL:ILLUMINA\\tSM:{key}"
shell:
"""
bwa mem -t {threads} -M -R '{params.rg}' {REFERENCE} {input.a1} {input.a2} | samtools view -b -o {output.o1}
"""
## get sample index from raw fastq
key_ids,value_ids = glob_wildcards(RAWDATA + "/{key}-{value}.R1.fastq.gz")
# remove duplicate sample name, and this is useful when there is only one sample input
key_ids = list(set(key_ids))
rule merge:
input:
expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
output:
"{outpath}/Map/bwa/merge/{key}.bam"
shell:
"""
samtools merge {output} {input}
"""
The {input} in merge command will be
M1-1.bam M1-2.bam M1-3.bam M1-5.bam M2-1.bam M2-2.bam M2-3.bam M2-5.bam
Actually, for M1 sample, the {input} should be M1-1.bam M1-2.bam; for M2, M2-3.bam M2-5.bam. I also read this, but I have no idea if there are lots of directories with different files each.
Then I tried to use directories as input, for merge rule
rule mergebam:
input:
"{outpath}/Map/bwa/result/{key}"
output:
"{outpath}/Map/bwa/merge/{key}.bam"
log:
"{outpath}/Map/bwa/log/{key}.merge.bam.log"
shell:
"""
samtools merge {output} `ls {input}/*bam` > {log} 2>&1
"""
But this give me MissingInputException error
Missing input files for rule merge:
/{outpath}/Map/bwa/result/M1
Any idea will be appreciated.
I haven't fully parsed your question but I'll give it a shot anyway... In rule merge you have:
expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
This means that you collect all combinations of outpath, key and value.
Presumably you want all combinations of value within each outpath and key. So use:
expand("{{outpath}}/Map/bwa/result/{{key}}/{{key}}-{value}.bam", value=value_ids)
if you change your config.yaml to the following, can you make the implementation easier by using expand?
SAMPLES:
M1:
- M1-1
- M2-2
M2:
- M2-3
- M2-5

Snakemake: MissingInputException with inconsistent naming scheme

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error telling me, I just understand why its being its not trying to make the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than barcode in the directory. However since I know what barcode I am looking for I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2 but I do not know how to correct this problem as there can be multiple outputs from porechop.
I thought I was building a path for the input file for the minimap2 rule and when snakemake discovered that the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
input:
expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
directory("{sample}_strict/")
shell:
"porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
input:
lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
'8413_19': relabeled_reads/8413_19.raw.fastq.gz,
'8417_19': relabeled_reads/8417_19.raw.fastq.gz,
'8445_19': relabeled_reads/8445_19.raw.fastq.gz,
'8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
'8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
'8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
'8413_19': BC01.fastq.gz,
'8417_19': BC02.fastq.gz,
'8445_19': BC03.fastq.gz,
'8466_19_104': BC04.fastq.gz,
'8466_19_105': BC05.fastq.gz,
'8467_20': BC06.fastq.gz,
}
First of all, you can always debug the problems like that specifying the flag --printshellcmds. That would print all shell commands that Snakemake runs under the hood; you may try to run them manually and locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
So I am not 100% sure why this way works, I imagine it has to do with the way snakemake looks at the targets however here is the solution I found for it.
rule minimap2:
input:
"{sample}_strict"
params:
suffix=lambda wildcards: config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../consensus.fasta\
{input}/{params.suffix} | samtools sort -o {output}"
by using the params feature in snakemake I was able to match up the correct barcode to the sample name. I am not sure why I could just do that as the input itself, but when I returned the input to the match the output of the previous rule it works.

How to pass variable value as input in snakemake?

I want to download the fastq files from SRA database using SRR ID using Snakemake. I read a file to get SRR ID using python code.
I want to parse the Variable one by one as input. My code is below.
I want to run command
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]
rule all:
input:
expand("fastq/{sample}.fastq.log",sample=SAMPLES)
rule download_fastq:
input:
"{sample}"
output:
"fastq/{sample}.fastq.log"
shell:
"fastq-dump {input} > {output}"
Skip input and just call the wildcard in shell command. input needs to be a filepath that needs to already exist or be created as part of the pipeline - neither are true in your case.
rule download_fastq:
output:
"fastq/{sample}.fastq.log"
shell:
"fastq-dump {wildcards.sample} > {output}"

Snakemake run rule with wildcarded output of previous rules multiple times

I have multiple studies and I must make two files (a .notsad and .txt file) for each of the n number of studies. After these are created, I must run a command which runs per chromosome and uses the same two input files (.notsad, .txt) for each chromosome within a given study. So:
mycommand.py study1.notsad study1_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
mycommand.py study1.notsad study1_filter.txt chr2.bad.gz --out chr2_filter.bad.gz
...
mycommand.py study2.notsad study2_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
...
However Im having trouble getting this to run. Im getting an error:
WildcardError in line 33 of /scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'ds_lower'
My rules so far:
import os
import glob
ROOT = "/rootdir/"
ORIGINAL_DATA_FOLDER="original/"
PROCESS_DATA_FOLDER="process/"
ORIGINAL_DATA_SOURCE=ROOT+ORIGINAL_DATA_FOLDER
PROCESS_DATA_SOURCE=ROOT+PROCESS_DATA_FOLDER
DATASETS = [name for name in os.listdir(ORIGINAL_DATA_SOURCE) if os.path.isdir(os.path.join(ORIGINAL_DATA_SOURCE, name))]
LOWERCASE_DATASETS = [dataset.lower() for dataset in DATASETS]
CHROMOSOME = list(range(1,23))
rule all:
input:
expand(PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz", ds=DATASETS, chr=CHROMOSOME)
rule run_command:
input:
ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.bad.gz", # Matches 22 chroms
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt", # But this should be common to all chr runs for this study.
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad" # This one as well.
output:
PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
shell:
# Run command that uses each of the previous files and runs per chromosome
"mycommand.py {input.2} {input.1} {input.0} --out {output}"
rule write_txt_file:
input:
ORIGINAL_DATA_SOURCE+"{ds}/{ds_lower}_info.txt"
output:
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt"
shell:
"touch {output}"
rule write_notsad_file:
input:
ORIGINAL_DATA_SOURCE+"{ds}/_{ds_lower}.sad"
output:
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad"
shell:
"touch {output}"
UPDATE
Changing inputs for rule run_command to lambda functions did work.
rule run_command:
input:
ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.gen.gz",
lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}_filter.txt",
lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}.sample"
output:
PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
run:
# Run command that uses each of the previous files and runs per chromosome
"mycommand.py {input.2} {input.1} {input.0} --out {output}"
All the wildcards used in input need to be present in output. In rule run_command, wildcard {ds_lower} is present only in input but not in output.