snakemake - Missing input files for rule all - snakemake

I am trying to create a pipeline that will take a user-configured directory in config.yml (where they have downloaded a project directory of .fastq.gz files from BaseSpace), to run fastqc on sequence files. I already have the downstream steps of merging the fastqs by lane and running fastqc on the merged files.
However, the wildcards are giving me problems running fastqc on the original basespace files. The following is my error when I try running snakemake.
Missing input files for rule all:
qc/fastqc_premerge/DEX-13_S9_L001_ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip
qc/fastqc_premerge/BOMB-3-2-19D_S8_L002_ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e_r1_fastqc.zip
qc/fastqc_premerge/DEX-13_S9_L002_ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59_r1_fastqc.zip
Any suggestions would be greatly appreciated. Below is minimal code to reproduce this problem.
import glob
configfile: "config.yaml"
wildcard_constraints:
bsdir = '\w+_L\d+_ds.\w+',
lanenum = '\d+'
inputdirectory=config["directory"]
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R1_001.fastq.gz")
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R2_001.fastq.gz")
##### target rules #####
rule all:
input:
#expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS) ##Changed to this from commenters suggestion, however, snakemake still wont run
rule fastqc_premerge_r1:
input:
f"{config['directory']}/{{bsdir}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
Directory structure:
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R1_001.fastq.gz
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R2_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R1_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R2_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R1_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R2_001.fastq.gz
In this above case, I would like to run fastqc on all 6 input R1/R2 files, then downstream, create a merged file for DEX_13_S9 (for the two inputs to merge) and BOMB-3_2_19D (which will be a copy of the 1 input). Then create 4 fastqc reports on these resulting R1 and R2 files.
EDIT: I had to change the following to get snakemake to run
inputdirectory=config["directory"]
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R1_001.fastq.gz", followlinks=True)
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R2_001.fastq.gz", followlinks=True)
##### target rules #####
rule all:
input:
"qc/multiqc_report_premerge.html"
rule fastqc_premerge_r1:
input:
f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
rule fastqc_premerge_r2:
input:
f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R2_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
rule multiqc_pre:
input:
expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS),
expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS)
output:
"qc/multiqc_report_premerge.html"
log:
"logs/multiqc_premerge.log"
wrapper:
"0.62.0/bio/multiqc"

In your rule all you have:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
This should generate all combinations of SAMPLES, DIRECTORY, and LANENUMS. Is this what you want? I suspect not since it means that all samples are in all directories and they are on all lanes. Maybe you want the zip function to expand the list:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)

It's telling you what files are missing; that's what the lines under "missing input files for rule all" are.
That being said, to answer your original question, if you do a dry run, that should tell you what the input/output files are for each planned rule you want to run (use flags -n -r) in your run command.

Related

Workflow always results in "Nothing to do" even when forcing rules

So as the title says I can't bring my workflow to execute anything, except the all rule...
When Executing the all rule it correctly finds all the input files, so the configfile is okay, every path is correct.
when trying to run without additional tags I get
Building DAG of jobs...
Checking status of 0 jobs.
Nothing to be done
Things I tried:
-f rcorrector -> only all rule
filenameR1.fcor_val1.fq -> MissingRuleException (No Typos)
--forceall -> only all rule
Some more fiddling I can't formulate clearly
please Help
from os import path
configfile:"config.yaml"
RNA_DIR = config["RAW_RNA_DIR"]
RESULT_DIR = config["OUTPUT_DIR"]
FILES = glob_wildcards(path.join(RNA_DIR, '{sample}R1.fastq.gz')).sample
############################################################################
rule all:
input:
r1=expand(path.join(RNA_DIR, '{sample}R1.fastq.gz'), sample=FILES),
r2=expand(path.join(RNA_DIR, '{sample}R2.fastq.gz'), sample=FILES)
#############################################################################
rule rcorrector:
input:
r1=path.join(RNA_DIR, '{sample}R1.fastq.gz'),
r2=path.join(RNA_DIR, '{sample}R2.fastq.gz')
output:
o1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
o2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
#group: "cleaning"
threads: 8
params: "-t {threads}"
envmodules:
"bio/Rcorrector/1.0.4-foss-2019a"
script:
"scripts/Rcorrector.py"
############################################################################
rule FilterUncorrectabledPEfastq:
input:
r1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
r2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
output:
o1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
o2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
#group: "cleaning"
envmodules:
"bio/Jellyfish/2.2.6-foss-2017a",
"lang/Python/2.7.13-foss-2017a"
#TODO: load as module
script:
"/scripts/filterUncorrectable.py"
#############################################################################
rule trim_galore:
input:
r1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
r2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
output:
o1=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val1.fq"),
o2=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val2.fq")
threads: 8
#group: "cleaning"
envmodules:
"bio/Trim_Galore/0.6.5-foss-2019a-Python-3.7.4"
params:
"--paired --retain_unpaired --phred33 --length 36 -q 5 --stringency 1 -e 0.1 -j {threads}"
script:
"scripts/trim_galore.py"
In snakemake, you define final output files of the pipeline as target files and define them as inputs in first rule of the pipeline. This rule is traditionally named as all (more recently as targets in snakemake doc).
In your code, rule all specifies input files of the pipeline, which already exists, and therefore snakemake doesn't see anything to do. It just instead needs to specify output files of interest from the pipeline.
rule all:
input:
expand(path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val{read}.fq"), sample=FILES, read=[1,2]),
Why your attempted methods didn't work?
-f not working:
As per doc:
--force, -f
Force the execution of the selected target or the first rule regardless of already created output.
Default: False
In your code, this means rule all, which doesn't have output defined, and therefore nothing happened.
filenameR1.fcor_val1.fq
This doesn't match output of any of the rules and therefore the error MissingRuleException.
--forceall
Same reasoning as that for -f flag in your case.
--forceall, -F
Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output.
Default: False

Snakemake shell command should only take one file at a time, but it's trying to do multiple files at once

First off, I'm sorry if I'm not explaining my problem clearly, English is not my native language.
I'm trying to make a snakemake rule that takes a fastq file and filters it with a program called Filtlong. I have multiple fastq files on which I want to run this rule and it should output a filtered file per fastq file but apparently it takes all of the fastq files as input for a single Filtlong command.
The fastq files are in separate directories and the snakemake rule should write the filtered files to separate directories aswell.
This is how my code looks right now:
from os import listdir
configfile: "config.yaml"
DATA = config["DATA"]
SAMPLES = listdir(config["RAW_DATA"])
RAW_DATA = config["RAW_DATA"]
FILT_DIR = config["FILTERED_DIR"]
rule all:
input:
expand("{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz", FILT_DIR=FILT_DIR, sample=SAMPLES)
rule filter_reads:
input:
expand("{RAW_DATA}/{sample}/{sample}.fastq", sample=SAMPLES, RAW_DATA=RAW_DATA)
output:
"{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz"
shell:
"filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"
And this is the config file:
DATA:
all_samples
RAW_DATA:
all_samples/raw_samples
FILTERED_DIR:
all_samples/filtered_samples
The separate directories with the fastq files are in RAW_DATA and the directories with the filtered files should be in FILTERED_DIR,
When I try to run this, I get an error that looks something like this:
Error in rule filter_reads:
jobid: 30
output: all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
shell:
filtlong --keep_percent 90 --target_bases 300000000 all_samples/raw_samples/cell3_barcode11/cell3_barcode11.fastq all_samples/raw_samples/barcode01/barcode01.fastq all_samples/raw_samples/barcode03/barcode03.fastq all_samples/raw_samples/barcode04/barcode04.fastq all_samples/raw_samples/barcode05/barcode05.fastq all_samples/raw_samples/barcode06/barcode06.fastq all_samples/raw_samples/barcode07/barcode07.fastq all_samples/raw_samples/barcode08/barcode08.fastq all_samples/raw_samples/barcode09/barcode09.fastq all_samples/raw_samples/cell3_barcode01/cell3_barcode01.fastq all_samples/raw_samples/cell3_barcode02/cell3_barcode02.fastq all_samples/raw_samples/cell3_barcode03/cell3_barcode03.fastq all_samples/raw_samples/cell3_barcode04/cell3_barcode04.fastq all_samples/raw_samples/cell3_barcode05/cell3_barcode05.fastq all_samples/raw_samples/cell3_barcode06/cell3_barcode06.fastq all_samples/raw_samples/cell3_barcode07/cell3_barcode07.fastq all_samples/raw_samples/cell3_barcode08/cell3_barcode08.fastq all_samples/raw_samples/cell3_barcode09/cell3_barcode09.fastq all_samples/raw_samples/cell3_barcode10/cell3_barcode10.fastq all_samples/raw_samples/cell3_barcode12/cell3_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode01/cell_18-07-19_barcode01.fastq all_samples/raw_samples/cell_18-07-19_barcode02/cell_18-07-19_barcode02.fastq all_samples/raw_samples/cell_18-07-19_barcode03/cell_18-07-19_barcode03.fastq all_samples/raw_samples/cell_18-07-19_barcode04/cell_18-07-19_barcode04.fastq all_samples/raw_samples/cell_18-07-19_barcode05/cell_18-07-19_barcode05.fastq all_samples/raw_samples/cell_18-07-19_barcode06/cell_18-07-19_barcode06.fastq all_samples/raw_samples/cell_18-07-19_barcode07/cell_18-07-19_barcode07.fastq all_samples/raw_samples/cell_18-07-19_barcode08/cell_18-07-19_barcode08.fastq all_samples/raw_samples/cell_18-07-19_barcode09/cell_18-07-19_barcode09.fastq all_samples/raw_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10.fastq all_samples/raw_samples/cell_18-07-19_barcode11/cell_18-07-19_barcode11.fastq all_samples/raw_samples/cell_18-07-19_barcode12/cell_18-07-19_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode13/cell_18-07-19_barcode13.fastq all_samples/raw_samples/cell_18-07-19_barcode14/cell_18-07-19_barcode14.fastq all_samples/raw_samples/cell_18-07-19_barcode15/cell_18-07-19_barcode15.fastq all_samples/raw_samples/cell_18-07-19_barcode16/cell_18-07-19_barcode16.fastq all_samples/raw_samples/cell_18-07-19_barcode17/cell_18-07-19_barcode17.fastq all_samples/raw_samples/cell_18-07-19_barcode18/cell_18-07-19_barcode18.fastq all_samples/raw_samples/cell_18-07-19_barcode19/cell_18-07-19_barcode19.fastq | gzip > all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
(exited with non-zero exit code)
As far as I can tell, the rule takes all of the fastq files as input for a single Filtlong command, but I don't quite understand why
You shouldn't use the expand function in your input section of the filter_reads rule. What you are doing now is requiring all your samples to be the input of each filtered file: that is what you can observe in your error message.
There is another complication that you introduce out of nothing: you mix both wildcards and variables. In your example the {FILT_DIR} is just a predefined value while the {sample} is a wildcard that Snakemake uses to match the rules. Try the following (pay special attention on single/double brackets and on the formatted string (the one that has the form f"")):
rule filter_reads:
input:
f"{RAW_DATA}/{{sample}}/{{sample}}.fastq"
output:
f"{FILT_DIR}/{{sample}}/{{sample}}_filtered.fastq.gz"
shell:
"filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"

snakemake with prefix as output including a path

How can I make sure in rule all that the output folder was well created?
Should I add each expected result file?
somehow relates to snakemake define folder as output but in my case the specified 'output' is a combination of a path to a dir and a prefix for all results files (they wil be multiple)
the following command creates a folder path Analysis/MosDepth and adds to that path the files:
gt0.mosdepth.global.dist.txt
gt0.mosdepth.region.dist.txt
gt0.per-base.bed.gz
gt0.per-base.bed.gz.csi
gt0.regions.bed.gz
gt0.regions.bed.gz.csi
rule MosDepth:
input:
bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
output:
pfx = "Analysis/MosDepth/gt0"
threads: config["threads"]
shell:
"mosdepth -t {threads} -b {input.bed} {output.pfx} {input.bam}"
I currently have only one of the files in rule all:, is this enough or is there a better way to ensure that the mosdepth has run well and not redo it in a later re-run?
rule all:
input:
"Analysis/MosDepth/gt0.regions.bed.gz"
I would recommend sth like this:
mos_out = ['gt0.mosdepth.global.dist.txt', 'gt0.mosdepth.region.dist.txt', 'gt0.per-base.bed.gz', 'gt0.per-base.bed.gz.csi', 'gt0.regions.bed.gz', 'gt0.regions.bed.gz.csi']
rule MosDepth:
input:
bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
output:
expand("Analysis/MosDepth/{mos_out}", mos_out=mos_out)
params:
pfx = "Analysis/MosDepth/gt0"
threads: config["threads"]
shell:
"mosdepth -t {threads} -b {input.bed} {params.pfx} {input.bam}"
If one of the output files is not created by the rule, snakemake will remove all the output files for you, and throw an error.

Snakemake run rule with wildcarded output of previous rules multiple times

I have multiple studies and I must make two files (a .notsad and .txt file) for each of the n number of studies. After these are created, I must run a command which runs per chromosome and uses the same two input files (.notsad, .txt) for each chromosome within a given study. So:
mycommand.py study1.notsad study1_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
mycommand.py study1.notsad study1_filter.txt chr2.bad.gz --out chr2_filter.bad.gz
...
mycommand.py study2.notsad study2_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
...
However Im having trouble getting this to run. Im getting an error:
WildcardError in line 33 of /scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'ds_lower'
My rules so far:
import os
import glob
ROOT = "/rootdir/"
ORIGINAL_DATA_FOLDER="original/"
PROCESS_DATA_FOLDER="process/"
ORIGINAL_DATA_SOURCE=ROOT+ORIGINAL_DATA_FOLDER
PROCESS_DATA_SOURCE=ROOT+PROCESS_DATA_FOLDER
DATASETS = [name for name in os.listdir(ORIGINAL_DATA_SOURCE) if os.path.isdir(os.path.join(ORIGINAL_DATA_SOURCE, name))]
LOWERCASE_DATASETS = [dataset.lower() for dataset in DATASETS]
CHROMOSOME = list(range(1,23))
rule all:
input:
expand(PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz", ds=DATASETS, chr=CHROMOSOME)
rule run_command:
input:
ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.bad.gz", # Matches 22 chroms
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt", # But this should be common to all chr runs for this study.
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad" # This one as well.
output:
PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
shell:
# Run command that uses each of the previous files and runs per chromosome
"mycommand.py {input.2} {input.1} {input.0} --out {output}"
rule write_txt_file:
input:
ORIGINAL_DATA_SOURCE+"{ds}/{ds_lower}_info.txt"
output:
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt"
shell:
"touch {output}"
rule write_notsad_file:
input:
ORIGINAL_DATA_SOURCE+"{ds}/_{ds_lower}.sad"
output:
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad"
shell:
"touch {output}"
UPDATE
Changing inputs for rule run_command to lambda functions did work.
rule run_command:
input:
ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.gen.gz",
lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}_filter.txt",
lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}.sample"
output:
PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
run:
# Run command that uses each of the previous files and runs per chromosome
"mycommand.py {input.2} {input.1} {input.0} --out {output}"
All the wildcards used in input need to be present in output. In rule run_command, wildcard {ds_lower} is present only in input but not in output.

Snakemake: How to use config file efficiently

I'm using the following config file format in snakemake for a some sequencing analysis practice (I have loads of samples each containing 2 fastq files:
samples:
Sample1_XY:
- fastq_files/SRR4356728_1.fastq.gz
- fastq_files/SRR4356728_2.fastq.gz
Sample2_AB:
- fastq_files/SRR6257171_1.fastq.gz
- fastq_files/SRR6257171_2.fastq.gz
I'm using the following rules at the start of my pipeline to run fastqc and for alignment of the fastqc files:
import os
# read config info into this namespace
configfile: "config.yaml"
rule all:
input:
expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/fastq_multiqc.html"
rule fastqc:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{sample}_fastqc.html",
zip="FastQC/{sample}_fastqc.zip"
params: ""
wrapper:
"0.21.0/bio/fastqc"
rule bowtie2:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
input:
expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
output:
"FastQC/fastq_multiqc.html"
params:
log:
"logs/multiqc.log"
wrapper:
"0.21.0/bio/multiqc"
My issue is with the fastqc rule.
Currently both the fastqc rule and the bowtie2 rule create one output file generated using two inputs SRRXXXXXXX_1.fastq.gz and SRRXXXXXXX_2.fastq.gz.
I need the fastq rule to generate two files, a separate one for each of the fastq.gz files but I'm unsure how to index the config file correctly from the fastqc rule input statement, or how to combine the the expand and wildcards commands to solve this. I can get an individual fastq file by adding [0] or [1] to the end of the input statement, but not both run individually/separately.
I've been messing around trying to get the correct indexing format to access each file separately. The current format is the only one I've managed that allows snakemake -np to generate a job list.
Any tips would be greatly appreciated.
It appears each sample would have two fastq files, and they are named in format ***_1.fastq.gz and ***_2.fastq.gz. In that case, config and code below would work.
config.yaml:
samples:
Sample_A: fastq_files/SRR4356728
Sample_B: fastq_files/SRR6257171
Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("FastQC/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/fastq_multiqc.html"
rule fastqc:
input:
sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{sample}_{num}_fastqc.html",
zip="FastQC/{sample}_{num}_fastqc.zip"
wrapper:
"0.21.0/bio/fastqc"
rule bowtie2:
input:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
"bam_files/{sample}.bam"
wrapper:
"0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
input:
expand("FastQC/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/fastq_multiqc.html"
wrapper:
"0.21.0/bio/multiqc"