I'm working on a bioinformatics pipeline which must be able to run different rules to produce different outputs based on the contents of an input file:
def foo(file):
    '''
    Function will read the file contents and output a boolean value based on its contents
    '''
    # Code to read file here...
    return bool

rule check_input:
    input: "input.txt"
    run:
        bool = foo("input.txt")

rule bool_is_True:
    input: "input.txt"
    output: "out1.txt"
    run:
        # Some code to generate out1.txt. This rule is supposed to run only if foo("input.txt") is true

rule bool_is_False:
    input: "input.txt"
    output: "out2.txt"
    run:
        # Some code to generate out2.txt. This rule is supposed to run only if foo("input.txt") is False
How do I write my rules to handle this situation? Also, how do I write my first rule (rule all) if the output files are unknown before the rule check_input is executed?
Thanks!
You're right, snakemake has to know which files to produce before executing the rules. Therefore, I suggest you use a function which reads what you called "the input file" and define the output of the workflow accordingly.
For example:
def getTargetsFromInput():
    targets = list()
    ## read file and add target files to targets
    return targets

rule all:
    input: getTargetsFromInput()
...
You can define the path of the input file with the --config argument on the snakemake command line, or use a structured input file (YAML, JSON) together with the configfile: keyword in the Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html
Thanks Eric. I got it to work with:
def getTargetsFromInput(file):
    with open(file) as f:
        line = f.readline()
        if line.strip() == "out1":
            return "out1.txt"
        else:
            return "out2.txt"

rule all:
    input: getTargetsFromInput("input.txt")

rule out1:
    input: "input.txt"
    output: "out1.txt"
    run: shell("echo 'out1' > out1.txt")

rule out2:
    input: "input.txt"
    output: "out2.txt"
    run: shell("echo 'out2' > out2.txt")
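The target-selection logic can be checked on its own, outside Snakemake. The following is a minimal plain-Python sketch of the same idea; the names get_targets_from_input and demo are hypothetical helpers introduced here for illustration, and the temporary file stands in for input.txt.

```python
import os
import tempfile


def get_targets_from_input(path):
    """Return the list of target files implied by the first line of `path`."""
    with open(path) as handle:
        first_line = handle.readline().strip()
    # Same branch as in the Snakefile: "out1" selects out1.txt,
    # anything else (including an empty file) selects out2.txt.
    return ["out1.txt"] if first_line == "out1" else ["out2.txt"]


def demo(content):
    """Write `content` to a temporary input file and resolve its targets."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.txt")
        with open(path, "w") as handle:
            handle.write(content)
        return get_targets_from_input(path)


print(demo("out1\n"))
print(demo("something else\n"))
```

Because the function runs while the Snakefile is parsed, input.txt has to exist before snakemake is invoked; it cannot be produced by another rule of the same workflow.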
I am trying to create a pipeline that will take a user-configured directory in config.yml (where they have downloaded a project directory of .fastq.gz files from BaseSpace), to run fastqc on sequence files. I already have the downstream steps of merging the fastqs by lane and running fastqc on the merged files.
However, the wildcards are giving me problems running fastqc on the original basespace files. The following is my error when I try running snakemake.
Missing input files for rule all:
qc/fastqc_premerge/DEX-13_S9_L001_ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip
qc/fastqc_premerge/BOMB-3-2-19D_S8_L002_ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e_r1_fastqc.zip
qc/fastqc_premerge/DEX-13_S9_L002_ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59_r1_fastqc.zip
Any suggestions would be greatly appreciated. Below is minimal code to reproduce this problem.
import glob

configfile: "config.yaml"

wildcard_constraints:
    bsdir = '\w+_L\d+_ds.\w+',
    lanenum = '\d+'

inputdirectory=config["directory"]
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R1_001.fastq.gz")
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R2_001.fastq.gz")

##### target rules #####
rule all:
    input:
        #expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
        expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS) ## Changed to this following a commenter's suggestion; however, snakemake still won't run

rule fastqc_premerge_r1:
    input:
        f"{config['directory']}/{{bsdir}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"
Directory structure:
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R1_001.fastq.gz
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R2_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R1_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R2_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R1_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R2_001.fastq.gz
In the case above, I would like to run fastqc on all 6 input R1/R2 files and then, downstream, create a merged file for DEX-13_S9 (merging its two lanes) and one for BOMB-3-2-19D_S8 (which will simply be a copy of its single input). Finally, I want to create 4 fastqc reports on the resulting merged R1 and R2 files.
EDIT: I had to change the following to get snakemake to run
inputdirectory=config["directory"]
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R1_001.fastq.gz", followlinks=True)
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R2_001.fastq.gz", followlinks=True)

##### target rules #####
rule all:
    input:
        "qc/multiqc_report_premerge.html"

rule fastqc_premerge_r1:
    input:
        f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"

rule fastqc_premerge_r2:
    input:
        f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R2_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"

rule multiqc_pre:
    input:
        expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS),
        expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS)
    output:
        "qc/multiqc_report_premerge.html"
    log:
        "logs/multiqc_premerge.log"
    wrapper:
        "0.62.0/bio/multiqc"
In your rule all you have:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
This should generate all combinations of SAMPLES, DIRECTORY, and LANENUMS. Is this what you want? I suspect not since it means that all samples are in all directories and they are on all lanes. Maybe you want the zip function to expand the list:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
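The difference between the two calls can be illustrated without Snakemake. The sketch below uses plain-Python stand-ins (expand_product and expand_zip are hypothetical helpers, not Snakemake's implementation), and the three lists are an assumption about what glob_wildcards would return for the directory listing in the question.

```python
from itertools import product

# Assumed wildcard values, one entry per matched R1 file in the question.
SAMPLES = ["DEX-13_S9", "BOMB-3-2-19D_S8", "DEX-13_S9"]
LANENUMS = ["001", "002", "002"]
DIRECTORY = [
    "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b",
    "ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e",
    "ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59",
]
PATTERN = "qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip"


def expand_product(pattern, **lists):
    """Fill the pattern with every combination of the given values."""
    keys = list(lists)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*lists.values())
    ]


def expand_zip(pattern, **lists):
    """Fill the pattern with values taken position by position."""
    keys = list(lists)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in zip(*lists.values())
    ]


all_combos = expand_product(PATTERN, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
paired = expand_zip(PATTERN, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)

print(len(all_combos), "paths from the full product")
print(len(paired), "paths when the lists are zipped")
for path in paired:
    print(path)
```

Only the zipped version asks for files that correspond to a real sample/lane/directory triple; the product also requests combinations such as a sample in a directory it was never sequenced in.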
It's telling you which files are missing; those are the lines under "Missing input files for rule all".
That being said, to answer your original question: if you do a dry run (add the flags -n -r to your snakemake command), it should tell you what the input/output files are for each rule that is planned to run.
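One possible explanation for the missing files, which is not confirmed anywhere in this thread, is the bsdir wildcard constraint: \w does not match a hyphen, and the BaseSpace directory names contain one (ngc1838-10, ngc1838-8). The constraint itself can be tested with Python's re module. The relaxed pattern below is a hypothetical alternative for illustration only and has not been tried in the actual workflow.

```python
import re

# Constraint as written in the question.
BSDIR_CONSTRAINT = r"\w+_L\d+_ds.\w+"
# Hypothetical relaxed variant: allow hyphens in the prefix, escape the dot.
RELAXED_CONSTRAINT = r"[\w-]+_L\d+_ds\.\w+"

directories = [
    "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b",
    "ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59",
    "ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e",
]


def matches(pattern, name):
    """True if the whole directory name satisfies the constraint."""
    return re.fullmatch(pattern, name) is not None


original_results = [matches(BSDIR_CONSTRAINT, d) for d in directories]
relaxed_results = [matches(RELAXED_CONSTRAINT, d) for d in directories]

for name, orig_ok, relaxed_ok in zip(directories, original_results, relaxed_results):
    print(name, "original:", orig_ok, "relaxed:", relaxed_ok)
```

This would also be consistent with the EDIT in the question, where the directory name is split into {proj}-{randint} so that the hyphen is part of the literal pattern rather than a wildcard value.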
I have an array xx = [1,2,3] and I want to use Snakemake to create a list of (empty) files 1.txt, 2.txt, 3.txt.
This is the Snakefile I use:
xx = [1,2,3]

rule makefiles:
    output: expand("{f}.txt", f=xx)
    run:
        with open(output, 'w') as file:
            file.write('blank')
However instead of having three new shiny text files in my folder I see an error message:
expected str, bytes or os.PathLike object, not OutputFiles
Not sure what I am doing wrong.
output is a collection of file names, not a single path, so iterate over it and write to each file in turn. See relevant documentation here.
rule makefiles:
    output: expand("{f}.txt", f=xx)
    run:
        for f in output:
            with open(f, 'w') as file:
                file.write('blank')
Rewriting the rule above so that the files can be created in parallel, by defining the target files in rule all and using a wildcard in makefiles:
rule all:
    input:
        expand("{f}.txt", f=xx)

rule makefiles:
    output:
        "{f}.txt"
    run:
        with open(output[0], 'w') as file:
            file.write('blank')
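The error can be reproduced outside Snakemake with any list-like object. In the sketch below, the OutputFiles class is a hypothetical stand-in (a plain list subclass), not Snakemake's own class; it only shows that open() rejects a collection and that iterating over it works.

```python
import os
import tempfile


class OutputFiles(list):
    """Hypothetical stand-in: a list-like container of output paths."""


def write_blank(paths):
    """Write the word 'blank' to every path in the collection."""
    for path in paths:
        with open(path, "w") as handle:
            handle.write("blank")


with tempfile.TemporaryDirectory() as tmp:
    output = OutputFiles(os.path.join(tmp, f"{f}.txt") for f in [1, 2, 3])

    # Passing the whole collection to open() fails, as in the question.
    try:
        open(output, "w")
    except TypeError as err:
        message = str(err)

    # Iterating over the collection writes each file separately.
    write_blank(output)
    created = sorted(os.listdir(tmp))

print(message)
print(created)
```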
I want to download fastq files from the SRA database by SRR ID using Snakemake. I read the SRR IDs from a file with Python code.
I want to pass the IDs one by one as input. My code is below.
The command I want to run is:
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]

rule all:
    input:
        expand("fastq/{sample}.fastq.log",sample=SAMPLES)

rule download_fastq:
    input:
        "{sample}"
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {input} > {output}"
Skip input and just use the wildcard in the shell command. input must be a file path that either already exists or is created by another rule in the pipeline; neither is true in your case.
rule download_fastq:
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {wildcards.sample} > {output}"
I have multiple studies and I must make two files (a .notsad and .txt file) for each of the n number of studies. After these are created, I must run a command which runs per chromosome and uses the same two input files (.notsad, .txt) for each chromosome within a given study. So:
mycommand.py study1.notsad study1_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
mycommand.py study1.notsad study1_filter.txt chr2.bad.gz --out chr2_filter.bad.gz
...
mycommand.py study2.notsad study2_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
...
However, I'm having trouble getting this to run. I'm getting an error:
WildcardError in line 33 of /scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'ds_lower'
My rules so far:
import os
import glob

ROOT = "/rootdir/"
ORIGINAL_DATA_FOLDER="original/"
PROCESS_DATA_FOLDER="process/"
ORIGINAL_DATA_SOURCE=ROOT+ORIGINAL_DATA_FOLDER
PROCESS_DATA_SOURCE=ROOT+PROCESS_DATA_FOLDER
DATASETS = [name for name in os.listdir(ORIGINAL_DATA_SOURCE) if os.path.isdir(os.path.join(ORIGINAL_DATA_SOURCE, name))]
LOWERCASE_DATASETS = [dataset.lower() for dataset in DATASETS]
CHROMOSOME = list(range(1,23))

rule all:
    input:
        expand(PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz", ds=DATASETS, chr=CHROMOSOME)

rule run_command:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.bad.gz", # Matches 22 chroms
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt", # But this should be common to all chr runs for this study.
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad" # This one as well.
    output:
        PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
    shell:
        # Run command that uses each of the previous files and runs per chromosome
        "mycommand.py {input[2]} {input[1]} {input[0]} --out {output}"

rule write_txt_file:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/{ds_lower}_info.txt"
    output:
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt"
    shell:
        "touch {output}"

rule write_notsad_file:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/_{ds_lower}.sad"
    output:
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad"
    shell:
        "touch {output}"
UPDATE
Changing the inputs of rule run_command to lambda functions worked.
rule run_command:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.gen.gz",
        lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}_filter.txt",
        lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}.sample"
    output:
        PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
    shell:
        # Run command that uses each of the previous files and runs per chromosome
        "mycommand.py {input[2]} {input[1]} {input[0]} --out {output}"
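All the wildcards used in input need to be present in output, because Snakemake determines wildcard values by matching the requested output file against the output pattern. In rule run_command, the wildcard {ds_lower} is present only in input but not in output, so its value cannot be inferred.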
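Since the lowercase name is fully determined by {ds}, it can be computed from the wildcards object instead of being a wildcard of its own. The sketch below shows that computation in plain Python; SimpleNamespace is only a stand-in for the wildcards object Snakemake would pass to an input function, and "Study1" is a hypothetical dataset name.

```python
from types import SimpleNamespace

PROCESS_DATA_SOURCE = "/rootdir/process/"


def filter_txt(wildcards):
    """Path of the per-study filter file, derived from the ds wildcard."""
    return f"{PROCESS_DATA_SOURCE}{wildcards.ds}/{wildcards.ds.lower()}_filter.txt"


def notsad(wildcards):
    """Path of the per-study .notsad file, derived from the ds wildcard."""
    return f"{PROCESS_DATA_SOURCE}{wildcards.ds}/{wildcards.ds.lower()}.notsad"


# Stand-in for the wildcards of one job: study "Study1", chromosome 1.
wildcards = SimpleNamespace(ds="Study1", chr="1")

print(filter_txt(wildcards))
print(notsad(wildcards))
```

Every chromosome job of the same study resolves to the same two paths, which is the behaviour the question asks for.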
I'm using the following config file format in snakemake for some sequencing analysis practice (I have loads of samples, each containing 2 fastq files):
samples:
    Sample1_XY:
        - fastq_files/SRR4356728_1.fastq.gz
        - fastq_files/SRR4356728_2.fastq.gz
    Sample2_AB:
        - fastq_files/SRR6257171_1.fastq.gz
        - fastq_files/SRR6257171_2.fastq.gz
I'm using the following rules at the start of my pipeline to run fastqc and to align the fastq files:
import os

# read config info into this namespace
configfile: "config.yaml"

rule all:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"

rule fastqc:
    input:
        sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_fastqc.html",
        zip="FastQC/{sample}_fastqc.zip"
    params: ""
    wrapper:
        "0.21.0/bio/fastqc"

rule bowtie2:
    input:
        sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"], # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.21.0/bio/bowtie2/align"

rule multiqc_fastq:
    input:
        expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
    output:
        "FastQC/fastq_multiqc.html"
    params:
    log:
        "logs/multiqc.log"
    wrapper:
        "0.21.0/bio/multiqc"
My issue is with the fastqc rule.
Currently both the fastqc rule and the bowtie2 rule create one output file generated from two inputs, SRRXXXXXXX_1.fastq.gz and SRRXXXXXXX_2.fastq.gz.
I need the fastqc rule to generate two files, a separate one for each of the fastq.gz files, but I'm unsure how to index the config file correctly from the fastqc rule's input statement, or how to combine the expand and wildcards commands to solve this. I can get an individual fastq file by adding [0] or [1] to the end of the input statement, but I cannot get both to run individually/separately.
I've been experimenting to find the correct indexing format to access each file separately. The current format is the only one I've managed that allows snakemake -np to generate a job list.
Any tips would be greatly appreciated.
It appears each sample has two fastq files, named in the format ***_1.fastq.gz and ***_2.fastq.gz. In that case, the config and code below would work: the config stores only the common file prefix, and a {num} wildcard selects the read file.
config.yaml:
samples:
    Sample_A: fastq_files/SRR4356728
    Sample_B: fastq_files/SRR6257171
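To illustrate how this prefix-style config resolves to concrete files, here is a plain-Python sketch. The config dictionary mirrors the YAML above; fastqc_input and bowtie2_input are hypothetical helper names used only for this illustration, and the code is a sketch of the path construction rather than part of the workflow.

```python
# Mirrors config.yaml: each sample maps to the common prefix of its two files.
config = {
    "samples": {
        "Sample_A": "fastq_files/SRR4356728",
        "Sample_B": "fastq_files/SRR6257171",
    }
}


def fastqc_input(sample, num):
    """Single fastq file for one sample and one read number (1 or 2)."""
    return f"{config['samples'][sample]}_{num}.fastq.gz"


def bowtie2_input(sample):
    """Both fastq files of a sample, as needed for paired-end alignment."""
    return [fastqc_input(sample, num) for num in (1, 2)]


# One FastQC target per sample and read number, one BAM per sample.
fastqc_targets = [
    f"FastQC/{sample}_{num}_fastqc.zip"
    for sample in config["samples"]
    for num in ("1", "2")
]
bam_targets = [f"bam_files/{sample}.bam" for sample in config["samples"]]

print(fastqc_input("Sample_A", "1"))
print(bowtie2_input("Sample_B"))
print(fastqc_targets)
print(bam_targets)
```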
Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])

rule all:
    input:
        expand("FastQC/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"

rule fastqc:
    input:
        sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_{num}_fastqc.html",
        zip="FastQC/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.21.0/bio/fastqc"

rule bowtie2:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "bam_files/{sample}.bam"
    wrapper:
        "0.21.0/bio/bowtie2/align"

rule multiqc_fastq:
    input:
        expand("FastQC/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.21.0/bio/multiqc"